A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The copyright owner does not hereby waive any of its rights to have this patent document maintained in secrecy, including without limitation its rights pursuant to 37 C.F.R. § 1.14.
The present invention relates to the allocation and scheduling of processors in a multiprocessing computer system, and more particularly to a thread-scheduling invention which creates a strong affinity between each thread and the processor which is initially allocated to the thread.
Many computing systems contain a single central processing unit ("CPU" or "processor"), a primary storage unit directly accessible to the processor, and a secondary storage unit for long-term bulk storage of information. The primary storage typically includes random access memory ("RAM") and the secondary storage typically includes a magnetic disk, optical disk, or similar device.
To create more powerful computing systems, these individual architectural components (processors, memories, and disks) have been and are being combined and connected in various ways. A major goal of these alternative architectures is to support parallel processing, that is, processing performed by several processors which are working on different pieces of a given problem at the same time. A parallel processing system is said to be "scalable" if adding processors clearly improves the system's performance.
Some parallel processing architectures are generally termed "multiprocessors" or "multiprocessing systems." Multiprocessors contain at least two processors which communicate with one another through a "shared memory." Shared memory makes a single virtual address space available to multiple processors, allowing each processor to read and write values at address locations that are accessible to all processors and addressed identically by each processor. Each processor in a multiprocessing system may also have a local private memory, known as its "cache," which is not shared with the other processors.
Multiprocessors may be connected to each other and/or to single processor systems in a local area network, wide area network, on the Internet, or by other means. Processors which communicate with each other but do not have a common shared memory form a "multicomputing system." Thus, a local area network is one of many possible types of multicomputing systems. Multiprocessing systems and multicomputing systems are known collectively as "distributed systems."
Multiprocessors may be "bus-based" or "switched." One bus-based multiprocessor is illustrated in FIG. 1. The multiprocessor, which is indicated generally at 10, includes four processors 12, each of which has its own cache 14. The caches communicate through signal lines 15 using MESI or another familiar protocol. The processors 12 communicate with one another through a shared memory unit 16 which is on a common bus 17 with the processors 12. The shared memory unit 16 typically includes a memory bus controller and RAM.
The bus 17 also provides communication between the processors 12 and/or shared memory 16, on the one hand, and a drive 18 capable of reading a medium 19, on the other hand. Typical drives 18 include floppy drives, tape drives, and optical drives. Typical media 19 include magnetic and optical computer-readable media.
To read the value of a word from the memory 16, a particular processor such as CPU 1 puts the memory address of the desired word onto the bus 17 and signals that a read is desired. In response, the memory 16 places the value of the addressed word onto the bus 17, thereby allowing the processor CPU 1 to read the value. Writes to the shared memory 16 are performed in a similar way.
Unfortunately, if shared memory 16 reads and writes are performed only by using this simple approach, then performance of the multiprocessor 10 drops dramatically as additional processors 12 are added. When too many processors 12 are present, the bus 17 cannot transfer information between the processors 12 and the shared memory 16 as rapidly as requested by the processors 12. System performance then drops because some of the system's processors 12 are idle while they wait for access to the shared memory 16.
To reduce the load on the bus 17, copies of the values read or written by a given processor such as CPU 1 may be kept in that processor's cache 14. Each value is stored in the cache 14 with some indication of the address at which that value is kept in the shared memory 16. Addresses corresponding to values stored in the cache 14 are called "cached addresses," while the values stored in the cache 14 are called "cached values." If the address specified in a read request is a cached address, the corresponding cached value is read from the cache 14 and no request is placed on the bus 17.
Although caching may dramatically reduce the load on the bus 17, it also introduces potential inconsistencies. Imagine that processors CPU 1 and CPU 2 each read the word at address A0 from the shared memory 16 and that the value read is zero. Then the cache of CPU 1 and the cache of CPU 2 will each indicate that the value stored at address A0 is zero. Suppose CPU 1 then writes the value one to address A0 of the shared memory 16. Then the cache of CPU 1 and the shared memory 16 will each indicate that the value stored at address A0 is one, while the cache of CPU 2 will still indicate that the value stored at A0 is zero.
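The inconsistency just described can be reproduced with a small simulation. This is only an illustrative sketch: the `Cpu` class, its `bus_reads` counter, and the numeric value chosen for address A0 are assumptions made for the example and do not appear in the figures. Each simulated processor keeps a private cache and, deliberately, no coherence protocol.

```python
class Cpu:
    """A processor with a private cache and no coherence protocol (illustrative only)."""

    def __init__(self, shared_memory):
        self.shared_memory = shared_memory  # dict: address -> value
        self.cache = {}                     # cached address -> cached value
        self.bus_reads = 0                  # reads that had to go over the bus

    def read(self, address):
        if address in self.cache:           # cached address: no bus request
            return self.cache[address]
        self.bus_reads += 1                 # cache miss: fetch over the bus
        value = self.shared_memory[address]
        self.cache[address] = value
        return value

    def write(self, address, value):
        # Updates this processor's cache and the shared memory, but
        # nothing informs the other caches that the value has changed.
        self.cache[address] = value
        self.shared_memory[address] = value


A0 = 0xA0                  # hypothetical numeric value for address A0
memory = {A0: 0}
cpu1, cpu2 = Cpu(memory), Cpu(memory)

cpu1.read(A0)
cpu2.read(A0)              # both caches now hold zero for A0
cpu1.write(A0, 1)          # CPU 1 writes the value one to A0

stale = cpu2.read(A0)      # served from CPU 2's cache, which still holds zero
```

After the write, the shared memory and the cache of CPU 1 hold one, while `stale` is the outdated zero that CPU 2 reads from its own cache without placing any request on the bus.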
Using one or both of two approaches, known as "write-through caches" and "snooping caches," will prevent such inconsistencies on bus-based multiprocessing systems unless the number of processors is too large. If the number of processors grows too large, alternative architectures may be used. One alternative multiprocessing architecture, known as a "crossbar switch," is indicated generally at 20 in FIG. 2. A shared memory is divided into modules 22 which are connectable to processors 24 by signal lines 26. The signal lines 26 may be connected as needed by actuating appropriate crosspoint switches 28.
Another alternative multiprocessing architecture, known as an "omega switching network," is indicated generally at 30 in FIG. 3. Shared memory is again divided into modules 32 which are connectable to processors 34 by signal lines 36. The signal lines 36 may be connected as needed by actuating appropriate 2×2 switches 38. In either the crossbar switch multiprocessor 20 (FIG. 2) or the omega multiprocessor 30 (FIG. 3), some or all of the processors 24, 34 may have a cache similar to the caches 14 in the bus-based multiprocessor 10 (FIG. 1). The multiprocessors 20, 30 may also include a drive such as the drive 18 (FIG. 1) for reading computer-readable media such as the medium 19.
Although its underlying hardware limits the theoretical performance of a multiprocessor, in practice limitations imposed by an "operating system" are more frequently encountered. The operating system is software which (among other duties) controls access to the processors. The presence of multiple processors that are capable in theory of working in parallel on a given computational problem does not, in and of itself, make parallel processing a practical reality. The problem must be broken into appropriate parts, the parts must then be efficiently distributed among the processors, and the results of the separate computations must finally be combined to provide a solution to the problem.
Computational problems may be divided into "tasks," each of which in turn contains one or more "threads." Each task has its own address space; the address spaces of separate tasks are typically disjoint. Tasks often have other components as well, such as global variables, associated files, communication ports, semaphores, and accounting information. Each thread has some executable code and a set of register values. The register values include a program counter value that indicates which point the thread has reached in its progress through the executable code. Threads may also have associated state information such as a function call stack.
A variety of approaches have been tried for allocating processors to tasks and threads. When the processing requirements of a problem are precisely known before computation to solve the problem is performed, deterministic approaches such as certain graph-theoretic algorithms can be used to efficiently allocate processors to threads or tasks which will collectively solve the problem. However, in most cases the information needed by deterministic approaches is not available until after the computations have finished.
Because deterministic approaches are rarely practical, a variety of non-deterministic "heuristic" approaches are used to allocate processors to threads and/or tasks. One centralized approach tries to allocate processors fairly among all waiting users. Under this approach, a user who is not currently using any processors but has been waiting a long time for a processor will always be given the next available processor. The usage information needed to allocate processors fairly is maintained in one central location. To increase the fairness of processor allocation, this approach sometimes stops a thread or task before it has finished using a given processor, saves appropriate state information, and then gives that processor to a different thread or task.
Under many allocation schemes, a given processor may be allocated to a group of threads rather than being dedicated to an individual thread. In such cases, steps must be taken to schedule the use of that processor by the individual threads in the group, since only one thread can run at a time on any particular processor. Deterministic scheduling approaches exist which theoretically optimize efficiency, but which are not practical because they require more information than is typically available.
One heuristic approach to processor scheduling in a multiprocessor system is embodied in the Mach operating system presently under development at Carnegie-Mellon University and elsewhere. Each processor is assigned to exactly one "processor set." Processor sets are then allocated to threads. Each processor set therefore has a set of threads to execute, and steps must be taken to schedule use of the processors by the threads. Goals of Mach scheduling include assigning processor cycles to threads in a fair and efficient way while nonetheless recognizing different thread priorities.
Each thread has a priority level ranging from 0 (highest priority) to 31 (lowest priority). Each processor set has an associated array of global run queues. FIG. 4 illustrates an array 40 of global run queues 42 for a processor set P1. Each run queue 42 contains zero or more threads 44 waiting to use a processor in the processor set. Mach defines similar arrays for each of the other processor sets.
Each global run queue 42 corresponds to a different priority level. When a thread at a given priority is ready to run, it is placed at the end of the corresponding run queue. Threads which are not ready to run are not present on any of the run queues 42. In the example shown, a priority-three run queue 46 contains two priority-three threads 48 that are ready to run, and a priority-eight run queue 50 contains two priority-eight threads 52 which are ready to run. Two other run queues 42 also contain at least one thread 44; the remaining run queues 42 are presently empty.
Each Mach array 40 has three associated variables: an array mutex, a thread count, and a hint. The array mutex (derived from "mutual exclusion") is used to lock the array 40 so that only one processor can access the run queues 42 at a time. The thread count holds the total number of threads 44 currently in the run queues 42 of the array 40. The hint holds a priority level indicating where a Mach scheduler should start looking for the highest priority thread. The highest priority thread will be located either in the run queue for the hint priority level or in a lower priority run queue.
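The array 40 and its three variables can be sketched as a small data structure. This is a minimal illustration written for this discussion, not Mach source code: the class name `RunQueueArray`, its method names, and the string thread labels are assumptions, and the way the hint is updated is one simple policy that satisfies the property stated above.

```python
import threading
from collections import deque

NUM_PRIORITIES = 32  # priority 0 is the highest, 31 the lowest


class RunQueueArray:
    """Illustrative model of an array of run queues with mutex, count, and hint."""

    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIORITIES)]
        self.mutex = threading.Lock()    # array mutex
        self.count = 0                   # total threads in all run queues
        self.hint = NUM_PRIORITIES - 1   # no thread sits above this level

    def enqueue(self, thread, priority):
        with self.mutex:
            self.queues[priority].append(thread)  # end of its run queue
            self.count += 1
            self.hint = min(self.hint, priority)

    def dequeue_highest(self):
        with self.mutex:
            if self.count == 0:          # every run queue is empty
                return None
            # Start at the hint and move toward lower priorities.
            for priority in range(self.hint, NUM_PRIORITIES):
                if self.queues[priority]:
                    self.count -= 1
                    self.hint = priority
                    return self.queues[priority].popleft()
            return None


rq = RunQueueArray()
rq.enqueue("t8a", 8)
rq.enqueue("t3a", 3)
rq.enqueue("t3b", 3)
rq.enqueue("t8b", 8)
order = [rq.dequeue_highest() for _ in range(5)]
```

With two priority-three and two priority-eight threads queued, as in the FIG. 4 example, the priority-three threads are dequeued first in arrival order, then the priority-eight threads, and the final call finds the array empty.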
The global run queues 42 may be used by each of the one or more processors in the processor set. In addition, each individual processor Pn has its own local run queues. The local run queues are similarly arranged in priority levels from zero through thirty-one. Each local run queue for processor Pn holds "bound" threads, namely, threads that are permanently bound to processor Pn. Typical bound threads include device drivers for I/O devices that are physically accessible only to processor Pn. Bound threads are never placed in one of the global run queues 42.
Mach utilizes the run queues 42 and other structures to perform processor scheduling as follows. Each thread 44 is allocated a maximum quantum of time during which it can have continual use of a processor. When a thread 44 blocks or exits voluntarily, or is preempted because it has run continually for one quantum, the scheduler searches certain run queues to locate the next thread 44 that will be given the processor. If a thread 44 is found at any time during this search, the processor is allocated to that thread 44 and the search ends.
The Mach scheduler looks first in the processor's local run queues. If any threads 44 are found, the first thread 44 in the highest priority local run queue is given the processor. The check for threads 44 in the local run queues begins by checking to see whether the local thread count is zero. If it is, the local run queues are all empty. Otherwise, the scheduler uses the local hint value to find the first thread 44 in whichever non-empty local run queue has the highest priority.
If all of the local run queues are empty, then the same steps are repeated to search the global run queues 42 for the processor set that contains the processor. If there are no threads 44 in either the local run queues or the global run queues, and if a non-scheduler thread was not preempted to perform the search, then the scheduler repeats the search, possibly after waiting for some predefined period of time. If a ready-to-run thread 44 is located, that thread 44 is allowed to run for at most one time quantum. Then it is stopped and the whole search process is repeated.
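The search order and the one-quantum limit described above can be illustrated with a simplified, single-processor simulation. It is a sketch under several assumptions: the value of `QUANTUM`, the function names, and the representation of a thread as a `(name, remaining_work)` pair are invented for the example; locking, the hint, and the thread count are omitted; and the loop returns when no thread is ready, where a real scheduler would wait and search again.

```python
from collections import deque

NUM_PRIORITIES = 32
QUANTUM = 10  # hypothetical length of one time quantum, in arbitrary units


def first_ready(run_queues):
    """Pop the first thread of the highest-priority non-empty queue, if any."""
    for priority, queue in enumerate(run_queues):  # index 0 is highest
        if queue:
            return priority, queue.popleft()
    return None


def schedule(local_queues, global_queues):
    """Run threads until all queues are empty; return (name, time_run) pairs."""
    trace = []
    while True:
        source = local_queues
        found = first_ready(local_queues)          # local run queues first
        if found is None:
            source = global_queues
            found = first_ready(global_queues)     # then the global run queues
        if found is None:
            return trace                           # nothing is ready to run
        priority, (name, remaining) = found
        ran = min(QUANTUM, remaining)              # at most one quantum
        trace.append((name, ran))
        if remaining > ran:                        # preempted: requeue at the end
            source[priority].append((name, remaining - ran))


local_queues = [deque() for _ in range(NUM_PRIORITIES)]
global_queues = [deque() for _ in range(NUM_PRIORITIES)]
local_queues[5].append(("driver", 5))              # a bound thread
global_queues[3].append(("a", 15))
global_queues[3].append(("b", 10))
trace = schedule(local_queues, global_queues)
```

In this run the bound thread is served first because the local run queues are searched before the global ones, even though its priority is lower. Thread `a` is then preempted after one quantum and placed behind `b`, so the trace is `driver`, `a`, `b`, `a`.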
Mach regularly decreases the priority of the currently running thread 44. Thus, the longer a thread 44 runs, the less successful it is likely to be when competing with other threads 44 for a processor. However, some threads 44 have a limited ability to temporarily increase their own priority, after which their original (lower) priority is restored. A thread 44 may also name another thread 44 as its successor. If a successor thread 44 is named, the local and global run queues are not searched. Instead, the processor is simply given to the successor for at most one quantum of time.
Mach's approach to scheduling has two major drawbacks. First, Mach continually preempts threads which are doing useful work, sets them to one side, and then spends valuable processor time performing the searches just described. From a user's perspective, the time spent searching is undesirable administrative overhead that decreases the overall performance of the multiprocessing system.
The processor made to do the search is prevented from working on the user's problem during the search. Moreover, the scheduler must lock the global run queues 42 while the search is performed. If other processors in the same processor set try to access the locked global run queues 42, they must wait until the first processor finishes. Thus, the search may reduce the efficiency of several processors even though it seeks a thread to run on just one processor.
The second drawback to Mach's approach is even more destructive of multiprocessor efficiency. Under Mach, threads 44 tend to migrate from one processor to another processor over time. Bound threads (those in local run queues) only run on a particular processor, but load-balancing concerns traditionally limit such bound threads 44 to device drivers and other threads 44 that simply will not run on other processors. Most threads 44 are not bound, but are allowed to run on any available processor in the processor set.
Unfortunately, moving threads 44 between processors may severely degrade system performance because it undercuts the performance gains that would otherwise arise from processor cache usage. With reference to FIGS. 1 and 4, those of skill in the art will appreciate that running a thread 44 on a given processor 12 tends to fill that processor's cache 14 with the data needed by the thread 44. Over time, the thread 44 therefore tends to receive data from the cache 14 rather than the shared memory 16. As discussed, the cache 14 thereby improves performance of the system 10 by reducing the load on the bus 17. Similar performance gains arise when local processor caches are used in other multiprocessing systems, including the systems 20 and 30 shown in FIGS. 2 and 3, respectively.
Moving a thread 44 to a new processor forces the thread 44 to reacquire needed data from the shared memory 16, 22, 32. The data must be reloaded into the processor's cache before the benefits of caching become available again. Indeed, the processor not only acts as though it had no cache during this reloading process, but actually performs worse than similar cache-less processors because of the need to reload the cache.
Thus, it would be an advancement in the art to provide a method and apparatus for thread scheduling which reduces the movement of threads between processors in a multiprocessor.
It would also be an advancement to provide such a method and apparatus which reduces the time during which processors in a multiprocessor are unable to work because thread scheduling is underway.
Such a method and apparatus for multiprocessor scheduling are disclosed and claimed herein.
The present invention provides a method and apparatus for scheduling the execution of a plurality of threads on a plurality of processors in a multiprocessor computer system. One method of the present invention includes associating an unlocked local queue of threads with each of the processors and maintaining a global dispatch queue of threads which are not presently associated with any processor. The unlocked local queue is accessed only by the processor in question and therefore requires no corresponding mutex or other semaphore to maintain its data integrity. Thus, the multiprocessor's operating system asserts significantly fewer locks under the present invention than under different approaches, which provides the multiprocessor with a corresponding performance increase.
The method of the present invention also includes selecting movable threads from the unlocked local queues according to predetermined criteria which tend to restrict the mobility of threads. A thread is moved from its unlocked local queue to the global dispatch queue only if different processors are facing very disparate loads. This creates a strong affinity between processors and threads, which in turn provides the multiprocessor with a performance boost by increasing processor cache usage and decreasing shared memory accesses.
In one embodiment of the present invention, the global dispatch queue is a lockable queue to prevent it from being changed by more than one thread at a time. Moving a selected thread is accomplished by locking the global dispatch queue, by then deleting the selected thread from its unlocked local queue and inserting it in the global dispatch queue, and finally unlocking the global dispatch queue. Locking and unlocking involve obtaining and releasing, respectively, a mutex variable that is associated with the global dispatch queue.
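The lock, delete, insert, and unlock sequence of this embodiment can be sketched as follows. The names `GlobalDispatchQueue` and `move_to_global` are hypothetical, and the sketch assumes that the function is called by the processor that owns the unlocked local queue, since that queue has no lock of its own.

```python
import threading
from collections import deque


class GlobalDispatchQueue:
    """Illustrative lockable queue guarded by a mutex."""

    def __init__(self):
        self.mutex = threading.Lock()
        self.threads = deque()


def move_to_global(thread, unlocked_local_queue, global_queue):
    """Move one selected thread from a local queue to the global dispatch queue.

    Assumed to be called only by the processor that owns unlocked_local_queue.
    """
    with global_queue.mutex:                  # obtain the mutex (lock)
        unlocked_local_queue.remove(thread)   # delete from the local queue
        global_queue.threads.append(thread)   # insert in the global queue
    # Leaving the "with" block releases the mutex (unlock).


local_queue = deque(["t1", "t2", "t3"])
dispatch_queue = GlobalDispatchQueue()
move_to_global("t2", local_queue, dispatch_queue)
```

Only the global dispatch queue is locked. The local queue is modified without any lock, which is safe only under the stated assumption that its owning processor is the sole accessor.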
The selection of movable threads includes identifying a busiest processor among the plurality of processors. Movable threads are selected only from eligible-to-run threads in the unlocked local queue of the busiest processor. One embodiment identifies the busiest processor as that processor which has received the smallest number of sleep requests of any of the processors during a sampling period. Another embodiment identifies the busiest "popular" processor among the plurality of processors. A processor is "popular" when its unlocked local queue contains at least two threads which are eligible to run. The movable threads are then selected only from the eligible threads in the unlocked local queue of the busiest popular processor.
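One way to picture the selection of the busiest popular processor is sketched below. The `Processor` record and the function names are assumptions made for illustration. The sketch also assumes that the sleep-request measure from the first embodiment is reused to rank the popular processors, which is one possible reading rather than something the text states.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    name: str
    sleep_requests: int   # sleep requests received during the sampling period
    eligible_threads: list = field(default_factory=list)


def is_popular(processor):
    """Popular: at least two eligible threads in the unlocked local queue."""
    return len(processor.eligible_threads) >= 2


def busiest(processors):
    """Busiest: the processor with the fewest sleep requests."""
    return min(processors, key=lambda p: p.sleep_requests)


def busiest_popular(processors):
    """Busiest among the popular processors, or None if none is popular."""
    popular = [p for p in processors if is_popular(p)]
    return busiest(popular) if popular else None


processors = [
    Processor("P1", sleep_requests=2, eligible_threads=["a"]),
    Processor("P2", sleep_requests=5, eligible_threads=["b", "c"]),
    Processor("P3", sleep_requests=9, eligible_threads=["d", "e", "f"]),
]
overall = busiest(processors)
donor = busiest_popular(processors)
```

Here P1 has the fewest sleep requests but holds only one eligible thread, so it is not popular. The busiest popular processor is P2, and movable threads would be chosen from its eligible threads.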
Selection of a thread to which control of an available processor will be yielded is accomplished by searching for a suitable thread until one is found and then switching the processor's context to the new thread. One method of the present invention searches, in a predetermined order, at least a portion of the union of the global dispatch queue and the unlocked local queue of the processor to locate at least one eligible thread. Control of the processor is given to an eligible thread found during the search. One embodiment requires that control be yielded to at least one thread that was not found in the global dispatch queue between each pair of instances in which control is yielded to threads found in the global dispatch queue.
According to the present invention, one approach to searching includes checking the global dispatch queue for an eligible thread. If no eligible thread is found in the global dispatch queue, the searching step checks the unlocked local queue of the processor.
A second approach to searching may be used in embodiments of the present invention which associate a lockable local queue of threads with each of the processors. The lockable local queue is used rather than the unlocked local queue when other processors need to bind a thread, such as a device driver, to the given processor. The unlocked local queues are still present; use of the lockable local queues is typically rare. This alternative approach to searching includes checking the lockable local queue of the processor for an eligible thread, checking the global dispatch queue if no eligible thread is found in the lockable local queue, and then checking the unlocked local queue if no eligible thread is found in the global dispatch queue.
Under either approach, searching may also include determining whether checking the global dispatch queue will exceed a predetermined relative frequency for global dispatch queue accesses. The global dispatch queue is checked only if checking will not exceed the predetermined relative frequency for global dispatch queue accesses and, under the second approach, if no eligible thread is found in the lockable local queue.
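The second search approach, combined with a limit on global dispatch queue accesses, might look like the sketch below. The class and attribute names are hypothetical. The frequency limit is modeled in the simplest form suggested by the embodiment described earlier: the global dispatch queue is skipped whenever the previous thread came from it. The sketch further assumes that every queued thread is eligible, that a pass which skips the global queue and finds no local thread still counts as the intervening non-global selection, and that locking of the shared queues is handled elsewhere.

```python
from collections import deque


class ProcessorScheduler:
    """Illustrative per-processor search: lockable local, global, unlocked local."""

    def __init__(self, global_queue):
        self.lockable_local = deque()     # threads bound here by other processors
        self.unlocked_local = deque()     # accessed only by this processor
        self.global_queue = global_queue  # shared; locking omitted in this sketch
        self.last_pick_was_global = False

    def next_thread(self):
        # 1. Lockable local queue.
        if self.lockable_local:
            self.last_pick_was_global = False
            return self.lockable_local.popleft()
        # 2. Global dispatch queue, only if that keeps within the limit.
        if not self.last_pick_was_global and self.global_queue:
            self.last_pick_was_global = True
            return self.global_queue.popleft()
        # 3. Unlocked local queue.
        self.last_pick_was_global = False
        if self.unlocked_local:
            return self.unlocked_local.popleft()
        return None                       # nothing eligible; the processor idles


shared = deque(["g1", "g2"])
scheduler = ProcessorScheduler(shared)
scheduler.lockable_local.append("drv")
scheduler.unlocked_local.extend(["u1", "u2"])
picks = [scheduler.next_thread() for _ in range(6)]
```

The bound thread `drv` runs first. After that, selections alternate between the global dispatch queue and the unlocked local queue, so no two consecutive selections come from the global dispatch queue. Removing step 1 gives the first search approach.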