Since the first computer was created, optimizing the use of its processing power has been a priority, in both the academic and commercial world. Additionally, as multi-user operating systems permitted multiple users to share a single processing system, the question of how most effectively to share the processing resource among multiple users and multiple jobs became critical.
For computers having a single processor, most operating systems utilize some form of time-slicing, or time-sharing. In this scheme, the utilization of the processing resource is divided into small units of time. The operating system then determines the various jobs that wish to use the processing unit. These jobs may include user programs, device drivers, such as print spoolers, and system applications. During each slice or quantum of time, the operating system selects a specific job to be executed. Once that time quantum is completed, the operating system typically saves the current state of the executing job, and selects another job for execution. The process by which the operating system selects a particular job is implementation specific. Some operating systems may employ a round-robin system, whereby each pending job is selected in a predetermined order, with each receiving an approximately equal amount of processing time. Other operating systems may employ a weighted round-robin system, whereby certain jobs are considered to have higher (or lower) priority, and thus receive more (or less) processing time than the average.
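The round-robin and weighted round-robin selection described above can be illustrated with a minimal sketch. The function name, the (work, weight) job model, and the rule that a weight is the number of consecutive quanta granted per turn are all assumptions made for illustration; an actual operating system may implement priorities quite differently.

```python
from collections import deque


def weighted_round_robin(jobs, quanta):
    """Return the order in which jobs receive time quanta.

    jobs maps a job name to (work, weight). work is the number of quanta
    the job still needs; weight is the number of consecutive quanta the
    job may receive per turn (a hypothetical priority model).
    A weight of 1 for every job gives plain round robin.
    """
    queue = deque(jobs)  # predetermined selection order
    remaining = {name: work for name, (work, _) in jobs.items()}
    schedule = []
    while queue and len(schedule) < quanta:
        name = queue.popleft()
        weight = jobs[name][1]
        for _ in range(weight):
            if remaining[name] == 0 or len(schedule) == quanta:
                break
            schedule.append(name)  # the job executes for one quantum
            remaining[name] -= 1
        if remaining[name] > 0:
            queue.append(name)  # state is saved; the job waits for its next turn
    return schedule
```

With equal weights, jobs A, B and C needing 3, 3 and 2 quanta are served in the order A, B, C, A, B, C, A, B. Giving job A a weight of 2 and job B a weight of 1 lets A receive two quanta for every one that B receives.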
In the case of a single processor system, the issues of scheduling jobs and determining priority and runtime are the responsibility of the operating system. Since the prioritization and scheduling is all performed by the operating system, the scheduler is said to be a single-level scheduler.
In multiprocessor systems, the problem of dividing the processors among the various jobs is more complex. A multiprocessor system is any machine or group of machines that has a collection of computational engines. A computational engine might be a physical processor, a virtual processor, a thread or a process. Thus a collection of processors might be a multicore chip, a hyperthreaded machine, a shared memory multiprocessor, a NUMA machine, a grid, a distributed memory machine, a VLIW processor, or a peer-to-peer machine (such as when computers on the internet are donated to a common process), etc. While the time-sharing approach described above is still somewhat applicable, it may not fully account for the parallelism of the system, or of the individual jobs, each of which may be made up of one or more serial tasks. Another method of job scheduling in multiprocessor systems is known as space-sharing. In this mode, rather than dividing the jobs with respect to quanta of processing time, jobs are scheduled on sets of processors. In the simplest scheme, each job is assigned P/J processors on which to execute its tasks, where J is the number of jobs and P is the number of processors. This scheme is called equipartitioning. Note that under this, or other space-sharing algorithms, some of the jobs may receive non-integral allocations. For example, a job may receive 3.5 processors. In this case, the job has 3 full processors and one additional processor for half the time. Therefore, even though the overall scheme is space-sharing, it might have a time-sharing component in it. However, this simple scheme may prove to be unfair, or inefficient. A scheme is defined to be unfair if one or more jobs have fewer processors than they can utilize while other jobs have at least one more processor than this deprived job. A scheme is said to be inefficient if there are idle processors and some jobs have fewer processors than they can use.
In other words, if the division of processors among jobs results in one or more jobs receiving more processors than they can use, and some jobs having fewer processors than they can use, the scheme is inefficient.
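The equipartitioning rule and the two definitions above can be expressed as a short sketch. The function names and the "usable" list (the number of processors each job can actually use) are hypothetical, and the inefficiency test follows the restatement above, treating processors allotted beyond what a job can use as the idle ones.

```python
from fractions import Fraction


def equipartition(num_processors, num_jobs):
    """Give each of J jobs exactly P/J processors (possibly non-integral)."""
    share = Fraction(num_processors, num_jobs)
    return [share] * num_jobs


def is_unfair(allotments, usable):
    """True if a deprived job exists while some job holds at least one
    more processor than that deprived job."""
    for allotted, can_use in zip(allotments, usable):
        if allotted < can_use and any(o >= allotted + 1 for o in allotments):
            return True
    return False


def is_inefficient(allotments, usable):
    """True if some job holds processors it cannot use while another job
    has fewer processors than it can use."""
    pairs = list(zip(allotments, usable))
    has_idle = any(allotted > can_use for allotted, can_use in pairs)
    has_deprived = any(allotted < can_use for allotted, can_use in pairs)
    return has_idle and has_deprived
```

For example, equipartition(7, 2) gives each job 7/2 = 3.5 processors. If the first job can use only 1 processor and the second can use 8, that allocation is inefficient under these definitions, although it is not unfair, since neither job holds a full processor more than the other.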
For multiple processor systems, there are a number of methods that can be used to allocate the processing resources between the various jobs. In one scenario, the operating system determines the partitioning of resources, as well as selecting the tasks of the job that will be run. Thus, similar to single processor systems, this is known as single-level scheduling. However, single-level scheduling has a number of known deficiencies. Specifically, although the operating system is aware of the system configuration and the number of available processors, it is typically unaware of the computational nature of the jobs that it is scheduling. For example, the job's parallelism may vary over time, as shown in FIG. 1. FIG. 1 shows an exemplary diagram of a parallel job, displayed as a directed acyclic graph of tasks. The tasks may be implemented by operating system or user level threads or by fibers or by any other means. A task may be as short as one instruction or may be composed of many instructions. Each of the bubbles represents a task, and the arrows between the bubbles represent interdependencies. As shown in FIG. 1, a job may begin as a single task v1. During its execution, a task may spawn (create) or enable one or more new tasks. For example, tasks v2 and v3 execute after task v1 spawns or enables them. Once enabled, tasks v2 and v3 can execute independently in parallel. Additionally, it is possible that a task, such as v2, also spawns additional tasks, such as v4, v5, and v6. At some later point, a task may require the results of another task before it begins its execution. For example, task v8 requires the results of tasks v5, v6 and v7 before it can execute. The complexity of FIG. 1 illustrates several important properties of parallel jobs. First, job parallelism is a dynamic attribute. Note that originally, the job consisted of only v1, the lone initial task, and its parallelism was just 1.
Later, depending on the allocation of processors and the execution of each task, it is possible that several tasks could be simultaneously executing, and the job might have a high parallelism. Furthermore, there may be interdependencies between the various tasks of a job. For example, task v8 cannot execute and remains blocked until tasks v5, v6 and v7 finish. Typically, the operating system is unaware of the interdependencies and the parallelism of a particular job.
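The way a job's parallelism changes as tasks are enabled can be sketched by walking a dependency graph one step at a time. The sketch below assumes unit-length tasks and an unlimited number of processors, and it models spawning as a dependency on the spawning task. The edges among v1 through v6 and into v8 follow the description above; the edge from v3 to v7 is an assumption made only so the example is complete, since the text does not state which task enables v7.

```python
def parallelism_profile(deps):
    """Return, step by step, the tasks that are ready to run.

    deps maps each task to the set of tasks whose results it requires.
    Assumes every task takes one step and processors are unlimited.
    """
    done, profile = set(), []
    while len(done) < len(deps):
        ready = sorted(task for task, needs in deps.items()
                       if task not in done and needs <= done)
        if not ready:
            raise ValueError("dependency cycle: the graph is not a DAG")
        profile.append(ready)
        done.update(ready)
    return profile


# Hypothetical encoding of the job described above.
job = {
    "v1": set(),
    "v2": {"v1"}, "v3": {"v1"},
    "v4": {"v2"}, "v5": {"v2"}, "v6": {"v2"},
    "v7": {"v3"},                 # assumed edge, not stated in the text
    "v8": {"v5", "v6", "v7"},     # v8 stays blocked until all three finish
}
```

Under these assumptions the number of ready tasks per step is 1, 2, 4 and then 1, which shows why a fixed processor allocation would either starve the job in the third step or sit partly idle in the others.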
To overcome this limitation, some embodiments separate the scheduling task into two distinct parts. The first part, preferably performed by the operating system, performs the allocation of processors among the various requesting jobs. This part is referred to as the job scheduler. The second part, preferably performed by the jobs themselves, performs the determination of parallelism and the selection of which tasks are to be executed on the allocated processors. This part is referred to as the task scheduler. This configuration, utilizing two distinct components, is known as a two-level scheduler.
Schedulers can be defined as static or dynamic. A static scheduler allocates a particular number of processors to a job, and that allocation remains unchanged until the job has completed. For jobs whose parallelism is unknown in advance or may change during execution, this approach may waste processor cycles. If the job has more processors than it can effectively use, these processors will be underutilized. More advanced schedulers, or dynamic schedulers, are able to modify the allocation of processors to a particular job during the execution of that job. Furthermore, allocations may be changed based on factors external to the particular job. For example, as new jobs enter the system, processors may be re-allocated away from an existing job to allow the best overall system efficiency. The frequency with which the allocation of processors is adjusted is implementation specific. Typically, the time period must be short enough to effectively adapt to changes in the system performance and job parallelism. However, the time period should be long enough to amortize the overhead associated with the re-allocation process.
The job scheduler and the task scheduler interact to schedule the jobs on the processors. The task scheduler makes its requirements known to the job scheduler, which then balances all of the incoming requirements to determine its allocation. The job scheduler then informs the task schedulers of their allocation for the next time quantum. This allocation is typically equal to or less than the task scheduler's requirements.
One common algorithm used by job schedulers to allocate the available processors is known as dynamic equipartitioning. It is a modification of the equipartitioning algorithm described above. This algorithm attempts to maintain an equal allotment of processors between jobs, with the upper constraint that a job is never allocated more processors than it requests. For example, assume a 16-processor system with 4 jobs. With no other information, an equipartitioning algorithm will allocate 4 processors to each job. Suppose one of those jobs requests only a single processor. In that case, that job will receive one processor, and the remaining 15 processors will be divided among the remaining 3 jobs. Thus, each of the other jobs will receive an allocation of 5 processors. Similarly, if two jobs each request 2 processors, the remaining 2 jobs will equally share the remaining 12 processors, receiving 6 processors each. While equipartitioning may be neither fair nor efficient, dynamic equipartitioning can be both fair and efficient.
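The examples above can be reproduced with a minimal sketch of dynamic equipartitioning. The function name and the use of None to mean "no request information" are assumptions made for illustration; the loop repeatedly grants full requests to jobs asking for no more than the current equal share and re-divides what is left among the others.

```python
from fractions import Fraction


def dynamic_equipartition(num_processors, desires):
    """Allot processors equally, never exceeding a job's request.

    desires[j] is the number of processors job j requests, or None when
    the job scheduler has no information about job j (hypothetical
    convention for this sketch).
    """
    allotment = [None] * len(desires)
    pending = set(range(len(desires)))
    free = Fraction(num_processors)
    while pending:
        share = free / len(pending)  # equal share of what is left
        satisfied = [j for j in pending
                     if desires[j] is not None and desires[j] <= share]
        if not satisfied:
            # Every pending job wants more than the share: split equally.
            for j in pending:
                allotment[j] = share
            break
        for j in satisfied:
            allotment[j] = Fraction(desires[j])  # grant the full request
            free -= desires[j]
            pending.remove(j)
    return allotment
```

Calling dynamic_equipartition(16, [1, None, None, None]) yields allotments of 1, 5, 5 and 5, and dynamic_equipartition(16, [2, 2, None, None]) yields 2, 2, 6 and 6, matching the figures in the text. With no requests at all, the result is the plain equipartition of 4 processors per job.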
While dynamic equipartitioning improves the allocation of available processors to the jobs, it requires an estimate of each job's processor requirements. Without such an input, the algorithm will simply divide the number of processors by the number of jobs, allocate the same number of processors to each job, and thus behave like equipartitioning. Thus, to best allocate the processors, the job scheduler must have information concerning each job's processor requirements, also known as the job's desire. In one embodiment, the task scheduler generates the job's desire. Then, each task scheduler supplies the job scheduler with its desire and the job scheduler uses this information to perform the processor allocation. In another embodiment, the job scheduler calculates desire itself.
Using this desire information, the job scheduler performs the allocation process and reallocates the processors to the requesting jobs. In one embodiment, this reallocation process is conducted at regular time intervals, or quanta. In another embodiment, this reallocation process is performed whenever a job encounters an event that modifies its desire. For example, the creation of a new task or the completion of a task would alter the job's desire and thereby invoke a new reallocation process.
Work can be created in a number of ways. First, new jobs can enter the computer system. Additionally, currently executing tasks may spawn new tasks, thereby creating more work. Typically, a job will have one or more ready queues, which are queued entries describing tasks that are ready and available for execution. Preferably, the number of ready queues is matched by the allocation of processors, such that each processor is assigned a ready queue. If the allocation of processors is less than the number of ready queues, one or more ready queues are left dormant or asleep, since there are no available processing resources to attend to them. There are two common methods by which work can be distributed.
The first method is known as work-sharing. In this scenario, a central dispatcher is responsible for distributing the work among the various ready queues. In this model, the dispatcher may query each of the executing processes to determine the state of its ready queue. Based on this feedback, the dispatcher selects the ready queue on which the new task will be placed. This decision may be based on estimated start time, the number of entries on the various ready queues or other parameters. While this model attempts to keep the load between processors balanced, it is not without major disadvantages. The major drawback is that to obtain the required information, the dispatcher is required to communicate with each of the executing processes. In a multiprocessor system, communication between processes should be minimized as much as possible, since it causes not just one, but multiple, processors to suspend the useful work that they were performing.
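A minimal sketch of a work-sharing dispatcher follows. The class name, the shortest-queue placement rule, and the query counter are assumptions chosen for illustration; the text allows other criteria, such as estimated start time. The counter makes the communication cost visible: every dispatch queries every ready queue.

```python
from collections import deque


class WorkSharingDispatcher:
    """Central dispatcher that places each new task on the shortest ready queue."""

    def __init__(self, num_queues):
        self.ready_queues = [deque() for _ in range(num_queues)]
        self.queries = 0  # number of queue-state queries issued so far

    def dispatch(self, task):
        """Query every ready queue, then enqueue the task on the shortest one."""
        lengths = []
        for queue in self.ready_queues:
            lengths.append(len(queue))
            self.queries += 1  # one round of communication per queue
        target = lengths.index(min(lengths))  # ties go to the lowest index
        self.ready_queues[target].append(task)
        return target
```

Dispatching six tasks to three initially empty queues places them on queues 0, 1, 2, 0, 1, 2 and costs 18 queries, so the communication grows with both the number of tasks and the number of queues.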
The second work distribution algorithm is known as work-stealing. As described above, each active processor preferably has a ready queue associated with it. Assume that a particular processor is currently executing a task. As that task spawns new tasks, those newly created tasks are pushed onto the associated ready queue. In another embodiment, the currently executing task is pushed onto the ready queue and the newly created task begins execution. Thus, the creation of a new task adds an entry to the ready queue of that processor. Similarly, when a task completes execution, the next entry on the ready queue is popped off and executed. Thus, termination or completion of a task results in the removal of an entry from the ready queue. A processor will remain busy as long as there are items on its associated ready queue. Since the ready queues are managed in a distributed manner, the creation or removal of tasks does not require communication with a central dispatcher. However, it is possible that a processor executes all of the items listed on its associated ready queue. In this scenario, the idle processor becomes a thief. It selects another processor's ready queue (called the victim) and attempts to obtain work by removing an item (or stealing) from that processor's ready queue. If there is work available, this is a successful steal, and the formerly idle processor begins working on the new task. If there is no work available, this is an unsuccessful steal, and the idle processor selects another victim and repeats the process.
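The work-stealing behavior described above can be sketched in a single-threaded simulation. The class and function names are hypothetical, and two details are assumptions not stated in the text: the owner pushes and pops at one end of its ready queue while a thief removes from the opposite end, and victims are tried in random order. A real implementation would also need synchronization between processors.

```python
import random
from collections import deque


class Worker:
    """A processor together with its associated ready queue."""

    def __init__(self, wid):
        self.wid = wid
        self.ready = deque()

    def spawn(self, task):
        self.ready.append(task)  # a newly created task is pushed locally

    def next_task(self):
        # Pop the most recently pushed task, or None if the queue is empty.
        return self.ready.pop() if self.ready else None


def steal(thief, workers, rng):
    """Try each other worker once, in random order.

    Returns (task, unsuccessful_steals); task is None if every victim
    had an empty ready queue.
    """
    victims = [w for w in workers if w is not thief]
    rng.shuffle(victims)
    unsuccessful = 0
    for victim in victims:
        if victim.ready:
            # Successful steal: take the oldest entry (assumed convention).
            return victim.ready.popleft(), unsuccessful
        unsuccessful += 1  # unsuccessful steal; select another victim
    return None, unsuccessful
```

As a usage example, create three workers with rng = random.Random(0), let the first spawn tasks "a", "b" and "c", and call steal for the second worker: it obtains "a" from the first worker's queue, possibly after one unsuccessful attempt on the empty third worker. The count of unsuccessful steals is one of the quantities a scheduler may try to minimize.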
The above described mechanisms and algorithms provide a basis for designing a scheduler for use with a multiprocessor system. These mechanisms and algorithms can be combined to create a multitude of schedulers, each with its own performance metrics. However, the most important consideration is how these algorithms are employed and modified to achieve optimal performance. A number of parameters can be used to measure the performance of a multiprocessor scheduler. Some examples include total time to completion (as compared to a theoretical minimum), number of wasted processor cycles or quanta, and number of steal cycles. Minimization of these parameters is typically an important goal of any scheduler.