In a multi-processor computer system, a scheduler subsystem is often employed to schedule threads for execution on the various processors. One major function of the scheduler subsystem is to ensure an even distribution of work among the processors so that one processor is not overloaded while others are idle.
In a modern operating system, such as the HP-UX® operating system by the Hewlett-Packard Company of Palo Alto, Calif., as well as in many modern Unix and Linux operating systems, the scheduler subsystem may include three components: the thread launcher, the thread balancer, and the thread stealer.
With reference to FIG. 1, kernel 102 may include a scheduler subsystem 114 in addition to other subsystems such as virtual memory subsystem 104, I/O subsystem 106, file subsystem 108, networking subsystem 110, and process management subsystem 112. As shown, scheduler subsystem 114 includes three components: a thread launcher 120, a thread balancer 122, and a thread stealer 124. These three components are coupled to a thread dispatcher 138, which is responsible for placing threads onto the processors' per-processor run queues, as will be discussed herein.
Thread launcher 120 represents the mechanism for launching a thread on a designated processor, e.g., when the thread is started or when the thread is restarted after having been blocked and put on a per-processor run queue (PPRQ). As is known, a per-processor run queue (PPRQ) is a priority-based queue associated with a processor. FIG. 1 shows four example PPRQs 126a, 126b, 126c, and 126d corresponding to CPUs 128a, 128b, 128c, and 128d as shown.
In the PPRQ, threads are queued up for execution by the associated processor according to the priority value of each thread. In an implementation, for example, threads are put into a priority band in the PPRQ, with threads in the same priority band being queued up on a first-come, first-served basis. For each PPRQ, the kernel then schedules the threads therein for execution based on the priority band value.
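By way of illustration only, the priority-band queueing described above may be sketched as follows. The class name, the number of bands, and the convention that band 0 is the highest priority are assumptions made for this sketch and are not taken from any particular kernel.

```python
from collections import deque


class PerProcessorRunQueue:
    """Hypothetical PPRQ: one first-come, first-served queue per priority band."""

    def __init__(self, num_bands=4):
        # Assumption: band 0 is the highest priority.
        self.bands = [deque() for _ in range(num_bands)]

    def enqueue(self, thread_id, band):
        # Threads in the same band are queued in arrival order.
        self.bands[band].append(thread_id)

    def dequeue(self):
        # Scan bands from highest to lowest priority and take the oldest thread.
        for band in self.bands:
            if band:
                return band.popleft()
        return None  # the queue is empty, i.e., the CPU would be idle

    def __len__(self):
        return sum(len(band) for band in self.bands)
```

In this sketch, a thread queued in band 0 is dequeued before any thread in band 2, and two threads in band 2 are dequeued in the order in which they arrived.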
To maximize performance, thread launcher 120 typically launches a thread on the least-loaded CPU. That is, thread launcher 120 instructs thread dispatcher 138 to place the thread into the PPRQ of the least-loaded CPU that it identifies. Thus, at least one piece of data calculated by thread launcher 120 relates to the least-loaded CPU ID, as shown by reference number 130.
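As a minimal sketch of the launcher behavior just described, the following assumes that load is measured simply by run queue length and that ties are broken in favor of the lowest CPU ID. Both are assumptions for illustration, as are the function names; an actual launcher may use a different load metric.

```python
def least_loaded_cpu(run_queue_lengths):
    """Return the CPU ID with the shortest run queue (ties go to the lowest ID)."""
    return min(sorted(run_queue_lengths), key=lambda cpu: run_queue_lengths[cpu])


def launch_thread(thread_id, run_queues):
    """Place thread_id on the PPRQ of the least-loaded CPU and return that CPU ID.

    run_queues is a hypothetical mapping of CPU ID to a list of queued threads.
    """
    lengths = {cpu: len(queue) for cpu, queue in run_queues.items()}
    target = least_loaded_cpu(lengths)
    run_queues[target].append(thread_id)
    return target
```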
Thread balancer 122 represents the mechanism for shifting threads among PPRQs of various processors. Typically, thread balancer 122 calculates the most loaded processor and the least loaded processor among the processors, and shifts one or more threads from the most loaded processor to the least loaded processor each time thread balancer 122 executes. Accordingly, at least two pieces of data calculated by thread balancer 122 relate to the most loaded CPU ID 132 and the least loaded CPU ID 134.
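One balancing pass of the kind described above might look like the following sketch. Measuring load by queue length, moving a single thread per pass, and skipping the move when the queues differ by fewer than two threads are all assumptions chosen for illustration.

```python
def balance_once(run_queues):
    """Shift one thread from the most loaded to the least loaded CPU.

    run_queues is a hypothetical mapping of CPU ID to a list of queued threads.
    Returns (thread, most_loaded_id, least_loaded_id), or None if no move is made.
    """
    cpu_ids = sorted(run_queues)
    most = max(cpu_ids, key=lambda cpu: len(run_queues[cpu]))
    least = min(cpu_ids, key=lambda cpu: len(run_queues[cpu]))
    # Assumption: a move is only worthwhile if it reduces the imbalance.
    if len(run_queues[most]) - len(run_queues[least]) < 2:
        return None
    # Takes the most recently queued thread, an arbitrary illustrative choice.
    thread = run_queues[most].pop()
    run_queues[least].append(thread)
    return thread, most, least
```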
Thread stealer 124 represents the mechanism that allows an idle CPU (i.e., one without a thread to be executed in its own PPRQ) to “steal” a thread from another CPU. Thread stealer 124 accomplishes this by calculating the most loaded CPU and shifting a thread from the PPRQ of the most loaded CPU that it identifies to its own PPRQ. Thus, at least one piece of data calculated by thread stealer 124 relates to the most-loaded CPU ID. The thread stealer performs this calculation among the CPUs of the system, whose CPU IDs are kept in a CPU ID list 136.
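The stealing step described above may be sketched as follows, again assuming for illustration that load is measured by queue length. The function name and the representation of the PPRQs as a dictionary of lists are hypothetical.

```python
def steal_for(idle_cpu, run_queues, cpu_id_list):
    """Let an idle CPU take one thread from the most loaded CPU in cpu_id_list.

    Returns (thread, victim_cpu_id), or None if idle_cpu is not actually idle
    or no other CPU has a thread to give up.
    """
    if run_queues[idle_cpu]:
        return None  # not idle: its own PPRQ still holds work
    candidates = [cpu for cpu in cpu_id_list
                  if cpu != idle_cpu and run_queues[cpu]]
    if not candidates:
        return None
    victim = max(candidates, key=lambda cpu: len(run_queues[cpu]))
    # Takes the most recently queued thread, an arbitrary illustrative choice.
    thread = run_queues[victim].pop()
    run_queues[idle_cpu].append(thread)
    return thread, victim
```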
In a typical operating system, thread launcher 120, thread balancer 122, and thread stealer 124 represent independently operating components. Each may execute its own algorithm for calculating the data it needs (e.g., least-loaded CPU ID 130, most-loaded CPU ID 132, least-loaded CPU ID 134, and the most-loaded CPU among the CPUs in CPU ID list 136), and each algorithm may be executed based on data gathered at a different time. Consequently, each component may have a different idea about the state of the CPUs at the time it performs its respective task. For example, thread launcher 120 may gather data at a time t1 and execute its algorithm, which results in the conclusion that the least loaded CPU is CPU 128c. Thread balancer 122 may gather data at a time t2 and execute its algorithm, which results in the conclusion that the least loaded CPU is a different CPU 128a. In this case, each of thread launcher 120 and thread balancer 122 may operate correctly according to its own algorithm. Yet, by failing to coordinate (i.e., by executing their own algorithms and/or gathering system data at different times), they arrive at different calculated values.
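The disagreement described above can be reproduced with a small sketch in which two components apply the same calculation to load data gathered at different times. The load figures below are invented purely for illustration and do not represent measurements from any system.

```python
def least_loaded(snapshot):
    """Return the CPU ID with the smallest load (ties go to the lowest ID)."""
    return min(sorted(snapshot), key=snapshot.get)


# Hypothetical run queue lengths sampled at two different times.
load_at_t1 = {"128a": 3, "128b": 2, "128c": 0, "128d": 1}
load_at_t2 = {"128a": 0, "128b": 2, "128c": 2, "128d": 1}

launcher_view = least_loaded(load_at_t1)  # what the launcher concludes at t1
balancer_view = least_loaded(load_at_t2)  # what the balancer concludes at t2
```

With these assumed figures, the launcher concludes that CPU 128c is the least loaded while the balancer concludes that CPU 128a is, even though both apply the same calculation correctly.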
The risk is increased for an installed OS that has been through a few update cycles. If the algorithm in one of the components (e.g., in thread launcher 120) is updated but there is no corresponding update in another component (e.g., in thread balancer 122), there is a substantial risk that these two components will fail to arrive at the same calculated value for the same scheduling parameter (e.g., the most loaded CPU ID).
The net effect is rather chaotic and unpredictable scheduling by scheduler subsystem 114. For example, it is possible for thread launcher 120 to believe that CPU 128a is the least loaded and therefore to place a thread A on PPRQ 126a associated with CPU 128a for execution. If thread stealer 124 is not coordinating its effort with thread launcher 120, it is possible for thread stealer 124 to believe, based on the data it obtained at some given time and based on its own algorithm, that CPU 128a is the most loaded. Accordingly, as soon as thread A is placed on the PPRQ 126a for execution on CPU 128a, thread stealer 124 immediately steals thread A and places it on PPRQ 126d associated with CPU 128d.
Further, if thread balancer 122 is not coordinating its effort with thread launcher 120 and thread stealer 124, it is possible for thread balancer 122 to believe, based on the data it obtained at some given time and based on its own algorithm, that CPU 128d is the most loaded and CPU 128a is the least loaded. Accordingly, as soon as thread A is placed on the PPRQ 126d for execution on CPU 128d, thread balancer 122 immediately moves thread A from PPRQ 126d back to PPRQ 126a, where it all started.
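The round trip of thread A described above can be traced with the following sketch. The beliefs held by each component are hard-coded to match the example in the text rather than computed, and the function name and data layout are hypothetical.

```python
def simulate_ping_pong():
    """Trace thread A as three uncoordinated components act on their own views."""
    run_queues = {"128a": [], "128b": [], "128c": [], "128d": []}
    migrations = []

    def move(thread, src, dst, component):
        run_queues[src].remove(thread)
        run_queues[dst].append(thread)
        migrations.append((component, src, dst))

    # Launcher's view: CPU 128a is the least loaded, so thread A starts there.
    run_queues["128a"].append("A")

    # Stealer's view: CPU 128a is the most loaded, so idle CPU 128d steals A.
    move("A", "128a", "128d", "stealer")

    # Balancer's view: CPU 128d is the most loaded and CPU 128a the least.
    move("A", "128d", "128a", "balancer")

    return run_queues, migrations
```

In this trace, thread A is migrated twice and ends up on the same run queue on which it was originally launched, without having executed in the meantime.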
During this needless shifting of thread A among the PPRQs, the execution of thread A is needlessly delayed. Further, overhead associated with context switching is borne by the system. Furthermore, such needless shifting of threads among PPRQs may cause cache misses, which result in a waste of memory bandwidth. The effect on the overall performance of the computer system may be quite noticeable.