Conventional multiprocessor computer systems include multiple processing devices that are each capable of executing sequences of instructions, called threads, concurrently with other processing devices. The individual processing devices in a conventional multiprocessing computer system may be separate microprocessor chips interconnected via a data bus or other circuitry, or the processing devices may reside as respective “cores” (e.g., processing circuits) on a single “die” (i.e., one physical microchip). If each respective processing device is a separate physical chip, each chip may include instruction processing circuitry as well as an on-board memory or local cache and an associated cache controller. When multiple core processing devices reside on a single chip or die, each processing core may include its own respective instruction processing circuitry allowing concurrent execution of threads with other cores on the same chip, but the cores on the same die may share a common cache of memory. Examples of multiprocessing computer systems are workstations manufactured by Sun Microsystems of Palo Alto, Calif., USA that can contain as many as 256 Scalable Processor Architecture (SPARC) processors. An example of a multiprocessing computer chip containing multiple processing cores on a single processor die that share a common cache is the Intel Pentium-4 line of microprocessors containing Hyper-Threading technology. Pentium-4 and Hyper-Threading are registered U.S. trademarks of Intel Corporation.
A conventional processing device often uses a local cache memory system to store data and other information related to the execution of a particular thread of instructions on that processing device. As the processing device executes the thread of instructions of a software program, a cache controller in the processing device stores data, such as values for variables and/or other execution state information associated with the thread, in the cache for faster access when this information is needed during execution of that thread.
Conventional operating systems that operate within multiprocessing computer systems include a kernel that is capable of scheduling various threads that are ready for execution on the processing devices. Generally, the kernel can provide an execution time slot for a thread to execute on a processing device. When the time slot of an executing thread has expired, or if some other event such as an interrupt or a change in thread priority occurs, the kernel can preempt the executing thread (i.e., remove it from execution on the processing device) and can select and resume execution of another thread on that processing device. The kernel can perform this repetitive scheduling process involving thread selection and execution in a continuous manner for all processing devices in the multiprocessing computer system so that when a thread on one processing device can no longer execute (e.g., because its time slot ended, or because it became blocked awaiting access to shared memory or an input-output device), the kernel can select another thread for execution on that processing device. In this manner, each thread of instructions that is awaiting its chance to execute is scheduled by the kernel to execute on a processing device.
A thread may execute on one processing device for one period of time until thread preemption occurs. When the kernel schedules that thread for execution again, the kernel may execute that thread on a different processing device than the original processing device that formerly executed that thread. This is called thread migration. Thread migration may happen, for example, if at the time of scheduling the thread for re-execution, the original processing device is busy executing another thread but a different processing device is available for execution of a thread.
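The time-slot scheduling and the resulting thread migration can be illustrated with a minimal sketch. It assumes a single shared ready queue, lockstep time slots, and no priorities or blocking; the function name `run_round_robin` and its parameters are hypothetical and are not drawn from any particular kernel.

```python
from collections import deque


def run_round_robin(work, num_devices, time_slot):
    """Simulate time-slot scheduling on several processing devices.

    `work` maps a thread name to its remaining units of work (hypothetical
    model). Returns a list of (tick, device, thread) dispatch records.
    """
    ready = deque(work)        # threads awaiting a chance to execute
    remaining = dict(work)     # work left per thread
    trace = []
    tick = 0
    while ready:
        # Dispatch one ready thread onto each processing device in turn.
        running = []
        for device in range(num_devices):
            if not ready:
                break
            thread = ready.popleft()
            running.append(thread)
            trace.append((tick, device, thread))
        # Each thread runs until its time slot expires or its work is done.
        for thread in running:
            remaining[thread] -= min(time_slot, remaining[thread])
            if remaining[thread] > 0:
                ready.append(thread)   # preempted: back onto the ready queue
        tick += 1
    return trace


# Thread "A" first runs on device 0, is preempted, and resumes on device 1.
for record in run_round_robin({"A": 2, "B": 1, "C": 1}, num_devices=2, time_slot=1):
    print(record)
```

In this example thread "A" is dispatched on device 0 in the first time slot and on device 1 in the second, because device 0 has by then been given to thread "C". The scheduler in this sketch takes no account of where a thread last executed.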
After preemption of a thread from execution on a processing device, the cache memory associated with that processing device continues to maintain or store execution state of the preempted thread until that specific cache memory space is overwritten during execution of another thread (i.e., a thread executing on the same processing device, or on another core that also uses that cache area). Due to the large size of modern cache memory systems associated with conventional processing devices, there may be areas of the cache containing state information from execution of a first thread that are not overwritten during execution of a second thread. In other words, even after a kernel preempts a first thread, causes a processing device to execute a second thread, and later preempts the second thread, portions of the cache associated with that processing device may still contain some or even all of the state information stored during the former execution of the first thread. Modern operating system designers have recognized this fact and have created kernels that provide for “affinity-based” scheduling.
Generally, conventional affinity-based thread scheduling recognizes that there is a good chance that a preempted thread may be able to re-use some execution state information maintained in a cache (i.e., information that was not overwritten during execution of another thread) if that thread is again executed on a processing device associated with that cache. As such, using conventional affinity-based scheduling, a thread that executes on the same processing device as was used for prior execution of that same thread may avoid cache “misses” that would otherwise be incurred to populate the cache with the thread's “working data”.
When a thread resumes execution on a processing device (e.g., a CPU), that thread experiences a “cache reload transient” (CRT) as it builds up its “working set” of data in the cache of the processor that is currently executing that thread. During the CRT time period the thread incurs a higher cache miss rate than it would if the thread had simply remained executing on the CPU (and had not blocked itself or been preempted). If the thread executed recently on that same CPU prior to preemption, then statistically the CRT should be short. Affinity-based scheduling attempts to minimize the CRT penalty. When a thread running on processing device “A” causes a cache miss, the cache line might be in main memory, or alternatively, if the thread ran recently on some other processing device “B,” the cache line might be in a write-back cache attached to processor “B”. In this case, processor “B” can either perform a cache-to-cache transfer of the cache line from processor “B” to processor “A” (if the system is designed to support cache-to-cache transfers), or processor “B” can write the cache line into main memory and processor “A” can stall, waiting for “B”'s write to main memory to complete. Both methods (direct cache-B-to-cache-A transfer, or cache-B-to-memory followed by memory-to-cache-A) involve interconnect traffic that consumes precious shared bandwidth, which can affect the overall throughput of the computer system and the latency for the specific thread that caused the cache miss.
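The effect of the cache reload transient can be illustrated with a toy model. The sketch below assumes a direct-mapped cache in which an address maps to slot `address % num_lines`, and it counts only misses (it does not model write-back, cache-to-cache transfers, or interconnect traffic). The class `DirectMappedCache`, the function `count_misses`, and the addresses used are all hypothetical.

```python
class DirectMappedCache:
    """Toy direct-mapped cache: one address per slot, slot = address % size."""

    def __init__(self, num_lines):
        self.slots = [None] * num_lines

    def access(self, address):
        """Touch one cache line; return True on a hit, False on a miss."""
        index = address % len(self.slots)
        hit = self.slots[index] == address
        self.slots[index] = address   # on a miss, the line is (re)loaded
        return hit


def count_misses(cache, working_set):
    """Touch every address in the working set and count the misses."""
    return sum(1 for address in working_set if not cache.access(address))


first_thread = list(range(6))           # working set of the first thread
second_thread = list(range(100, 104))   # working set of the second thread

cache_a = DirectMappedCache(num_lines=8)   # cache of processing device "A"
cache_b = DirectMappedCache(num_lines=8)   # cache of processing device "B"

count_misses(cache_a, first_thread)    # first thread runs on "A"
count_misses(cache_a, second_thread)   # second thread overwrites part of "A"

misses_same_device = count_misses(cache_a, first_thread)    # resume on "A"
misses_other_device = count_misses(cache_b, first_thread)   # migrate to "B"
print(misses_same_device, misses_other_device)   # prints "2 6"
```

Under these assumptions the second thread overwrites only two of the six slots used by the first thread, so resuming on device "A" costs two misses, while resuming on device "B" costs six because nothing of the first thread's working set is present there.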
In a conventional kernel that uses affinity-based scheduling in a multiprocessing computer system, the kernel thus attempts to restart a preempted thread on a processing device that is associated with the same cache that stored the execution state information for that thread during its former execution. In the case of symmetric multiprocessing computer system designs containing individual processing devices each having its own associated cache (i.e., each processing device is a separate chip), a conventional kernel applies affinity-based scheduling and attempts to restart a preempted thread on that same processing device. In the case of processing devices that are separate cores on a single die that share a common on-board cache, conventional kernels attempt to apply affinity-based scheduling to restart the preempted thread on any processing device core that accesses the same cache as the core that formerly executed that thread. This may be a different core on the same chip.
Note that conventional affinity-based scheduling is a best-efforts approach: the kernel attempts to have a thread that executes using one cache, and that is then preempted, use that same cache again when it resumes execution. By “best-efforts,” what is meant is that the kernel is not bound by the affinity of a thread to a particular cache. Thus, if a preempted thread is ready for execution (e.g., its interrupt completed) but the processing device on which that thread formerly executed is busy executing another thread (e.g., of equal or higher priority than the thread that is now ready for execution), then the kernel will typically override the affinity-based scheduling decision and choose to execute that thread on another available processing device with no thread currently executing.
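The best-efforts placement preference described above can be sketched as a single selection function. This is an illustrative sketch only: the function `pick_device`, the mapping `cache_of`, and the cache labels are hypothetical, and priorities are ignored. The preference order assumed here is the former processing device first, then any idle device sharing the same cache, then any other idle device.

```python
def pick_device(last_device, idle_devices, cache_of):
    """Best-efforts affinity choice for a thread that is ready to run.

    last_device  -- device that formerly executed the thread, or None
    idle_devices -- devices with no thread currently executing
    cache_of     -- hypothetical mapping of device id -> cache id
    Returns the chosen device, or None if no device is idle.
    """
    idle = sorted(idle_devices)
    if not idle:
        return None                      # nothing available; thread must wait
    if last_device is not None:
        if last_device in idle:
            return last_device           # same device, same cache
        preferred_cache = cache_of[last_device]
        for device in idle:
            if cache_of[device] == preferred_cache:
                return device            # different core, shared cache
    return idle[0]                       # override affinity: any idle device


# Devices 0 and 1 share one cache; devices 2 and 3 share another.
cache_of = {0: "cache-a", 1: "cache-a", 2: "cache-b", 3: "cache-b"}

print(pick_device(0, {0, 2}, cache_of))   # prints 0 (former device is idle)
print(pick_device(0, {1, 2}, cache_of))   # prints 1 (shares cache with 0)
print(pick_device(0, {2, 3}, cache_of))   # prints 2 (affinity overridden)
```

The last call shows the override: the thread's former device and its cache sibling are both busy, so the thread is placed on an idle device that uses a different cache rather than being left waiting.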