Conventional time-limited affinity scheduling within a multiprocessor computer system typically involves the scheduling subsystem of the computer system unparking and preferentially dispatching a thread onto the processor on which the thread last ran if less than a predetermined time period has elapsed since the thread last ran (i.e., the thread is still “warm”). This affinity-aware manner of scheduling during the predetermined time period enables the computer system to operate more efficiently since it is likely that a warm thread retains some residual affinity for the processor on which that thread recently ran. Specifically, using time-limited affinity-aware scheduling, the odds are much improved that the code cache, data cache and translation lookaside buffer (i.e., a cache of virtual-to-physical memory translations) still contain information used by the thread when it last ran on that processor. Furthermore, it is likely that the thread will access much of that same data after it resumes execution.
It should be understood that time-limited affinity-aware scheduling reduces the so-called “cache reload transient” penalty. In particular, when a “cold” thread is initially dispatched onto a processor, the cold thread will start to access its own code, data and translations, displacing previous information in the caches and populating the caches and translation lookaside buffers (TLBs) with its own thread-specific information. During this period (the cache reload transient), the thread incurs a significant number of cache misses and translation misses, significantly slowing its execution. If the caches use a write-back policy instead of a write-through policy, the cold thread may have to wait twice on a miss. That is, when the cold thread misses, cached data which is likely associated with another thread will be evicted from the cache and, if modified, written back to memory. Next, new data for the cold thread must be brought into the cache. Accordingly, there are two penalties, namely, the write-back of the cached data and then the subsequent loading of the new data into the cache. This situation would also occur if the thread is unparked after a substantial amount of time has passed (e.g., if the thread becomes cold again due to the expiration of the predetermined time period).
Before presenting an example of time-limited affinity scheduling, a brief discussion of the taxonomy of cache misses will be provided. As a thread executes on a processor, the following can be found in the processor's cache:
    A. Shared read-only executable code (i.e., it is typical for threads to share code but access distinct data).
    B. Shared read-only data (e.g., a table that never changes).
    C. Shared read-write data (e.g., data protected by a lock or mutex mechanism of some kind in order to coordinate access to the data and avoid interference between threads).
    D. Private thread-specific read-write data (e.g., a thread's own stack).
    E. Residual data and code installed in the cache by the execution of some prior thread, but that has not yet been displaced by the current thread.
By convention, no locks or mutex mechanisms are needed to access (A), (B) and (D).
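By way of illustration only, categories (A) through (D) can be sketched in JAVA code as follows. The class name, field names and values are hypothetical and do not appear in the description above; the sketch merely shows which kinds of data conventionally require a lock and which do not.

```java
import java.util.concurrent.locks.ReentrantLock;

public class CacheContentKinds {
    // (B) Shared read-only data: a table that is never modified after
    // initialization, so threads may read it without a lock.
    static final int[] SQUARES = {0, 1, 4, 9, 16};

    // (C) Shared read-write data: every access is guarded by the lock.
    static final ReentrantLock LOCK = new ReentrantLock();
    static long sharedTotal = 0;

    // (A) Shared read-only executable code: every thread runs this same method.
    static long work(int rounds) {
        // (D) Private thread-specific data: a local on the calling thread's stack.
        long local = 0;
        for (int i = 0; i < rounds; i++) {
            local += SQUARES[i % SQUARES.length];
        }
        LOCK.lock();
        try {
            sharedTotal += local;
        } finally {
            LOCK.unlock();
        }
        return local;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread first = new Thread(() -> work(5));
        Thread second = new Thread(() -> work(5));
        first.start();
        second.start();
        first.join();
        second.join();
        System.out.println("sharedTotal = " + sharedTotal);
    }
}
```

Category (E) has no counterpart in source code, since it is whatever a previously running thread happened to leave behind in the cache.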
Furthermore, on a multiprocessor system, copies of shared read-only elements may reside in the caches of multiple processors at the same time. For read-write data, however, in order to provide coherency and a consistent view of memory, only one processor typically can have a “dirty” or modified copy in its caches at a given time. If some other processor needs to modify or read that dirty cache line, the processor with the dirty copy must pass the cache line to the requesting processor by way of the coherency interconnect.
There are two types of cache misses. A cache miss against read-only lines, such as executable code, will be satisfied from memory. A miss against a dirty cache line incurs more latency on most architectures as the system needs to “steal” or migrate the cache line from the processor that owns the cache line (i.e., the processor that has the cache line in modified/dirty state). Both types of misses incur latency (which impacts the missing CPU) and consume memory and coherency interconnect bandwidth, which can impact the throughput of the system as a whole. That bandwidth usually is a fixed resource, and efforts to conserve bandwidth yield improved scalability.
The following is an example of conventional time-limited affinity scheduling in a JAVA®-equipped multiprocessor system. Suppose that the system has threads which share access to a critical section of memory (commonly referred to as a “synchronized block”) using a user-mode mutual exclusion technique. In particular, suppose that a first thread and a second thread coordinate access to the critical section using JAVA monitors provided by a JAVA Virtual Machine. Along these lines, suppose that the first thread is currently running on a first processor and that the second thread is currently blocked after running on a second processor. Furthermore, suppose that the first thread currently owns a lock on the critical section and is in the process of writing data to the critical section, i.e., the first thread owns a monitor and the second thread resides on the entry list of that monitor.
When the first thread finishes its work, the first thread relinquishes the lock on the critical section and wakes the second thread. That is, the first thread exits the monitor and calls a JAVA routine to unpark the second thread (e.g., “unpark( )”). It should be understood that threads “park” themselves and that, once parked, a thread is ineligible to run until it is subsequently unparked by some other thread. Parked threads do not appear on the dispatch queue (“ready queue”) so the system scheduler will never dispatch a parked thread. Unparking a thread makes the thread runnable and places the thread on a ready queue.
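The park/unpark hand-off described above can be sketched, under stated assumptions, using the `LockSupport.park()` and `LockSupport.unpark(Thread)` methods of the standard `java.util.concurrent.locks` package. The class name, the method name `handOff` and the `released` flag are hypothetical, and the flag merely stands in for the monitor being relinquished; this is not the internal monitor implementation of any particular JAVA Virtual Machine.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

public class ParkUnparkSketch {
    // Returns true if the second thread resumed after being unparked.
    static boolean handOff() throws InterruptedException {
        AtomicBoolean released = new AtomicBoolean(false); // stands in for the lock being relinquished
        AtomicBoolean resumed = new AtomicBoolean(false);

        Thread second = new Thread(() -> {
            // park() may return spuriously, so the condition is re-checked in a loop.
            while (!released.get()) {
                LockSupport.park();
            }
            resumed.set(true);
        });
        second.start();

        // The calling thread plays the role of the first thread: it finishes
        // its work, relinquishes the lock, and then wakes the second thread.
        released.set(true);
        LockSupport.unpark(second);

        second.join(5000); // bounded wait so the sketch cannot hang indefinitely
        return resumed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("second thread resumed: " + handOff());
    }
}
```

If `unpark` happens to be called before the second thread parks, the second thread's next call to `park` returns immediately, so the ordering of the two calls does not cause a lost wake-up in this sketch.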
In response to the JAVA procedure call, the scheduling subsystem places the second thread on the ready queue of one of the processors, and eventually picks the second thread from that ready queue for running. When the second thread begins execution, the second thread can attempt to obtain ownership of the monitor in order to access the critical section.
When the scheduling subsystem places the second thread onto the ready queue of one of the processors, the scheduling subsystem looks at state information of the second thread to (i) determine how much time has elapsed since the second thread last ran, and (ii) identify the processor that last ran the second thread. If less than the predetermined time period has elapsed (i.e., if the second thread is still warm), the scheduling subsystem performs affinity-aware scheduling by moving the second thread onto the ready queue of the processor that last ran the second thread, i.e., the second processor. Accordingly, some of the cached executable thread code for the second thread may still reside in the cache thus alleviating the need to re-cache that executable thread code.
However, if more than the predetermined time period has elapsed (i.e., if the second thread is now cold), the scheduling subsystem disregards affinity (i.e., the system operates in an affinity-unaware manner) by moving the second thread onto the ready queue of the least-utilized processor which may or may not be the second processor. Here, it is unlikely that any executable thread code for the second thread remains cached. Accordingly, the multiprocessor system places a higher value on operating in a load balanced manner vis-à-vis an affinity-aware manner, and thus schedules the second thread on the processor which is least busy.
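The warm/cold placement decision described above can be summarized by the following minimal sketch. The class name, method name, the 3 millisecond threshold and the use of ready-queue length as the measure of utilization are all assumptions made for illustration and are not taken from any particular scheduler. The description above does not say how a thread is treated when the elapsed time exactly equals the predetermined time period; the sketch arbitrarily treats that case as cold.

```java
public class AffinitySchedulerSketch {
    // Hypothetical "predetermined time period" during which a thread counts as warm.
    static final long WARM_WINDOW_NANOS = 3_000_000L;

    // Chooses the processor whose ready queue should receive an unparked thread.
    static int chooseProcessor(long nowNanos, long lastRanNanos,
                               int lastProcessor, int[] readyQueueLengths) {
        boolean warm = (nowNanos - lastRanNanos) < WARM_WINDOW_NANOS;
        if (warm) {
            // Affinity-aware: caches and TLB on lastProcessor may still hold the thread's state.
            return lastProcessor;
        }
        // Affinity-unaware: pick the least-utilized processor
        // (assumed here to be the one with the shortest ready queue).
        int least = 0;
        for (int p = 1; p < readyQueueLengths.length; p++) {
            if (readyQueueLengths[p] < readyQueueLengths[least]) {
                least = p;
            }
        }
        return least;
    }

    public static void main(String[] args) {
        int[] queues = {0, 5}; // processor 0 is idle, processor 1 is busy
        // Thread last ran on processor 1 one millisecond ago: still warm.
        System.out.println(chooseProcessor(1_000_000L, 0L, 1, queues));
        // Thread last ran on processor 1 ten milliseconds ago: now cold.
        System.out.println(chooseProcessor(10_000_000L, 0L, 1, queues));
    }
}
```

In the first call the warm thread is returned to the busy second processor despite the idle first processor, whereas in the second call the cold thread is placed on the idle processor, mirroring the preference for load balancing over affinity once the predetermined time period has expired.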