1. Technical Field
The present invention relates generally to computer processing systems and, in particular, to a hardware-assisted method for scheduling threads using data cache locality. The method uses hardware primitives to facilitate the scheduling process, resulting in the exploitation of thread reference locality and improved performance.
2. Background Description
In a multithreaded operating system, there is a fundamental problem associated with scheduling runnable threads so as to maximize the throughput of the system. At the speeds at which current CPUs run, the performance bottleneck in executing programs is direct access to memory.
FIG. 1 is a block diagram illustrating an n-way set-associative L2 cache, according to the prior art. A request for a memory address comes in on the bus and is stored in the memory address buffer. A portion of the address is used as a tag, which is hashed simultaneously in each set. In an n-way cache, at most one row in one set will have the required data. If the tag is found, it is a cache hit; if the tag is not found, it is a cache miss. On a cache hit, the index portion of the address is used to get an offset into the cached data, and the data at that point is returned to the CPU. The element designated “V” in FIG. 1 is the valid bit. The valid bit is set if the associated data is valid; otherwise, the valid bit is reset. The element designated “DATA” in FIG. 1 is the cache line. The valid bit is associated with the cache line; thus, the cache line may hold valid or invalid data. Accordingly, the hit line and the valid bit are ANDed together to release the data (cache line). A number of events may set or reset the valid bit. First, if the cache is initially empty, all of the valid bits are reset. Each valid bit is then set whenever the associated cache line is placed in the cache, and is reset when the associated line is removed from the cache. The valid bit can also be reset if the associated line is invalidated (e.g., by a cache invalidation operation).
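By way of illustration only, the lookup described above may be sketched in software as follows. This is a minimal model under stated assumptions, not the circuit of FIG. 1: the class and method names, the line size, the row count, and the way the address is split into tag, row, and byte offset are all hypothetical choices made for the example. The sketch uses the conventional split in which a row is selected first and the tag is then compared in every way; each "way" here appears to correspond to what the description above calls a set.

```python
from dataclasses import dataclass
from typing import List, Optional

LINE_SIZE = 64  # bytes per cache line (assumed for illustration)
NUM_ROWS = 4    # rows per way (assumed for illustration)
NUM_WAYS = 2    # the "n" of the n-way cache (assumed for illustration)


@dataclass
class Entry:
    valid: bool = False  # the "V" bit: reset while the cache is empty
    tag: int = 0
    data: bytes = b""    # the cache line ("DATA")


class SetAssociativeCache:
    """Hypothetical software model of an n-way set-associative cache."""

    def __init__(self, ways: int = NUM_WAYS, rows: int = NUM_ROWS,
                 line_size: int = LINE_SIZE) -> None:
        self.rows = rows
        self.line_size = line_size
        # All valid bits start out reset (empty cache).
        self.table: List[List[Entry]] = [
            [Entry() for _ in range(rows)] for _ in range(ways)
        ]

    def split(self, address: int):
        """Split an address into (tag, row, byte offset)."""
        offset = address % self.line_size
        row = (address // self.line_size) % self.rows
        tag = address // (self.line_size * self.rows)
        return tag, row, offset

    def lookup(self, address: int) -> Optional[int]:
        """Return the cached byte on a hit, or None on a miss."""
        tag, row, offset = self.split(address)
        for way in self.table:  # hardware compares all ways at once
            entry = way[row]
            # Tag match ANDed with the valid bit releases the data.
            if entry.valid and entry.tag == tag:
                return entry.data[offset]
        return None

    def fill(self, address: int, line: bytes, way: int = 0) -> None:
        """Place a line in the cache; this sets its valid bit."""
        tag, row, _ = self.split(address)
        self.table[way][row] = Entry(valid=True, tag=tag, data=line)

    def invalidate(self, address: int) -> None:
        """Reset the valid bit of the line holding this address, if any."""
        tag, row, _ = self.split(address)
        for way in self.table:
            entry = way[row]
            if entry.valid and entry.tag == tag:
                entry.valid = False
```

In this model, a lookup on an empty cache misses, a lookup after `fill` hits, and a lookup after `invalidate` misses again even though the tag is still stored, which mirrors the role of the valid bit described above.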
Predictive caching and prefetching have increased cache hit rates to around 98% to 99%, but a cache miss has also become more expensive, usually costing at least several hundred instruction cycles while data is brought from main memory into the L2 cache. Such a stall affects all threads that are bound to that CPU in a multi-processor environment and, in the case of shared caches, all CPUs in the system.
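The effect of these figures can be illustrated with the standard average-access-time relation (hit time plus miss rate times miss penalty). The sketch below is illustrative only: the function name, the 10-cycle hit time, and the 300-cycle miss penalty are assumed values chosen to fall within the ranges stated above, not measurements.

```python
def average_access_cycles(hit_rate: float, hit_cycles: float,
                          miss_penalty_cycles: float) -> float:
    """Average cycles per access: hit time plus the expected miss cost."""
    miss_rate = 1.0 - hit_rate
    return hit_cycles + miss_rate * miss_penalty_cycles


# Assumed figures for illustration only.
HIT_CYCLES = 10.0
MISS_PENALTY_CYCLES = 300.0

for rate in (0.98, 0.99):
    cycles = average_access_cycles(rate, HIT_CYCLES, MISS_PENALTY_CYCLES)
    print(f"hit rate {rate:.0%}: about {cycles:.1f} cycles per access")
```

Under these assumptions, the average cost is about 16 cycles at a 98% hit rate and about 13 cycles at 99%, so even a one-percentage-point change in the hit rate shifts the average access cost appreciably.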
As more and more applications are designed around thread packages, the average number of live threads on a system has also increased. As the number of threads increases, the potential for parallelism increases, but so does the stress on the cache. This impacts the threads that are bound to the CPU of the associated cache. To date, there are few alternatives, other than increasing the size of the cache (which has its own disadvantages), to address this issue.
It is therefore desirable to schedule threads that share the same data on the same CPU. This could improve the performance of multi-threaded applications by reducing the number of likely cache misses. Cache locality has been studied extensively, but not in the context of multi-threaded scheduling algorithms.
With respect to thread scheduling based upon cache locality, existing solutions determine the inter-thread data locality by exploiting hints derived from user annotations and compiler optimizations, by evaluating information collected from hardware performance monitors, or by some combination of these. Exploiting hints derived from user annotations and compiler optimizations is described in the following articles: Bellosa et al., “The Performance Implications of Locality Information Used in Shared-Memory Multiprocessors”, Journal of Parallel and Distributed Computing, Vol. 37, No. 1, pp. 113-21, August 1996; Elder et al., “Thread Scheduling for Cache Locality”, ASPLOS VII, pp. 60-71, October 1996; Sinharoy, B., “Optimized Thread Creation for Processor Multithreading”, The Computer Journal, 40(6), pp. 388-400, 1997; and Nikolopoulos et al., “Efficient Runtime Thread Management for the Nano-Threads Programming Model”, 12th International Parallel Processing Symposium and 9th Symposium on Parallel and Distributed Processing, pp. 183-94, March 1998. Evaluating information collected from hardware performance monitors is described in the following articles: Bellosa, F., “Locality-Information-Based Scheduling in Shared-Memory Multiprocessors”, Workshop on Job Scheduling Strategies for Parallel Processing, IPPS, pp. 271-89, April 1996; and Weissman, B., “Performance Counters and State Sharing Annotations: a Unified Approach to Thread Locality”, ASPLOS VIII, pp. 127-38, October 1998.
Accordingly, it would be desirable and highly advantageous to have a methodology for multi-thread scheduling using data cache locality.