1. Technical Field
The present invention relates to data processing and in particular to multi-threaded operations in cache memory of data processing systems. Still more particularly, the present invention relates to a method and system for predicting a way in set-associative caches in a multi-threaded processing environment.
2. Description of the Related Art
Processing systems employing cache memories are well known in the art. A cache is a hardware-managed buffer designed to reduce memory access latency by holding copies of data from memory that are likely to be accessed in the near future. Cache memory systems store the most frequently accessed instructions and data in the faster cache memory to avoid the long memory access latency incurred when needed data/instructions have to be retrieved from lower level memory. Thus, with the utilization of caches, the average memory access time of the overall processing system approaches the access time of the cache.
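The statement that average access time approaches the cache access time can be illustrated with the standard average-memory-access-time estimate. The function name and the latency figures in the usage note are illustrative assumptions and are not taken from this description.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Estimate average access time as hit time plus the expected miss cost.

    hit_time and miss_penalty are in the same unit (for example, cycles);
    miss_rate is the fraction of accesses that miss the cache.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty
```

With an assumed 1-cycle cache and a 100-cycle memory penalty, a miss rate of zero yields exactly the cache access time, and the estimate grows linearly as the miss rate rises.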
In the presence of an associated cache, a device that needs to access memory, such as a processor, first looks in the cache for a copy of data from the desired memory location. If a copy is found, the device uses the copy, thus avoiding the longer latency of accessing memory itself. Caches are used for both data and instructions, and a system may have multiple caches.
Typically, these caches include a bifurcated level one (L1) instruction cache (I-cache) and L1 data cache (D-cache), and a larger L2 cache. Generally speaking, an I-cache is a high speed cache memory provided to temporarily store instructions prior to their dispatch to decode units of the processor. Processor execution involves first retrieving (or fetching) a set of instructions for processing by the execution units. These instructions are initially loaded from memory and stored within the I-cache following an initial processor request for instructions from the memory address within the request.
An I-cache holds a fixed number of cache-line entries, each containing the cached instructions as well as enough information to identify the memory address associated with the instructions and some cache management state information. Because caches map directly to memory locations, cache addresses are typically physical (or real) addresses that mirror their corresponding physical memory addresses. The physical address information of the I-cache is stored within an associated I-directory.
A number of cache-to-memory mapping techniques are utilized, including: (i) fully associative, (ii) direct-mapped, and (iii) set-associative. These techniques differ in the group of the cache-line entries that can be used to store a cache-line with particular address bits in common. Set associative caching, for example, involves a multi-array configuration that comprises two or more directories and two or more associated data arrays (otherwise termed “banks”, “compartments”, or “ways”). Typically, the critical path, i.e., the path requiring the most time to complete, in a set associative cache, is through a directory to a compare circuit that selects the memory way/bank/set of the I-cache from which the requested instructions will be selected. The selection of data from one way over the other way(s) is completed via a MUX select and is referred to as a late select.
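The difference between the three mapping techniques can be sketched as the set of cache-line entries eligible to hold a given line. The function name, the modulo-based set indexing, and the set-major entry numbering are assumptions made for illustration only.

```python
def candidate_entries(line_address, num_entries, ways):
    """Return the entry indices that may hold the line at line_address.

    ways == 1            models a direct-mapped cache,
    ways == num_entries  models a fully associative cache,
    anything in between  models a set-associative cache.
    Entries are numbered set by set (an assumed layout).
    """
    if ways < 1 or num_entries % ways != 0:
        raise ValueError("num_entries must be a positive multiple of ways")
    num_sets = num_entries // ways
    set_index = line_address % num_sets  # address bits the candidates have in common
    return [set_index * ways + way for way in range(ways)]
```

For an assumed 8-entry cache, line address 13 has one candidate entry when direct-mapped, two when two-way set-associative, and all eight when fully associative.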
FIG. 1 illustrates a cache subsystem that comprises a conventional two-way set-associative cache, having two ways or sets. Data having the same lower order bits of the effective address (EA) (which are not translated during address translation and thus utilized to access the I-directory) may be held concurrently in multiple ways of the cache 102. The block (cache line) of instructions is pre-fetched utilizing the lower order bits of the EA and held in latches 107A and 107B until one of the ways is selected. Each way includes a directory, I-Dir0 103A and I-Dir1 103B, respectively, and an array, array0 105A and array1 105B, respectively. Both arrays are addressed by some set of bits, which are normally the lower order address bits.
Set associative cache 102 further comprises additional components, including comparator 109 and MUX 111. Comparator 109 is utilized to compare the real address (i.e., the real page number) found in the address register 117 with the real address within I-Dir 103A and I-Dir 103B during way selection. Comparator 109 then provides an output that is utilized to select the particular way (array0 or array1) from which to load the cache line of instructions.
Generally, the set-associative cache-management method provides one-cycle reads, which involve accessing data from multiple sets (or ways) in parallel before a tag match is determined. Once a tag match is determined, the tag is used to select one of the accessed cache memory locations to be coupled to the processor for the read operation.
Matching the lower order address bits of the EA within the request (i.e., the offset bits, which are not translated and which are utilized to access the I-directory) and the address tag of the array, array0 105A or array1 105B results in the buffering of the corresponding instruction block to be outputted from the particular array. The real page number of real address (RA) from RA register 117 is compared by comparator 109 with the real page number of the RA within the I-directory to determine whether the buffered instruction block is the instruction block being requested. In this manner, the data is either allowed to continue being read or a read miss at the I-cache is signaled. Those skilled in the art are familiar with the structure and operational features of the set associative cache illustrated herein.
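The read path described for FIG. 1 can be modeled in software as a rough sketch: both arrays are read with the untranslated lower-order EA bits, the results are held as in latches 107A and 107B, and a comparison against the real page number performs the late select. The page size, line size, class name, and function names are assumptions chosen for illustration, not values given in this description.

```python
from dataclasses import dataclass, field

PAGE_BITS = 12                           # assumed 4 KB pages
LINE_BITS = 7                            # assumed 128-byte cache lines
NUM_SETS = 1 << (PAGE_BITS - LINE_BITS)  # rows indexed by untranslated EA bits


@dataclass
class Way:
    # directory[i] holds the real page number of the line stored in array[i]
    directory: list = field(default_factory=lambda: [None] * NUM_SETS)
    array: list = field(default_factory=lambda: [None] * NUM_SETS)


def set_index(ea):
    """Select the row using lower-order EA bits that need no translation."""
    return (ea >> LINE_BITS) & (NUM_SETS - 1)


def late_select_read(ways, ea, real_page_number):
    """Read every way, then select the one whose directory entry matches."""
    idx = set_index(ea)
    latches = [way.array[idx] for way in ways]  # models latches 107A/107B
    matches = [way.directory[idx] == real_page_number for way in ways]  # comparator 109
    for latch, match in zip(latches, matches):  # models MUX 111
        if match:
            return latch
    return None  # read miss at the I-cache
```

In this model the array reads do not depend on the translated address, which is why the directory compare, rather than the array access, sits on the critical path.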
Real addresses are required during operation of set-associative cache 102. Since the processor's memory access operations are issued with effective addresses, an effective-to-real address translation or look-up is required at some point during the completion of the way selection. Normally, the address translation involves measurable latency, but the latency is accepted as a necessity to complete standard processor-driven cache-to-memory address mapping. In some systems, address translation pairs (i.e., real and effective address pairs) are stored within an Effective-to-real Address Translation table (ERAT) 115, which is utilized to enable faster access. According to FIG. 1 and, as is known by those skilled in the art, EA 113 is issued to cache subsystem 101, and corresponding RAs are found by looking up the EAs in ERAT 115. The look-up of RA is completed concurrently with the access to the I-Dir. All of the ERAT real addresses are compared to the RAs from I-Dir 103A and I-Dir 103B. If an ERAT RA matches an I-Dir RA and the corresponding EA in the ERAT matches the fetch address, then a “hit” occurs in the I-cache.
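The hit condition stated above (an ERAT RA matches an I-Dir RA, and the EA paired with that RA matches the fetch address) can be sketched as follows. Representing the ERAT as a list of (effective page, real page) pairs and the function name are assumptions for illustration; hardware performs these comparisons in parallel rather than in a loop.

```python
def icache_hit_way(erat, fetch_effective_page, directory_real_pages):
    """Return the way that hits, or None on a miss.

    erat: list of (effective_page, real_page) translation pairs.
    directory_real_pages: real page held by each way's I-directory
    at the row selected by the fetch address.
    """
    for way, dir_real_page in enumerate(directory_real_pages):
        for effective_page, real_page in erat:
            ra_matches = real_page == dir_real_page
            ea_matches = effective_page == fetch_effective_page
            if ra_matches and ea_matches:
                return way
    return None
```

A directory entry whose RA appears in the ERAT under a different EA does not count as a hit, because both comparisons must succeed for the same translation pair.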
Additionally, in some current cache designs, an additional effective address (EA) directory is utilized to provide a prediction mechanism for predicting whether particular instructions being requested are likely to be resident in the I-cache. The EA directory contains a truncated portion/subset of each EA corresponding to a particular line of the physical memory that is stored within the I-cache. Because the size of the I-cache is relatively smaller than that of physical memory, finding data in an I-cache can typically be accomplished by checking the lower address bits of the EA, with some restrictions. These restrictions arise because, although the EA Directory must contain a similar lower order address as the requested data for the data to be present in the I-cache, a hit in the EA Directory does not imply that the requested data is actually present in the I-cache. A matching of lower-order EA bits is therefore necessary for a cache hit, but not sufficient on its own to confirm that the data is actually within the I-cache.
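The reason an EA Directory match is necessary but not sufficient can be seen in a small sketch: because only a truncated portion of the EA is stored, two different EAs can produce the same stored value. The tag width, page size, and function names are assumptions for illustration.

```python
PAGE_BITS = 12    # assumed 4 KB pages
EA_TAG_BITS = 10  # assumed width of the truncated EA kept in the EA Directory


def ea_tag(ea):
    """Return the truncated portion of the EA stored in the EA Directory."""
    return (ea >> PAGE_BITS) & ((1 << EA_TAG_BITS) - 1)


def predict_way(ea_directory_row, ea):
    """Return the way whose truncated EA matches, or None if no way matches.

    A returned way is only a prediction; the full address compare must still
    confirm that the requested line is actually in the I-cache.
    """
    tag = ea_tag(ea)
    for way, stored_tag in enumerate(ea_directory_row):
        if stored_tag == tag:
            return way
    return None
```

Under these assumed widths, EAs 0x20000 and 0x420000 share the same truncated value, so a row filled for the first EA also predicts a way for the second even though the second line was never cached.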
One recent improvement in data processing that affects how instructions are cached and retrieved for processing is the implementation of multi-threaded processor operations, including simultaneous multi-threaded processor operations. Program applications executing on the processor are executed as a series of threads. Each thread comprises a sequence of instructions. At any given time, information from multiple threads may exist in various parts of the machine. For example, with two executing threads, the threads appear to the OS as two separate processors. Each of the two threads thus has its own copy of all the normal registers that a program can access and/or modify.
Each thread may be working on the same task, or each thread may be working on a different task. That is, the threads can be components of the same application/program. In some implementations, multiple copies of the same program are executed concurrently, and each copy provides its own set of threads. However, the threads are generally not of the same program.
When two programs are run concurrently, processor resources are better utilized. For example, (1) as one program waits for data from memory, the other program is able to proceed out of the cache, or (2) if both programs are running out of the cache, one program may utilize lots of floating point resources, while the other program utilizes lots of fixed point resources. The net result is better utilization of the processor resources.
Often, the two threads do not share the same EA-to-RA translation. However, because it is common for certain EAs to be utilized and re-utilized, threads of different applications with different RAs are often given the same EAs. For example, the linker may always start at EA 20000 when a program begins loading, irrespective of whether or not another thread (of another program) has been assigned the EA 20000. However, these EAs map to different physical addresses in the physical memory space. Thus, in the multi-threaded environment, different threads from different applications utilizing processor resources may share the same EAs, but because they map to different RAs, the threads cannot be handled in the same manner by the way-select mechanisms of the I-cache, particularly those that include an EA directory and associated prediction features.
When multi-threaded operations are carried out on a data processing system configured with set-associative I-caches and these operations involve concurrently executing applications whose threads share EAs, the conventional way prediction mechanisms referenced above are not always capable of efficiently providing correct way selection for each thread. With the above mechanism, when the lower-order bit selection is utilized, the way selected for a cache line of the first application would also be selected for a second application whose EA maps to a different RA.
Thus, a problem is encountered with current implementations of the way predictor scheme when I-cache arrays contain entries of multiple threads that may share similar EAs. Notably, in some situations, i.e., when both threads are from the same application, it is important that both threads be able to share I-cache entries since both threads have the same translation of effective to real addresses. Within a multiple application environment, however, this sharing of cache lines would cause the I-cache to thrash, i.e., the application gets repeatedly kicked out of the cache, resulting in very slow processing of the application. Both ways are prevented from hitting in the same EA Dir. If there is a hit in the EA Dir and a miss in the cache, the new data is loaded into the same way. Thus, if both threads were utilizing the same EA, the two threads would constantly replace each other instead of one being in one way and the other being in the other way.
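The thrashing behavior can be reproduced with a small single-row simulation, offered as a sketch under stated assumptions. It models the two rules given above: an EA may match in only one way of the EA Dir, and an EA Dir hit that misses in the cache reloads into that same way. The function name and the round-robin choice of way on an EA Dir miss are assumptions.

```python
def count_misses(fetches, num_ways=2):
    """Count I-cache misses for a sequence of (ea, ra) fetches to one row."""
    ea_dir = [None] * num_ways    # EA recorded for each way (prediction)
    real_dir = [None] * num_ways  # RA recorded for each way (confirmation)
    next_victim = 0               # assumed round-robin replacement
    misses = 0
    for ea, ra in fetches:
        if ea in ea_dir:
            way = ea_dir.index(ea)  # EA Dir hit predicts this way
            if real_dir[way] != ra:
                misses += 1         # wrong translation: cache miss
                real_dir[way] = ra  # new data is loaded into the same way
        else:
            misses += 1
            way = next_victim
            next_victim = (next_victim + 1) % num_ways
            ea_dir[way] = ea
            real_dir[way] = ra
    return misses
```

Alternating fetches from two applications that use the same EA with different RAs miss on every access in this model and leave the second way unused, whereas two threads of the same application miss only on the first access.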
One possible method of providing correct multi-threaded way-prediction involves storing the thread ID in the prediction array and then requiring the thread ID to also match in order to select that way. Implementation of this method, however, prevents the sharing of entries between the threads of the same application and this sharing of entries is beneficial to same-application thread processing.
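The drawback of the thread-ID approach can be sketched in the same style. Here each prediction entry also records a thread ID, and a way is selected only when both the thread ID and the EA match. The entry layout, function name, and fill policy (first empty way, otherwise way 0) are assumptions for illustration.

```python
def fetch_with_thread_id(pred, thread_id, ea, ra):
    """Fetch one line; return (hit, way).

    pred is a list with one entry per way, each either None or a tuple
    (thread_id, ea, ra). It is updated in place.
    """
    for way, entry in enumerate(pred):
        if entry is not None and entry[0] == thread_id and entry[1] == ea:
            hit = entry[2] == ra
            pred[way] = (thread_id, ea, ra)  # on a miss, reload into the same way
            return hit, way
    # No entry matches both thread ID and EA: allocate a way (assumed policy).
    way = pred.index(None) if None in pred else 0
    pred[way] = (thread_id, ea, ra)
    return False, way
```

When two threads of the same application fetch the same EA and RA, the second thread misses and loads a duplicate copy into another way in this model, which is the loss of sharing noted above.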
The present invention thus recognizes that it would be desirable to provide a method, system, and cache architecture that enables efficient and accurate way-prediction when threads of different applications share lower order bits of EA but map to different RAs. A set-associative cache architecture that provides a mechanism for identifying which way each thread executing on a processor maps to, in order to enable correct predictive way selection for threads that share EAs that map to different RAs, would be a welcome improvement. The invention further recognizes that it would be beneficial to reduce the latencies involved in the late select process of conventional set-associative caches. These and other benefits are provided by the invention described herein.