1. Technical Field
The present invention relates generally to an improved data processing system, and in particular, to an improved method and apparatus for caching data in a memory. Specifically, the mechanism of the present invention may be used to improve the setID selection of existing and future cache replacement algorithms, such as non-100% accurate least-recently-used heuristics.
2. Description of Related Art
Most early data processing systems consisted basically of a central processing unit, a main memory, and some sort of secondary input/output (“I/O”) capability. In these earlier systems, the main memory was the limiting element. Over time, logic circuit speeds increased along with the capacity requirements of main memory. With the need for increasing capacity in the main memory, the speed of the main memory could not keep up with the increasing speed of the CPU. Consequently, a gap developed between the main memory and the processor cycle time, which resulted in un-optimized processing speeds. As a result, a cache memory was developed to bridge the gap between the memory and the processor cycle time.
Using a cache to bridge the performance gap between a processor and main memory has become important in data processing systems of various designs, from personal computers to workstations to data processing systems with high performance processors. A cache memory is an auxiliary memory that provides a buffering capability through which a relatively slow main memory can interface with a processor at the processor's cycle time to optimize the performance of the data processing system. Requests are first sent to the cache to determine whether the data or instructions requested are present in the cache memory. A “hit” occurs when the desired information is found in the cache. A “miss” occurs when a request or access to the cache does not produce the desired information. In response to a miss, one of the cache “lines” is replaced with a new one. The method used to select the line to replace is called a replacement policy.
A number of different schemes for organizing a cache memory exist. For example, a fully associative mapping organization may be employed whereby a data address may exist in any location in the cache, or a direct mapping scheme may be employed in a cache memory whereby a data address may exist in only one location in the cache. A set associative scheme may be employed by partitioning the cache into distinct classes of lines, wherein each class contains a small fixed number of lines. This approach is somewhere between a direct mapped and a fully associative cache. The classes of lines are usually referred to as “congruence classes.” In a set associative cache, the lines in a congruence class are usually referred to as sets, and the number of sets indicates the number of locations in which an address can reside. Each set has a setID that identifies its slot within the congruence class.
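The mapping from an address to a congruence class can be illustrated with a minimal sketch. The function names, the 64-byte line size, the 128 congruence classes, and the modulo-based indexing are illustrative assumptions and are not taken from this description.

```python
def split_address(addr: int, line_bytes: int = 64, num_classes: int = 128):
    """Split a byte address into (tag, congruence_class, byte_offset).

    line_bytes and num_classes are hypothetical example values.
    """
    byte_offset = addr % line_bytes
    line_number = addr // line_bytes
    congruence_class = line_number % num_classes
    tag = line_number // num_classes
    return tag, congruence_class, byte_offset


def candidate_slots(addr: int, ways: int = 4,
                    line_bytes: int = 64, num_classes: int = 128):
    """List every (congruence_class, setID) slot in which addr may reside."""
    _, congruence_class, _ = split_address(addr, line_bytes, num_classes)
    return [(congruence_class, set_id) for set_id in range(ways)]


# A 4-way set associative cache offers four slots for this address.
print(candidate_slots(0x12345))
```

Under these assumptions, setting ways to 1 models a direct mapped cache, and setting num_classes to 1 models a fully associative cache.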
One generally used type of replacement policy is the least-recently-used (LRU) policy. An LRU policy is built upon the premise that the least recently used cache line in a congruence class is the least worthy of being retained. So, when it becomes necessary to evict a cache line to make room for a new one, an LRU policy chooses as a victim a cache line which is the least recently accessed set (or member) within a congruence class.
A most-recently-used-update (MRU-update) operation typically occurs due to a cache hit. It adjusts the LRU state such that the “hit” member is ordered ahead of all other members in that congruence class, establishing the cache line in that member position as the most worthy member in the congruence class.
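The LRU victim selection and the MRU-update operation described above can be sketched for a single congruence class as follows. This is a minimal software model of an exact recency ordering, not a hardware implementation; the class and method names are hypothetical.

```python
class TrueLRUClass:
    """Exact recency ordering for the members of one congruence class."""

    def __init__(self, ways: int):
        # order[0] is the least recently used setID,
        # order[-1] is the most recently used setID.
        self.order = list(range(ways))

    def mru_update(self, set_id: int) -> None:
        """Order the hit member ahead of all other members."""
        self.order.remove(set_id)
        self.order.append(set_id)

    def victim(self) -> int:
        """Return the setID of the least recently accessed member."""
        return self.order[0]


lru = TrueLRUClass(ways=4)
lru.mru_update(0)    # hit on member 0
lru.mru_update(2)    # hit on member 2
print(lru.victim())  # member 1 is now the least recently used
```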
Several factors complicate the behavior of LRU replacement policies in multi-level cache hierarchies, particularly when those hierarchies contain nth level caches that are shared by multiple structures at level n−1. For example, a processor may contain a first level instruction cache and a first level data cache. These may be backed by a second level cache that includes both instructions and data. Such a structure is designed so that processor requests for cache lines that miss in the first level caches have a high likelihood of being found in the second level cache.
As described earlier, the LRU replacement policy in the first level caches would update as most-recently-used those cache lines that are used most often by the processor. Cache lines that are less important (or worthy) to the processor, since they are used less often, would be less likely to be marked as most-recently-used. Thus, the more frequently used lines tend to remain in the first level cache, while the less frequently used lines tend to be evicted from the first level cache. When making design choices for an LRU replacement algorithm to implement in a system, simple binary tree algorithms are typically favored over more accurate “true-LRU” algorithms. An example of the binary tree algorithm is described in “Cache Line Replacement Selection using a Logical Multi-Way Tree with Access Order States Maintained at Each Node”, which can be found on the World Wide Web at priorartdatabase-dot-com/IPCOM/000030586, and is hereby incorporated by reference. In contrast with binary tree algorithms, a true-LRU algorithm accurately tracks the accessing of each individual cache line. In this manner, a true-LRU algorithm tells precisely which line is the least recently used. However, the implementation of a true-LRU algorithm has considerable overhead and is not a very realistic approach for N-way set associative caches when N>5. The number of states needed for a true-LRU implementation is also prohibitive from an area/power standpoint.
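The difference in state can be made concrete with a short calculation. The sketch below assumes that a true-LRU scheme must distinguish all N! recency orderings of an N-way congruence class, that a common pairwise-comparison encoding uses one bit per pair of members, and that a binary tree uses one bit per internal node. The function names are hypothetical.

```python
import math


def min_true_lru_bits(ways: int) -> int:
    """Fewest bits able to distinguish all ways! recency orderings."""
    return (math.factorial(ways) - 1).bit_length()


def pairwise_lru_bits(ways: int) -> int:
    """Bits used by a pairwise-comparison encoding (one bit per pair)."""
    return ways * (ways - 1) // 2


def binary_tree_bits(ways: int) -> int:
    """Bits used by a binary tree (one bit per internal node)."""
    return ways - 1


for ways in (2, 4, 8, 16):
    print(ways, min_true_lru_bits(ways),
          pairwise_lru_bits(ways), binary_tree_bits(ways))
```

For an 8-way congruence class, for example, this gives at least 16 bits for true-LRU (28 with the pairwise encoding) against 7 bits for a binary tree.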
Microprocessors that attain the highest frequencies are implemented with deep pipelines and short pipeline stages. In such designs, simple binary tree algorithms are preferred for their simplicity of implementation. Thus, the less accurate binary tree algorithm, which allows for a higher overall frequency, usually provides the best way to maximize overall machine performance.
Using binary tree algorithms is also desirable because the algorithms do not require knowledge of the current state of the LRU bits when establishing a new LRU or MRU candidate. Consequently, the algorithms can be implemented with the simplest form of array structure: a one-port read or write array. The area savings of such a design are beneficial to the overall goals of cost savings (e.g., smaller chip area) and achieving the highest frequencies (e.g., less consumption of critical area on a custom very large-scale integration (VLSI) processor design).
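The write-only nature of the update can be seen in a minimal sketch of a commonly described binary tree pseudo-LRU scheme. This is not necessarily the multi-way tree algorithm of the reference incorporated above; the class name, the bit encoding, and the heap-style node numbering are assumptions made for illustration.

```python
class TreePLRU:
    """Binary tree pseudo-LRU state for one congruence class.

    Assumes ways is a power of two. bits[i] == 0 means the next victim
    lies in the left subtree of internal node i; 1 means the right subtree.
    """

    def __init__(self, ways: int):
        assert ways >= 2 and ways & (ways - 1) == 0
        self.ways = ways
        self.bits = [0] * (ways - 1)  # internal nodes in heap order

    def victim(self) -> int:
        """Follow the tree bits from the root down to a leaf (a setID)."""
        node = 0
        while node < self.ways - 1:
            node = 2 * node + 1 + self.bits[node]
        return node - (self.ways - 1)

    def mru_update(self, set_id: int) -> None:
        """Point every bit on the path away from set_id.

        The values written depend only on set_id, so the current bits
        are never read.
        """
        node = set_id + self.ways - 1
        while node > 0:
            parent = (node - 1) // 2
            came_from_right = node == 2 * parent + 2
            self.bits[parent] = 0 if came_from_right else 1
            node = parent


plru = TreePLRU(ways=4)
for set_id in (0, 1, 2, 3, 0):
    plru.mru_update(set_id)
# The tree selects member 2, although member 1 is the true LRU member.
print(plru.victim())
```

The final line shows the inaccuracy accepted in exchange for the simpler structure: after the access sequence 0, 1, 2, 3, 0, the least recently used member is 1, but the tree selects 2.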
Although the use of the simple array and binary tree LRU replacement algorithms provides many benefits, it also has several drawbacks. One problem encountered using the simple algorithm is that the quality of the LRU slot ID produced by the algorithm may be poor enough to degrade performance through poor cache line replacement choices. Another problem is that the simple LRU array described above cannot be updated on the same cycle as a lookup. Updates are performed at a later time when there is an empty cycle, or when a reload writes its data into the L1 cache. This situation creates a window in which the same setID will be given to multiple cache miss fetch requests to the same congruence class until the first fetch request returns and updates the LRU. Having the same setID assigned to multiple cache miss fetch requests would allow multiple fetches to write into the exact same location. This situation is undesirable because having data written to the same location would corrupt the cache data by having sections of many cache lines overlaid on top of one another.
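The window described above can be reproduced with a small model in which a lookup only reads the replacement state and the state changes only when a reload completes. The model uses an exact recency list for brevity rather than a binary tree; the class and method names are hypothetical.

```python
class DeferredUpdateLRU:
    """Replacement state that is read on a miss and written only on reload."""

    def __init__(self, ways: int):
        self.order = list(range(ways))  # order[0] is the next victim

    def lookup_victim(self) -> int:
        """Read the victim setID; the state is NOT updated in this cycle."""
        return self.order[0]

    def reload_complete(self, set_id: int) -> None:
        """Apply the delayed update once the fetched line is written."""
        self.order.remove(set_id)
        self.order.append(set_id)


state = DeferredUpdateLRU(ways=4)
first = state.lookup_victim()   # miss A to this congruence class
second = state.lookup_victim()  # miss B arrives before A's reload returns
print(first == second)          # both fetches were given the same setID

state.reload_complete(first)    # only now does the state move on
third = state.lookup_victim()   # a later miss receives a different setID
```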
Existing methods that have addressed this problem include single and multiple fetch designs. These methods, however, still have negative impacts on system performance. The single fetch design allows only one outstanding fetch in a particular time period. The multiple fetch design allows multiple outstanding fetches, but not to the same congruence class: it simply blocks a fetch if another fetch is outstanding to the same congruence class.
Therefore, it would be advantageous to have a mechanism that allows an n-way set associative cache to have n L1 miss fetch requests simultaneously in flight regardless of their congruence class. It would further be advantageous to have a hybrid replacement policy that allows for identifying empty slots of a given congruence class, and, if an empty slot is found, giving the empty slot ID a higher priority than the slot selected by the binary tree algorithm.
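The priority rule of such a hybrid policy can be sketched as follows. This is only an illustration of the stated goal under assumptions: the function name, the per-slot valid flags, and the choice of the lowest-numbered empty slot when several are empty are hypothetical.

```python
from typing import Sequence


def select_set_id(valid: Sequence[bool], tree_victim: int) -> int:
    """Choose a replacement setID for one congruence class.

    valid[i] is True when slot i currently holds a line (assumed
    representation). An empty slot takes priority over the slot
    selected by the binary tree algorithm.
    """
    for set_id, holds_line in enumerate(valid):
        if not holds_line:
            return set_id  # assumption: lowest-numbered empty slot wins
    return tree_victim     # no empty slot, fall back to the tree choice


# Slots 1 and 3 are empty, so slot 1 is chosen instead of tree victim 2.
print(select_set_id([True, False, True, False], tree_victim=2))
```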