1. Field of the Invention
The present application relates generally to an improved data processing system and in particular to a method and apparatus for processing data. Still more particularly, the present invention relates to a computer implemented method, apparatus, and computer usable program code for managing data in a cache.
2. Description of the Related Art
A cache is a section of memory used to store frequently used data so that it can be accessed faster than data held in storage locations that take longer to access. Processors typically use caches to reduce the average time required to access memory. When a processor wishes to read or write a location in main memory, the processor first checks whether that memory location is present in the cache. If the memory location is present in the cache, a cache hit has occurred. Otherwise, a cache miss has occurred. On a cache hit, the processor immediately reads or writes the data in the cache line. A cache line is a location in the cache that has a tag containing the index of the data in main memory that is stored in the cache. A cache line is also called a cache block.
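The hit and miss check described above can be illustrated with a minimal sketch. The class names `CacheLine` and `DirectMappedCache`, the direct-mapped organization, and the sizes used are assumptions chosen only for illustration; they are not taken from the application, and the transfer of data from main memory on a miss is not modelled.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    """One cache line (cache block): a valid flag plus the tag of the stored block."""
    valid: bool = False
    tag: int = 0


class DirectMappedCache:
    """Hypothetical direct-mapped cache model, used only to show hit/miss detection."""

    def __init__(self, num_lines: int = 4, line_size: int = 16) -> None:
        self.num_lines = num_lines
        self.line_size = line_size
        self.lines = [CacheLine() for _ in range(num_lines)]

    def lookup(self, address: int) -> bool:
        """Return True on a cache hit, False on a cache miss (the line is then filled)."""
        block = address // self.line_size   # which memory block holds this address
        index = block % self.num_lines      # which cache line that block maps to
        tag = block // self.num_lines       # identifies the block within that line
        line = self.lines[index]
        hit = line.valid and line.tag == tag
        if not hit:
            # Miss: record the new block's tag (the data fetch itself is not modelled).
            line.valid = True
            line.tag = tag
        return hit
```

In this sketch, a first access to an address misses, a second access to any address in the same block hits, and an access to a different block that maps to the same line evicts the earlier block.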
A design problem currently facing processor development is memory latency. In many processor designs, the cycle time for data delivery from main memory to an execution unit can exceed 400 cycles. To mitigate this problem, local level one (L1) and level two (L2) caches are used. Local caches hold subsets of memory and are used to exploit the temporal and spatial locality of data, two common architectural concerns.
Local memory contention and false sharing problems are introduced when operating systems employ techniques such as multitasking and multithreading. Applications running in such environments can cause a cache to thrash. The resulting non-deterministic memory reallocation decreases the efficiency of locality-of-data techniques, such as prefetch and castout.
Applications can be separated into three data pattern types: streaming, locking, and opportunistic. Streaming describes data that is accessed sequentially, perhaps modified, and then never referred to again. Locking describes data, especially associative data, that may be referenced multiple times or after long periods of idle time. Opportunistic describes random data accesses. Allocation and replacement are usually handled by a random, round robin, or least recently used (LRU) algorithm. Software can detect the type of data pattern it is using and should use a resource management algorithm to help hardware minimize memory latencies. Software directed set allocation and replacement methods in a set associative cache create “virtual” operating spaces for each application. In some cases, software can divide an 8-way set associative cache into a combination of 5 ways and 3 ways, 6 ways and 2 ways, or 7 ways and 1 way.
A cache structure is divided into entries (like rows) and ways (like columns). Each entry can have multiple ways. In an 8-way set associative cache, there are 8 ways in each entry. Therefore, data can be stored in any 1 of the 8 ways in an entry. A way is also referred to as a set.
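As a rough illustration of dividing an 8-way set associative cache between two applications, the following sketch represents each partition as a bit mask over the ways. The function names `way_mask`, `partition`, and `allowed_ways`, and the use of bit masks at all, are assumptions made for this example; the text above does not specify how software communicates the division to hardware.

```python
NUM_WAYS = 8  # ways per entry in the 8-way set associative cache


def way_mask(first_way: int, count: int) -> int:
    """Bit mask selecting `count` consecutive ways starting at `first_way`."""
    return ((1 << count) - 1) << first_way


def partition(ways_for_a: int) -> tuple[int, int]:
    """Split the 8 ways into two disjoint masks, e.g. 5 + 3, 6 + 2, or 7 + 1."""
    if not 1 <= ways_for_a < NUM_WAYS:
        raise ValueError("each partition needs at least one way")
    mask_a = way_mask(0, ways_for_a)
    mask_b = way_mask(ways_for_a, NUM_WAYS - ways_for_a)
    return mask_a, mask_b


def allowed_ways(mask: int) -> list[int]:
    """List the way numbers in which a partition may allocate or replace data."""
    return [w for w in range(NUM_WAYS) if (mask >> w) & 1]
```

For instance, `partition(5)` yields one mask covering ways 0 through 4 and another covering ways 5 through 7, so each application allocates and replaces lines only within its own "virtual" operating space.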
Pseudo-LRU (p-LRU) is an approximate replacement policy that keeps track of the order in which lines within a cache congruence class are accessed, so that the least recently accessed line is replaced by new data when there is a cache miss. On each cache access, the p-LRU state is updated so that the last item accessed becomes the most recently used and, if that item was the least recently used, the second least recently used item becomes the least recently used.
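For reference, the exact ordering that p-LRU approximates can be modelled for a single congruence class as follows. This is a sketch of a full LRU, not of p-LRU itself; the class name `LruSet` and the list-based recency order are assumptions made for illustration.

```python
class LruSet:
    """Full-LRU model of one congruence class (hypothetical reference model)."""

    def __init__(self, num_ways: int = 4) -> None:
        # order[0] is the least recently used way, order[-1] the most recently used.
        self.order = list(range(num_ways))
        self.tags = [None] * num_ways

    def access(self, tag) -> tuple[bool, int]:
        """Access `tag`; return (hit, way). On a miss the LRU way is replaced."""
        if tag in self.tags:
            way = self.tags.index(tag)
            hit = True
        else:
            way = self.order[0]        # victim: least recently used way
            self.tags[way] = tag
            hit = False
        # The accessed way becomes most recently used; if it was the LRU way,
        # the former second least recently used way is now the LRU way.
        self.order.remove(way)
        self.order.append(way)
        return hit, way
```

Keeping the complete order, as this model does, is what makes a full LRU costly in hardware.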
A full LRU is very expensive to implement. It requires at least log2(N!) bits per congruence class for an N-way set associative cache (e.g., 5 bits for a 4-way cache). A commonly used compromise is pseudo-LRU. Traditionally, pseudo-LRU is implemented with a binary tree algorithm, which uses only N−1 bits, or 7 bits for an 8-way set associative cache. Each bit represents one interior node of a binary tree whose leaves represent the N sets.
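The bit counts above, and the binary tree algorithm, can be sketched as follows. The names `full_lru_bits`, `tree_plru_bits`, and `TreePlru`, the heap-style numbering of tree nodes, and the convention that a bit value of 0 points to the left subtree are all assumptions made for this example; actual hardware encodings vary.

```python
import math


def full_lru_bits(n: int) -> int:
    """Minimum whole bits to encode every ordering of n ways: ceil(log2(n!))."""
    return math.ceil(math.log2(math.factorial(n)))


def tree_plru_bits(n: int) -> int:
    """Bits used by binary tree pseudo-LRU: one per interior node."""
    return n - 1


class TreePlru:
    """Binary tree pseudo-LRU for one congruence class (power-of-two ways)."""

    def __init__(self, num_ways: int = 8) -> None:
        if num_ways < 2 or num_ways & (num_ways - 1):
            raise ValueError("num_ways must be a power of two, at least 2")
        self.num_ways = num_ways
        # bits[i] == 0: next victim lies in the left subtree of node i; 1: right.
        self.bits = [0] * (num_ways - 1)

    def touch(self, way: int) -> None:
        """Record an access: point every node on the path away from `way`."""
        node, lo, hi = 0, 0, self.num_ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1           # accessed left, so victim is right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0           # accessed right, so victim is left
                node, lo = 2 * node + 2, mid

    def victim(self) -> int:
        """Follow the bits from the root to the way chosen for replacement."""
        node, lo, hi = 0, 0, self.num_ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo
```

With these definitions, `full_lru_bits(4)` is 5 and `tree_plru_bits(8)` is 7, matching the figures above. The approximation is visible in a short trace: after ways 0 through 7 are touched in order and way 0 is touched again, this sketch selects way 4 as the victim, whereas a full LRU would select way 1.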
The goal of pseudo-LRU replacement is to stay as close as possible to the performance of a full LRU process while reducing the amount of space needed. However, when the pseudo-LRU process divides the 8-way set associative cache in an unbalanced manner, into a combination of 5 ways and 3 ways or 6 ways and 2 ways, the pseudo-LRU process achieves only about forty percent of the performance of a full LRU in a consecutive cache miss case. Additionally, the current process achieves only about forty percent of the performance of a full LRU process in cache accesses that combine cache misses with cache hits.