The present invention relates to cache memories in computer systems. More specifically, the present invention relates to a cache memory replacement algorithm that determines which cache lines in a cache memory are eligible to be replaced when an associative set of the cache memory is full.
In the art of computing, cache memories are used to store a portion of the memory contents of a main memory that are likely to be used soon. As used herein, the term “cache” will also be used to refer to a cache memory. Caches are typically smaller and faster than main memory, and are used to mask latencies involved in retrieving memory operands from main memory. In modern computer systems, cache access times are typically about 500% to 3000% faster than main memory access times.
An entry of a cache is known in the art as a cache line, and typically a cache line will store a small contiguous range of main memory contents, such as 32 or 64 bytes. While cache memories are not limited to CPUs, a primary application for cache memories is to store memory operands required by one or more central processing units (CPUs). Note that it is known in the art to provide multiple levels of caches. For example, a CPU may be provided with a level one (L1) cache on the same integrated circuit as the CPU, and a larger and slower level two (L2) cache in the same module as the CPU. In the discussion that follows, it will be assumed that memory operands are loaded into a cache from main memory. However, those skilled in the art will recognize that such operands may also be loaded from a higher level cache if the operands are present in the higher level cache.
Since cache memories are typically smaller than the main memories to which they are coupled, a strategy is required to determine which contents of the main memory are to be stored in the cache. This strategy usually comprises two components: a cache organization and a cache replacement algorithm. The replacement algorithm determines which cache line should be replaced when the cache (or an associative set of the cache, as described below) becomes full.
One of the simplest cache organizations is the direct-mapped cache organization. In a direct-mapped cache, a portion of the main memory address is used as an index, and the remainder of the main memory address (not including any bits of the main memory address that represent bytes within a cache line) is used as a tag. The number of bits used for the index corresponds to the size of the cache. For example, a direct-mapped cache having 64 cache lines will have an index comprising six bits. When a read operation occurs and the memory operand is not in the cache (i.e., the tag does not match), the memory operand is fetched from main memory and stored in the cache line corresponding to the index, and the tag is stored in a tag field associated with the cache line. Assuming the memory operand is still in the cache (i.e., the tags match) the next time a read operation occurs, the memory operand will be retrieved from the cache. Incidentally, the term “cache hit” is used in the art to refer to a memory access wherein the required memory operand is already in the cache, and the term “cache miss” is used in the art to refer to a memory access wherein the memory operand is not in the cache and must be loaded from main memory or a higher level cache.
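By way of illustration only, the address split and hit/miss behavior described above may be sketched as follows. The 32-byte line size is an assumption (the text mentions 32 or 64 bytes), and the names split_address and DirectMappedCache are hypothetical and do not appear in the original disclosure.

```python
LINE_BYTES = 32   # assumed cache line size
NUM_LINES = 64    # 64 cache lines, as in the example above, so a six-bit index

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 5 bits select a byte within a line
INDEX_BITS = NUM_LINES.bit_length() - 1     # 6 bits select the cache line


def split_address(addr):
    """Return (tag, index, offset) for a main memory address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * NUM_LINES   # None marks an empty cache line

    def read(self, addr):
        """Return True on a cache hit; on a miss, load the line and return False."""
        tag, index, _ = split_address(addr)
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag           # the old contents are simply overwritten
        return False
```

In this sketch, two addresses that differ only in their tag bits (for example, addresses 2048 bytes apart) map to the same cache line and evict each other on alternating reads.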
The replacement algorithm used with a direct-mapped cache is trivial. For any given byte in the main memory, there is only one cache line in which the byte can be stored. Therefore, if the cache line is in use, the old contents of the cache line are simply overwritten with the new contents. The act of altering the contents of a cache line after the cache line has been loaded from memory is known in the art as “dirtying” the cache line. “Dirty” cache lines must be written back to main memory before the new contents can be stored in the cache line. If the old contents in the cache line are identical to the contents in main memory, the old contents may be overwritten without having to write back to main memory.
One problem associated with direct-mapped cache memories is that two often-used memory operands may need to be stored in the same cache line. Since the two memory operands will contend for the same cache line, much of the advantage provided by the cache will be lost as the two operands continuously replace each other.
Another cache organization is the associative cache organization. A fully-associative cache simply has a pool of cache lines, and a memory operand can be stored in any cache line. When a memory operand is stored in an associative cache, the address of the memory operand (excluding any bits representing the bytes stored within the cache line) is stored in a tag field associated with the cache line. Whenever a memory operation occurs, the tag fields associated with each cache line are searched to see if the memory operand is stored in the cache. One disadvantage of an associative cache is that all tag fields of all cache lines must be searched, and as the number of cache lines is increased, the time required to search all tag fields (and/or the complexity of the searching logic) also increases.
The set-associative cache organization is a hybrid of the direct-mapped and associative memory organizations. In a set-associative cache, an index portion of the memory address identifies a subset of the cache lines. As above, a tag field is associated with each cache line. However, only the tags of the subset of cache lines identified by the index need be associatively searched. For example, consider a cache having 256 entries organized into 64 subsets, with each subset having four cache lines. Such a memory will have an index comprising six bits. When a memory operation occurs, the index identifies one of the 64 subsets, and the tag fields associated with the four cache lines in the subset are searched to see if the memory operand is in the cache. The set-associative cache organization allows a cache to have many cache lines, while limiting the number of tag fields that must be searched. In addition, memory operands need not contend for the same cache line, as in a direct-mapped cache.
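As a minimal sketch of the 256-entry example above, the following shows the index selecting one of 64 subsets and the search being confined to the four tags of that subset. The 32-byte line size and the names locate and lookup are assumptions introduced for illustration.

```python
NUM_SETS = 64     # 64 subsets, so the index is six bits
WAYS = 4          # four cache lines per subset (256 entries in total)
LINE_BYTES = 32   # assumed line size; the example above does not state one


def locate(addr):
    """Split an address into (index, tag), ignoring the byte-within-line bits."""
    line_addr = addr // LINE_BYTES
    return line_addr % NUM_SETS, line_addr // NUM_SETS


def lookup(sets, addr):
    """Search only the four tags of the indexed subset; return the way or None."""
    index, tag = locate(addr)
    for way, stored_tag in enumerate(sets[index]):
        if stored_tag == tag:
            return way
    return None


# Four addresses that share index 0 can all reside in the cache at once.
sets = [[None] * WAYS for _ in range(NUM_SETS)]
addrs = [k * NUM_SETS * LINE_BYTES for k in range(WAYS)]
for way, addr in enumerate(addrs):
    index, tag = locate(addr)
    sets[index][way] = tag
```

In a direct-mapped cache these four addresses would contend for a single cache line; here each occupies its own line within the subset.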
As used herein, the term “associative set” will be used to refer to all cache lines of a purely associative cache, and to a set of a set-associative cache. When an associative set is full and a new cache line must be stored in the associative set, an algorithm is required to determine which cache line can be replaced. Several such algorithms are known in the art. A “random” algorithm simply picks a cache line at random. While the implementation is simple, the random algorithm provides relatively poor results since there is no correspondence between the cache line contents selected for replacement and the probability that the selected contents will be needed soon.
A better algorithm is the first-in first-out (FIFO) algorithm. This algorithm treats the associative set as a circular queue wherein the cache line contents that have been in the associative set the longest are replaced. This algorithm provides better results than the random algorithm because the algorithm observes cache misses to create correspondence between the cache line selected for replacement and the probability that the cache line will be needed soon. The algorithm works well when all memory contents needed by the CPU are loaded into the cache and other cache misses do not cause the needed memory contents to be replaced. However, the algorithm does not recognize that if a cache line is repeatedly accessed by the CPU, it should not be replaced. The only factor considered is the length of time that the memory contents have been in the cache. The algorithm is slightly more complex to implement than the random algorithm. Typically, a single counter is associated with an associative set and is used to provide an index indicating which cache line is next in line for replacement, and the counter is incremented every time there is a cache miss and an operand is loaded from main memory.
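The single-counter FIFO scheme described above may be sketched as follows. This is an illustrative model only; the class name FifoSet and the use of short strings as tags are assumptions.

```python
class FifoSet:
    """One associative set replaced in FIFO order using a single counter."""

    def __init__(self, ways):
        self.tags = [None] * ways
        self.next_victim = 0   # the single counter associated with the set

    def access(self, tag):
        """Return True on a hit; on a miss, replace the oldest line."""
        if tag in self.tags:
            return True        # hits are not observed: the counter is unchanged
        self.tags[self.next_victim] = tag
        # The counter advances only on a miss, wrapping like a circular queue.
        self.next_victim = (self.next_victim + 1) % len(self.tags)
        return False
```

For example, after loading tags A, B, C and D into a four-line set and then reading A again, a miss on E still replaces A, because only the time of loading is considered.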
One of the best prior art cache replacement algorithms is the least recently used (LRU) algorithm. As the name implies, this algorithm discards the cache line contents that were used least recently. This algorithm tends to be very effective because the algorithm observes both cache hits and cache misses to create correspondence between the cache line selected for replacement and the probability that the cache line will be needed soon. However, the algorithm is relatively complex to implement because a counter value is typically associated with each cache line.
To illustrate how the LRU algorithm functions, consider a full associative set having eight cache lines. A three-bit LRU counter value is associated with each of the cache lines and each counter value is unique, with a counter value of “000” representing the least recently used cache line and a counter value of “111” representing the most recently used cache line. When a cache miss occurs, the memory operand is loaded into the cache line having a counter value of “000”, the counter value of this cache line is set to “111”, and all the other counter values are decremented. When a cache hit occurs, the counter values of all cache lines having a counter value greater than the counter value of the cache line that contains the required memory operand are decremented, and the counter value of the cache line that contains the required operand is set to “111”. Clearly, the logic to implement the LRU algorithm is more complex than the logic required to implement the FIFO algorithm. Other algorithms are known in the art which approximate the LRU algorithm, but are less complex to implement. The LRU algorithm (and to a lesser extent the FIFO algorithm) works well with CPU access patterns because CPUs tend to use the same data and code several times due to loops and data manipulations.
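The counter updates described above may be modeled as in the following sketch, which assumes a full set whose lines are initially ordered from least to most recently used. The class name LruSet and the helper _touch are hypothetical.

```python
class LruSet:
    """A full associative set with one unique LRU counter value per cache line."""

    def __init__(self, tags):
        self.tags = list(tags)
        # Counter 0 ("000") is least recently used; the top value is most recent.
        self.counters = list(range(len(self.tags)))
        self.top = len(self.tags) - 1

    def _touch(self, way):
        """Decrement every counter above this line's value, then set it to the top."""
        old = self.counters[way]
        for i, value in enumerate(self.counters):
            if value > old:
                self.counters[i] = value - 1
        self.counters[way] = self.top

    def access(self, tag):
        """Return True on a hit; on a miss, replace the line whose counter is 0."""
        if tag in self.tags:
            self._touch(self.tags.index(tag))
            return True
        victim = self.counters.index(0)
        self.tags[victim] = tag
        self._touch(victim)   # all other counters are above 0, so all are decremented
        return False
```

Because a hit and a miss share the same update, the counter values remain a permutation of 0 through 7 after every access in an eight-line set.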
As the art of computer design continues to advance, it is becoming apparent that cache memories may also be beneficially used to increase the performance of input/output (I/O) subsystems. In the prior art, it was typical to simply provide a few buffers between an I/O subsystem and a main memory, with the buffers holding no more than a few memory words. However, one problem associated with using caches in I/O subsystems is that the algorithms that work so well with CPU memory access patterns tend to work less well for I/O subsystem memory access patterns because cache lines may be replaced before they are used.
I/O memory accesses tend to be much more linear in nature, and reuse of data stored in the cache is much less likely. To hide the latency of main memory, I/O subsystems tend to “pre-fetch” many cache lines of data. The term “pre-fetch” is known in the art and refers to the process of speculatively loading memory operands into a cache before the operands may be needed by a CPU or I/O subsystem. If a cache line required by an I/O stream of an I/O device is already in the cache (a cache hit), the I/O device will see a very small latency. However, if the cache line is not in the cache (a cache miss) the latency will be quite large. Note that an I/O device can have multiple active I/O streams, and pre-fetching is typically required for each stream.
Ideally, a cache associated with an I/O subsystem would be large enough so that the I/O subsystem could pre-fetch enough cache lines so that all I/O streams of all I/O devices would mostly encounter cache hits. Unfortunately, the number of cache lines required is the maximum number of I/O devices multiplied by the maximum number of I/O streams multiplied by the number of desired pre-fetches, and it is often not practical to provide such a large cache.
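The sizing product described above may be worked through as follows. The figures used (16 devices, 8 streams per device, 8 pre-fetches per stream, 64-byte lines) are hypothetical values chosen purely for illustration and are not taken from the original disclosure.

```python
def required_cache_lines(max_devices, max_streams, prefetch_depth):
    """Cache lines needed so that every stream of every device can mostly hit."""
    return max_devices * max_streams * prefetch_depth


LINE_BYTES = 64                            # assumed cache line size
lines = required_cache_lines(16, 8, 8)     # 1024 cache lines
cache_bytes = lines * LINE_BYTES           # 65536 bytes
```

Because the requirement grows with the product of all three factors, doubling any one of them doubles the required cache size.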
Consider what would happen in a computer system that pre-fetches I/O data into a cache using the prior art LRU or FIFO replacement algorithms discussed above when a large number of open files are written to a disk drive simultaneously. An I/O stream is associated with each file, and data required by each stream is pre-fetched into the cache. Further assume that the cache is filled before the I/O device is ready to accept any data. Both the LRU and FIFO algorithms will discard the contents of the cache lines that were loaded first, even though those cache lines are the ones which are most likely to be needed soon. In other words, using the LRU and FIFO algorithms, a later pre-fetch can cause replacement of cache lines just before an I/O device would have used these cache lines. Accordingly, the cache lines replaced were more important than at least some of the cache lines that were just pre-fetched. Of course, when this occurs the I/O subsystem generates a cache miss and the cache lines that are now needed must be reloaded.
The problem is made worse by the fact that I/O devices and subsystems often communicate by writing and reading from memory locations that are mapped to provide control functions. Once an operand is written to one of these memory locations, it is no longer needed. However, both the LRU and FIFO algorithms will retain the operand longer than necessary. What is needed in the art is a replacement algorithm for use with an I/O subsystem cache that does not replace cache lines just before the cache lines are about to be used, while allowing replacement of cache lines as soon as the cache lines have been used and allowing replacement of cache lines that are not likely to be needed soon.
The present invention relates to a cache memory replacement algorithm that replaces cache lines based on the likelihood that cache lines will not be needed soon. A cache memory in accordance with the present invention is especially useful for buffering input/output (I/O) data as such data is transmitted between I/O devices and a main memory.
A cache memory in accordance with the present invention includes a plurality of cache lines that are accessed associatively. A count entry associated with each cache line stores a count value that defines a replacement class. The count entry is typically loaded with a count value when the cache line is accessed.
In accordance with the present invention, when speculative pre-fetches are performed to load the cache with main memory contents that are expected to be written to an I/O device, a replacement class is associated with each cache line by loading a count value into the count entry of each cache line, and several status bits are updated. Replacement classes are assigned to cache lines based on the likelihood that the contents of cache lines will be needed soon. In other words, data which is likely to be needed soon is assigned a higher replacement class, while data that is more speculative and less likely to be needed soon is assigned a lower replacement class.
When the cache memory becomes full, the replacement algorithm selects for replacement those cache lines having the lowest replacement class. Accordingly, the cache lines selected for replacement contain the most speculative data in the cache that is least likely to be needed soon.
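The selection of a victim by replacement class may be sketched as follows. The sketch is not the disclosed implementation: the assignment policy in prefetch_classes (the nearest pre-fetch receives the highest class and each deeper pre-fetch one class lower) is an assumption made for illustration, as are the function names and the representation of each cache line as a (tag, count) pair. The status bits mentioned above are not modeled.

```python
def prefetch_classes(depth, max_class):
    """Hypothetical policy: deeper, more speculative pre-fetches get lower classes."""
    return [max(max_class - i, 0) for i in range(depth)]


def choose_victim(lines):
    """Return the index of a cache line having the lowest replacement class.

    Each entry of lines is a (tag, count) pair, where count is the value held
    in the count entry of the cache line.
    """
    return min(range(len(lines)), key=lambda i: lines[i][1])


# One stream pre-fetched four lines deep into a four-line associative set.
classes = prefetch_classes(4, 3)
lines = [("stream0+%d" % i, c) for i, c in enumerate(classes)]
victim = choose_victim(lines)   # index 3, the most speculative line
```

Under the FIFO scheme the first line loaded (index 0 here) would be replaced, even though it is the line the I/O device will need next.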
Using prior art cache replacement algorithms, cache lines tend to be replaced based on how long data has been in the cache, how long it has been since data was accessed, or at random. In a cache memory used to buffer I/O data, these prior art replacement algorithms tend to replace cache lines just before they are about to be used, while retaining cache lines that tend to be speculative and will not be needed soon. In the present invention, the cache lines most likely to be needed soon are least likely to be replaced, thereby maximizing the probability of a cache hit.