1. Field of the Invention
The present invention relates to microprocessor architectures and, in particular, to a least recently used (LRU) updating scheme for microprocessor cache memory.
2. Discussion of the Related Art
Digital processors frequently include a small, fast local memory that is used to hold data or instructions likely to be needed in the near future. This local memory is known as a cache. When instructions or data are retrieved from main memory to be used by the processor, they are also stored in the cache, since there is a high probability that they will be needed again in the near future.
The cache is usually constructed from a random access, read/write memory block (RAM). This RAM can access a single stored object (known as a line) in a single processor cycle. The cache size is chosen so that the RAM access time matches the processor cycle time, and the RAM usually can be either read or written, but not both, during a cycle.
Each line in the cache consists of two pieces: the data being saved and the address of the data in main memory (the tag). FIG. 1 shows a block diagram of a simple cache 10. When the processor makes a reference to main memory, a portion of the reference address, called the index, is used to access a single line stored in the cache RAM 12. If the tag of the accessed line in the cache 10 matches the address of the referenced data, then a "hit" has occurred and the cache RAM 12 supplies the line to the processor immediately. If the tag does not match the reference address, then a "miss" has occurred and the address is supplied to main memory to retrieve the requested line. Main memory is usually much slower than the processor, so a delay occurs while the line is being fetched. When main memory delivers the line to the processor, it is written into the cache RAM 12, along with its tag, using the same index as the original look-up. The line is also supplied to the processor so that computation can continue. Commonly, the index is merely the low order bits of the main memory address, although other mappings can be used.
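The look-up sequence described above can be sketched in a few lines of Python. This is only an illustrative model, not the circuit of FIG. 1: the class name `DirectMappedCache`, the dictionary standing in for main memory, and the use of the full address as the tag are all assumptions made for the example, and the number of lines is assumed to be a power of two so that the index is the low order bits of the address.

```python
class DirectMappedCache:
    """Toy model of a direct mapped cache (all names are hypothetical)."""

    def __init__(self, num_lines, main_memory):
        self.num_lines = num_lines          # assumed to be a power of two
        self.main_memory = main_memory      # dict: address -> data
        self.tags = [None] * num_lines      # None marks an empty line
        self.data = [None] * num_lines

    def read(self, address):
        """Return (line, hit) for a reference to `address`."""
        index = address & (self.num_lines - 1)   # low order bits of the address
        if self.tags[index] == address:          # tag matches: hit
            return self.data[index], True
        line = self.main_memory[address]         # miss: fetch from main memory
        self.tags[index] = address               # write the tag and the line
        self.data[index] = line                  # at the same index
        return line, False


memory = {0x10: "a", 0x110: "b"}
cache = DirectMappedCache(256, memory)
first = cache.read(0x10)    # miss, line is fetched and cached
second = cache.read(0x10)   # hit, line comes from the cache
```

The first reference misses and fills the cache; the second reference to the same address hits.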
The caching scheme shown in FIG. 1 is known as a direct mapped cache. That is, for a given main memory address, the line at that memory address can be placed into exactly one place in the cache. This type of cache is the simplest and fastest type of cache. When a miss occurs, a replacement line can go into only one location in the cache and bumps out whatever happens to be in that cache location.
Direct mapped caches can have pathological miss behavior. For example, assume the existence of a 256 line, direct mapped cache. The index into the cache will be the low order 8 bits of the main memory address. If two different main memory addresses happen to be the same in the low order 8 bits, then they will produce the same cache index and compete for the same line in the cache. Even though the rest of the cache may be completely unused, accesses in succession to those memory locations will constantly miss. Each access bumps out the data for the subsequent access.
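The pathological case can be reproduced with a small simulation. The sketch below assumes the 256 line cache of the example and counts misses for a sequence of addresses; the function name `count_misses` and the sample addresses are hypothetical and chosen only for illustration.

```python
def count_misses(addresses, num_lines=256):
    """Count misses in a direct mapped cache indexed by the low order bits."""
    tags = [None] * num_lines
    misses = 0
    for address in addresses:
        index = address & (num_lines - 1)   # low order 8 bits when num_lines is 256
        if tags[index] != address:
            misses += 1
            tags[index] = address           # the new line bumps out the old one
    return misses


# 0x0010 and 0x0110 share the same low order 8 bits, so they compete for one line.
conflicting = [0x0010, 0x0110] * 4
# 0x0010 and 0x0111 map to different lines, so only the first accesses miss.
friendly = [0x0010, 0x0111] * 4

conflict_misses = count_misses(conflicting)   # 8: every access misses
friendly_misses = count_misses(friendly)      # 2: one cold miss per address
```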
To reduce the probability of pathological behavior, and to increase the overall hit rate, caches are frequently constructed with associativity. The degree of associativity measures the number of different places in the cache that a given line from main memory can be placed. Common associativities are 2-way, 4-way, or 8-way, which means that a given line from main memory can be placed in 2, 4, or 8 different places in the cache, respectively. Having more than one location where lines can be placed in the cache reduces the probability of pathological miss behavior and improves the hit rate of the cache.
If a line is to be brought into an N-way associative cache, there will be N different locations in the cache where it can be placed. A choice must be made by the cache controller as to which one of the N different locations is to receive the replacement line. The first choice would be to use an invalid or empty cache location; but eventually all locations get filled and a valid line in the cache must be replaced. A common algorithm used to determine which line to replace is referred to as least recently used (LRU). Using the LRU algorithm, each of the N locations in the cache in which a particular line can be placed has its usage tracked. When a new line must replace an existing line, the line whose last use was farthest back in time is chosen to be replaced.
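The replacement choice can be illustrated as follows. This sketch assumes that usage is tracked with a per-location "last used" time stamp, which is only one possible way to model the tracking; the function name `choose_victim` and its arguments are hypothetical.

```python
def choose_victim(valid, last_used):
    """Pick the location that receives the replacement line.

    valid[way] is True if the location holds a valid line;
    last_used[way] is the (assumed) time of its most recent use.
    """
    # First choice: an invalid or empty location.
    for way, is_valid in enumerate(valid):
        if not is_valid:
            return way
    # Otherwise: the location whose last use is farthest back in time.
    return min(range(len(valid)), key=lambda way: last_used[way])


empty_choice = choose_victim([True, False, True, True], [5, 0, 3, 9])  # way 1
lru_choice = choose_victim([True, True, True, True], [5, 7, 3, 9])     # way 2
```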
To determine which of the N lines is to be replaced, an ordered list is maintained. When a cache line gets used (i.e., a hit occurs), it is moved to the head of the list. The cache line at the end of the list will always be the least recently used line and is the choice for replacement if that becomes necessary.
For example, assume the existence of a 5-way associative cache. A given line in main memory can then be placed into five different locations in the cache, the places being numbered from 0 to 4. The initial LRU list is arbitrarily initialized to be:
0-1-2-3-4
If a cache access and hit occurs for cache line 2, then it gets moved to the head of the list:
2-0-1-3-4
If a cache miss occurs, then the new line from main memory will be placed into cache line 4, replacing whatever was previously in line 4, and the LRU list gets updated:
4-2-0-1-3
Each time a cache line gets used (hit or miss) it moves to the beginning of the list. Each time a cache line is replaced, the line at the end of the list is chosen for the replacement.
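The list maintenance for the 5-way example can be sketched as follows. The helper names `touch` and `replace` are hypothetical; the list is modeled as a Python list with the most recently used line at position 0.

```python
def touch(lru_list, way):
    """Move `way` to the head of the list (most recently used)."""
    return [way] + [w for w in lru_list if w != way]


def replace(lru_list):
    """On a miss, the line at the end of the list is replaced and then
    becomes the most recently used line."""
    victim = lru_list[-1]
    return victim, touch(lru_list, victim)


lru = [0, 1, 2, 3, 4]          # arbitrary initial order
after_hit = touch(lru, 2)      # [2, 0, 1, 3, 4]
victim, after_miss = replace(after_hit)   # victim 4, list [4, 2, 0, 1, 3]
```

Running the hit on line 2 followed by a miss reproduces the sequence of lists given in the example.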
A possible implementation of the LRU list encodes the list into a binary number. In general, there are $N!$ possible LRU lists for each group of lines in an N-way associative cache, where $N! = N \times (N-1) \times (N-2) \times \cdots \times 2 \times 1$. This means a minimum of $\lceil \log_2(N!) \rceil$ bits are needed for each list, where $\lceil x \rceil$ is the smallest integer $\geq x$. Table I shows the minimum number of bits needed to encode the LRU list for various associativities.
TABLE I
______________________________________
Associativity    LRU Bits
______________________________________
2                1
3                3
4                5
5                7
6                10
7                13
8                16
______________________________________
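The entries of Table I can be checked with a short calculation. The sketch below uses integer arithmetic rather than a floating point logarithm: for any integer m of at least 1, the smallest number of bits that can distinguish m values equals the bit length of m - 1. The function name `lru_bits` is hypothetical.

```python
import math


def lru_bits(associativity):
    """Minimum bits needed to encode one of N! possible LRU lists."""
    orderings = math.factorial(associativity)
    # (m - 1).bit_length() equals ceil(log2(m)) for every integer m >= 1.
    return (orderings - 1).bit_length()


table = {n: lru_bits(n) for n in range(2, 9)}
# {2: 1, 3: 3, 4: 5, 5: 7, 6: 10, 7: 13, 8: 16}
```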
FIG. 2 shows a sample 3-way associative cache 20 built out of three banks of direct mapped caches 22. The LRU RAM block 24 in FIG. 2 is a RAM that contains the same number of entries as any one of the direct mapped caches, but is only three bits wide (from Table I). All of the indexes into the three direct mapped caches 22 and the LRU RAM 24 are the same. The hit signals from all of the direct mapped caches 22 are combined by select logic 26 into a single hit signal, and are also used to control a multiplexer 28. The multiplexer 28 selects the correct line to be delivered as the cache output from whichever bank 22 happened to hit.
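The data path of FIG. 2 can be modeled roughly as follows. This is a behavioral sketch only: each bank is represented as a hypothetical pair of Python lists (tags and data), the full address is used as the tag, and the function name `associative_lookup` is an assumption made for the example.

```python
def associative_lookup(banks, index, address):
    """Probe every bank at the same index, then select the bank that hit.

    `banks` is a list of (tags, data) pairs, one pair per direct mapped bank.
    Returns (hit, line); line is None on a miss.
    """
    hits = [tags[index] == address for tags, _ in banks]   # per-bank hit signals
    hit = any(hits)                                        # role of select logic 26
    line = banks[hits.index(True)][1][index] if hit else None   # role of multiplexer 28
    return hit, line


# Three banks of four lines each; place one line in bank 1 at index 2.
banks = [([None] * 4, [None] * 4) for _ in range(3)]
banks[1][0][2] = 0x42
banks[1][1][2] = "payload"

found = associative_lookup(banks, 2, 0x42)     # (True, "payload")
missing = associative_lookup(banks, 2, 0x99)   # (False, None)
```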
The fundamental problem with encoding the LRU list in the minimum number of bits is that the LRU RAM 24 usually must be cycled twice for each cache access. The LRU bits must be read and then written (if a change is needed), because the value to be written back into the LRU RAM 24 is a function of both the "way" that hits and the LRU list that was read. The data areas of the cache require a single cycle, so if a sequence of consecutive cache accesses is to be supported, then the LRU RAM 24 must either cycle twice as fast as the cache RAMs 22 (which is possible, but difficult), or it must support a read and a write operation to different locations simultaneously (dual port). A dual port memory is generally twice as large as, and slower than, a single port memory, and has a completely different internal array structure. In addition, care must be taken when simultaneously reading and writing to avoid coupling noise into the read port.
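The read-then-write dependency can be made concrete with a 3-way example. The sketch below assumes an arbitrary encoding in which each of the 3! = 6 possible lists is assigned a code from 0 to 5 (which fits in three bits); the encoding, the names `ORDERS`, `CODE_OF`, `next_code`, and `access`, and the cycle count returned are all illustrative assumptions rather than a description of any particular hardware.

```python
from itertools import permutations

WAYS = 3
ORDERS = list(permutations(range(WAYS)))                  # the 6 possible LRU lists
CODE_OF = {order: code for code, order in enumerate(ORDERS)}


def next_code(old_code, hit_way):
    """The new code depends on both the list that was read and the way that hit."""
    order = ORDERS[old_code]
    new_order = (hit_way,) + tuple(w for w in order if w != hit_way)
    return CODE_OF[new_order]


def access(lru_ram, index, hit_way):
    """Update the LRU RAM for one access; return the RAM cycles it needed."""
    old_code = lru_ram[index]                 # cycle 1: read the stored list
    new_code = next_code(old_code, hit_way)
    if new_code == old_code:
        return 1                              # no change, no write needed
    lru_ram[index] = new_code                 # cycle 2: write the updated list
    return 2


lru_ram = [0] * 4                             # code 0 is the list 0-1-2
cycles_first = access(lru_ram, 1, 2)          # list changes to 2-0-1: two cycles
cycles_repeat = access(lru_ram, 1, 2)         # way 2 already at the head: one cycle
```

Because `next_code` cannot be evaluated until the old code has been read, the write cannot be overlapped with the read of the same entry.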
U.S. Pat. No. 5,325,504, issued Jun. 28, 1994, to R. E. Tipley et al. for "Method and Apparatus for Incorporating Cache Line Replacement and Cache Write Policy Information into Tag Directories in a Cache System" discloses techniques for implementing an LRU scheme for a 2-way associative cache. Each "way" of the cache contains a "partial LRU" bit which is updated when the "way" is used such that the LRU "way" can be determined by the exclusive-OR of the LRU bits from each of the two "ways." The '504 patent also discloses how to do "pseudo LRU" with associativity greater than two by dividing the "ways" into two groups, doing LRU between the two groups and then within a single group.
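A 2-way scheme of this general kind can be sketched as follows. The bit convention used here is an assumption made for illustration and is not taken from the '504 patent: the exclusive-OR of the two partial LRU bits is taken to be the number of the LRU "way," and only the bit belonging to the "way" being used is rewritten. The function names `lru_way` and `use_way` are hypothetical.

```python
def lru_way(bits):
    """Assumed convention: the XOR of the two partial bits names the LRU way."""
    return bits[0] ^ bits[1]


def use_way(bits, way):
    """Update only the partial bit of the way being used.

    After the update the other way must be the LRU way, so this way's bit
    is set to (other way number) XOR (other way's partial bit).
    """
    other = 1 - way
    bits[way] = other ^ bits[other]


bits = [0, 0]
use_way(bits, 0)
lru_after_way0 = lru_way(bits)   # way 1 is now least recently used
use_way(bits, 1)
lru_after_way1 = lru_way(bits)   # way 0 is now least recently used
```

Under this assumed convention, each update writes a single bit stored with the "way" that was used, while the other bit is only read.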