Caches; In General
Conventional cache memory systems are well known in the art. A central processing unit ("CPU") reads information, such as data and instructions, from main memory in order to execute a computer program. Main memories are typically implemented with slower, less costly memory devices for the storage of information. Compared to the cycle time of a typical CPU, the time required to access main memory and to retrieve information needed by the CPU is relatively long. Valuable time may be lost while the CPU waits for information to be fetched from main memory.
Cache memory is used to minimize memory access time in mainframe computers, minicomputers, and microprocessors. A cache memory typically provides a relatively high speed memory interposed between the slower main memory and a CPU to improve effective memory access rates, thus improving the overall performance and processing speed of the CPU by decreasing the apparent amount of time required to fetch information from main memory. High-speed cache memory may thus be used to shorten the effective memory access time.
In common usage, the term "cache" refers to a hiding place. The name "cache memory" is an appropriate term for this high speed memory that is interposed between a CPU and main memory because cache memory is hidden from the user or programmer, and thus appears transparent. Cache memory serves as a buffer between the CPU and main memory, and is not user addressable. The user is only aware of an apparently higher-speed main memory.
Cache memory is generally smaller than main memory because cache memory employs relatively expensive high speed memory devices, such as a static random access memory ("SRAM"). Therefore, cache memory typically will not be large enough to hold all of the information needed during program execution. When cache memory is full, information must be replaced, or "overwritten" with new information from main memory when the new information is necessary for processing. The information in main memory is typically updated each time the CPU changes the information in cache memory (a process called "store-through"). As a result, changes made to information in cache memory will not be lost when new information enters cache memory and overwrites information which may have been changed by the CPU.
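The store-through behavior described above can be illustrated with a minimal sketch. The class name StoreThroughCache, the dictionary-based model of main memory, and the choice to overwrite the oldest entry when the cache is full are all assumptions made for illustration; they are not taken from any particular system.

```python
class StoreThroughCache:
    """Toy store-through cache in front of a dict acting as main memory."""

    def __init__(self, main_memory, capacity):
        self.main_memory = main_memory  # address -> value (hypothetical model)
        self.capacity = capacity        # maximum number of cached entries
        self.lines = {}                 # address -> value, in insertion order

    def write(self, address, value):
        # Store-through: main memory is updated on every CPU write.
        self.main_memory[address] = value
        self._install(address, value)

    def read(self, address):
        if address in self.lines:
            return self.lines[address]
        value = self.main_memory[address]
        self._install(address, value)
        return value

    def _install(self, address, value):
        if address not in self.lines and len(self.lines) >= self.capacity:
            # Cache is full: overwrite the oldest entry (an assumed policy).
            # Nothing is lost, because main memory already holds every write.
            oldest = next(iter(self.lines))
            del self.lines[oldest]
        self.lines[address] = value
```

In this sketch, a value written through the cache remains available from main memory even after its cache entry has been overwritten by newer information.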
Information is only temporarily stored in cache memory during program execution. When information must be retrieved from memory, the system will first determine whether the information is currently stored in cache memory. If so, the information may be quickly retrieved from the relatively high speed cache memory. When the information that is sought to be retrieved is currently located in cache memory, this is commonly referred to as a "cache hit". A cache hit yields a significant savings in program execution time. When the information that is sought to be retrieved is not currently stored in cache memory, a situation commonly referred to as a "cache miss" occurs. A cache miss requires that the desired information be retrieved, in a relatively slow manner, from main memory and then placed in cache memory. Cache memory updating and replacement schemes attempt to maximize the number of cache hits, and to minimize the number of cache misses.
Set Associative Cache Memories
The minimum unit of information that can be either present or not present in a cache memory is referred to as a "memory block". Memory blocks can be placed and retained in cache memory through three distinct organizations. First, a cache memory is said to be "direct mapped" if each block can be placed in only one place in the cache memory. Second, a cache memory is said to be "fully associative" if a block can be placed anywhere in the cache memory. Third, a cache memory is said to be "set associative" if a block can only be placed in a restrictive set of places in the cache memory, namely, in a specified "set" of the cache memory. Computer systems ordinarily utilize a variation of set associative mapping to keep track of the blocks that have been copied from main memory into cache memory.
The organization of a set associative cache memory resembles a matrix. That is, a set associative cache memory is divided into different "sets" (analogous to the rows of a matrix) and different "columns" (analogous to the columns of a matrix). Thus, each block of a set associative cache memory is mapped or placed within a given set and within a given column. The number of columns, i.e., the number of blocks in each set, determines the number of "ways" of the cache memory. Thus, a cache memory with four columns (four blocks within each set) is deemed to be "4-way set associative."
Set associative cache memories include addresses for each block in the cache memory. Addresses are divided into three different fields. First, a "block-offset field" is utilized to select the desired information from a block. Second, an "index field" specifies the set of cache memory where a block is mapped. Third, a "tag field" is used for the purposes of comparison.
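The division of an address into its three fields can be sketched as follows. The function name split_address is hypothetical, and the sketch assumes the conventional bit layout in which the block-offset field occupies the lowest bits, the index field the middle bits, and the tag field the remaining high bits; the text above does not fix a particular layout.

```python
def split_address(address, offset_bits, index_bits):
    """Split an address into (tag, index, offset) fields.

    Assumed layout, high bits to low bits: tag | index | block offset.
    """
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


# Example: 16-byte blocks (4 offset bits) and 64 sets (6 index bits).
tag, index, offset = split_address(0x12345, offset_bits=4, index_bits=6)
```

With these assumed field widths, address 0x12345 yields a tag of 72, an index of 52, and a block offset of 5.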
When a request originates in a CPU for new information, the index field selects a set of cache memory. The tag field of every block in the selected set is compared to the tag field sought by the CPU. If the tag field of some block matches the tag field sought by the CPU, a "cache hit" occurs and information from the block is obtained directly from the high speed cache memory. If no match occurs, a "cache miss" occurs and the cache memory must be updated. Cache memory is updated by retrieving the desired block from main memory and then mapping this block into the set associative cache.
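The tag comparison described above might be sketched as shown below. The representation of the cache as a list of sets, each holding one entry per column, and the function name lookup are assumptions made for illustration. Hardware typically compares all tags of the selected set in parallel, whereas this sketch compares them one at a time.

```python
def lookup(cache_sets, tag, index):
    """Return (hit, column) for the block with the given tag in set `index`.

    cache_sets is a list of sets; each set is a list with one entry per
    column, where an entry is None (empty) or a dict with a "tag" key.
    """
    for column, entry in enumerate(cache_sets[index]):
        if entry is not None and entry["tag"] == tag:
            return True, column   # cache hit: block found in this column
    return False, None            # cache miss: block must be fetched


# A 4-way cache with two sets, holding one block in set 1, column 2.
cache_sets = [[None] * 4 for _ in range(2)]
cache_sets[1][2] = {"tag": 72, "data": b"example block"}
```

Here a request for tag 72 in set 1 is a hit in column 2, while the same tag sought in set 0 is a miss.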
When a "cache miss" takes place, a block is first mapped with respect to a set, and then mapped with respect to a column. That is, the index field of a block retrieved from main memory specifies the set of cache memory wherein the block will be mapped. A "replacement scheme" is then relied upon to choose the particular block of the set that will be replaced. In other words, a replacement scheme determines the column where the block will be located. The object of a replacement scheme is to select for replacement the block of the set that is least likely to be needed in the near future so as to minimize further cache misses.
Replacement Schemes
Several factors contribute to the optimal utilization of cache memory in computer systems: cache memory hit ratio (probability of finding a requested item in cache), cache memory access time, delay incurred due to a cache memory miss, and time required to synchronize main memory with cache memory (store-through). In order to minimize delays incurred when a cache miss is encountered, as well as improve cache memory hit rates, an appropriate cache memory replacement scheme is needed.
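The interaction of the first three factors can be summarized by the standard expression for average memory access time. The symbols below are introduced here for illustration only: h is the cache memory hit ratio, t_cache is the cache memory access time, and t_miss is the additional delay incurred on a cache miss. This simplified expression does not account for the time required for store-through.

```latex
t_{\mathrm{avg}} = t_{\mathrm{cache}} + (1 - h)\, t_{\mathrm{miss}}
```

Under this model, raising the hit ratio h toward 1 brings the average access time toward the cache access time alone, which is why the replacement scheme matters.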
Set associative cache memory replacement schemes may be divided into two basic categories: non-usage based and usage based. Non-usage based replacement schemes, which include first in, first out ("FIFO") and "random" replacement schemes, make replacement selections on some basis other than memory usage. Usage based schemes, which include the least recently used ("LRU") replacement scheme, take into account the history of memory usage.
FIFO replacement schemes replace the block of a given set of cache memory which has been contained in the given set for the longest period of time. A FIFO replacement scheme may be implemented using a simple modulo N counter for each set, where N equals the number of columns or ways of the cache memory (i.e., the number of blocks contained in each set). The modulo N counter is incremented each time a block is replaced within a given set of cache memory. The modulo N counter thus points to the next block of a given set of cache memory that is selected for replacement.
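A minimal sketch of the per-set modulo N counter follows. The class name FifoSet and the use of a list of tags to model one set are assumptions made for illustration.

```python
class FifoSet:
    """One set of an N-way cache with FIFO replacement (illustrative)."""

    def __init__(self, num_ways):
        self.tags = [None] * num_ways  # tag held in each column, None if empty
        self.counter = 0               # modulo N counter: next column to replace

    def access(self, tag):
        """Return True on a cache hit, False on a cache miss."""
        if tag in self.tags:
            return True                # hit: usage does not affect FIFO order
        # Miss: replace the block the counter points to, then advance it.
        self.tags[self.counter] = tag
        self.counter = (self.counter + 1) % len(self.tags)
        return False


fifo_set = FifoSet(num_ways=4)
results = [fifo_set.access(tag) for tag in [1, 2, 3, 4, 1, 5, 1]]
```

In this example the second request for block 1 is a hit, yet the request for block 5 still replaces block 1 because it has been in the set the longest, so the final request for block 1 is a miss. This illustrates why FIFO is classified as non-usage based.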
A random replacement scheme is another non-usage based replacement scheme that can be implemented to randomly replace a block from a given set. One implementation of a random cache memory replacement scheme entails a single modulo N counter where N equals the number of columns or ways of the cache memory (i.e., the number of blocks contained in each set). The modulo N counter may be randomly incremented such that the counter randomly selects a block of a given set of cache memory for replacement.
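One way to sketch the randomly incremented modulo N counter is shown below. The class name RandomReplacer and the use of a software pseudo-random generator to choose the increment are assumptions; a hardware implementation would obtain its randomness by other means.

```python
import random


class RandomReplacer:
    """Single modulo N counter advanced by random amounts (illustrative)."""

    def __init__(self, num_ways, rng=None):
        self.num_ways = num_ways
        self.counter = 0
        # A seeded generator may be passed in for repeatable experiments.
        self.rng = rng if rng is not None else random.Random()

    def next_victim(self):
        """Return the column selected for replacement."""
        step = self.rng.randrange(self.num_ways)  # random increment, 0..N-1
        self.counter = (self.counter + step) % self.num_ways
        return self.counter


replacer = RandomReplacer(num_ways=4, rng=random.Random(0))
victims = [replacer.next_victim() for _ in range(100)]
```

Every selected column lies in the range 0 to N-1, and the selection is independent of how recently any block was used.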
Typically, usage based cache replacement schemes have been used in an attempt to select the least necessary block of a set for replacement when space is needed in a full cache memory. A popular replacement scheme for this purpose is the least recently used ("LRU") replacement scheme. According to the LRU replacement scheme, the least recently used block of information in cache memory is overwritten by the newest entry into cache memory. In theory, LRU replacement schemes yield a higher cache memory hit ratio than many FIFO or random replacement schemes. However, LRU replacement schemes have significant drawbacks. In particular, the implementation of a conventional LRU replacement scheme requires a separate data structure for each block to keep track of utilization chronology. Computations are carried out using such data structures for each block in order to determine the least recently used block of a given set. This results in a significantly larger cache. The increase in the size of a cache is, however, often not justified by the small resulting increase in the hit ratio.
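The per-block bookkeeping that LRU requires can be sketched as follows. The class name LruSet and the use of a per-block timestamp drawn from a shared access counter are assumptions; other encodings of utilization chronology are possible.

```python
class LruSet:
    """One set of an N-way cache with LRU replacement (illustrative)."""

    def __init__(self, num_ways):
        self.tags = [None] * num_ways
        self.last_used = [0] * num_ways  # separate usage record per block
        self.clock = 0                   # counts accesses to this set

    def access(self, tag):
        """Return True on a cache hit, False on a cache miss."""
        self.clock += 1
        if tag in self.tags:
            column = self.tags.index(tag)
            self.last_used[column] = self.clock  # record the use
            return True
        # Miss: scan every usage record to find the least recently used block.
        victim = min(range(len(self.tags)), key=lambda c: self.last_used[c])
        self.tags[victim] = tag
        self.last_used[victim] = self.clock
        return False


lru_set = LruSet(num_ways=4)
results = [lru_set.access(tag) for tag in [1, 2, 3, 4, 1, 5]]
```

In this example block 1 is used again before block 5 arrives, so block 2 is replaced and block 1 stays in the set. The extra storage (last_used) and the scan on every miss correspond to the overhead described above.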
An LRU replacement scheme assumes that the least recently used block of a given set is also the block that is least likely to be reused in the immediate future. An LRU replacement scheme thus replaces the least recently used block of a given set with a new block of information that must be copied from main memory. An LRU replacement scheme is not, however, always optimal with respect to improving the probability of a cache hit. In some situations, an LRU replacement scheme may achieve an undesirable hit ratio of less than 50%, and in some cases it fails catastrophically.
An LRU replacement scheme fails catastrophically when, as is often the case, a cache memory includes fewer columns than the number of blocks which sequentially contend for a given set. An LRU replacement scheme always replaces the least recently used block of information when a cache miss occurs. Unfortunately, in this instance, the least recently used block may be the block that will be required for processing in the near future. This results in a replacement cycle wherein necessary blocks are never in the given set because they were recently replaced by the LRU replacement scheme.
To further clarify the catastrophic failure scenario, assume there are five blocks numbered one, two, three, four and five. The five blocks are each specified to be placed in the same set of cache memory pursuant to the index field of their respective addresses. In this example, cache memory includes only four columns so that only four of the blocks may be contained in the set at any one time. The LRU replacement scheme will try to shuffle the five blocks in and out of the four columns. Assume that the blocks contain instruction sets that are executed cyclically and sequentially and that the set initially holds blocks one, two, three, and four. After executing the instruction set of block four, the CPU will request block five. A cache miss will occur when requesting block five because block five is not in the set. LRU will attempt to put block five into cache memory. Because the set is full, LRU will select the least recently used block for replacement, to make room for block five. In this example, after executing block four, block one is the least recently used block. Thus, the LRU scheme replaces block one, the least recently used block, with block five. Once the instruction set of block five is executed, block one is requested. Unfortunately, the LRU scheme replaced block one, and block one is not currently in the set. Thus, another cache memory miss is incurred for the block one request. Block one must therefore be reloaded into the set from main memory. The least recently used block in cache memory is now block two, so the LRU replacement scheme replaces block two with block one. In this scenario, a cache memory hit will never be achieved because the LRU replacement scheme always selects the least recently used block for replacement, but that block is also the next required block. Thus, the LRU replacement scheme fails catastrophically and is detrimental in such a situation.
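The scenario above can be reproduced with a short simulation. The function name count_lru_hits is hypothetical, and the sketch assumes the set starts empty, so the first four misses simply load blocks one through four and establish the initial state described above.

```python
from collections import OrderedDict


def count_lru_hits(references, num_ways):
    """Count cache hits for one set under LRU replacement."""
    resident = OrderedDict()  # least recently used first, most recent last
    hits = 0
    for block in references:
        if block in resident:
            hits += 1
            resident.move_to_end(block)       # mark as most recently used
        else:
            if len(resident) >= num_ways:
                resident.popitem(last=False)  # replace least recently used
            resident[block] = True
    return hits


# Five blocks executed cyclically and sequentially, ten full passes.
references = [1, 2, 3, 4, 5] * 10
hits_four_ways = count_lru_hits(references, num_ways=4)
hits_five_ways = count_lru_hits(references, num_ways=5)
```

With four columns, all 50 requests miss, giving zero hits. With a fifth column, only the first five requests miss and the remaining 45 are hits, which shows that the failure arises from having one fewer column than the number of contending blocks.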