Most computer systems employ a multilevel hierarchy of memory systems, with relatively fast, expensive, limited-capacity memory at the highest level of the hierarchy and proceeding to relatively slower, lower cost, higher-capacity memory at the lowest level of the hierarchy. Typically, the hierarchy includes a small fast memory called a cache, either physically integrated within a processor integrated circuit or mounted physically close to the processor for speed. There may be separate instruction caches and data caches. There may be multiple levels of caches.
If a processor requests an item from a cache and the item is present in the cache, the event is called a cache hit. If a processor requests an item from a cache and the item is not present in the cache, the event is called a cache miss. In the event of a cache miss, the requested item is retrieved from a lower level of the memory hierarchy. In many processor designs, the time required to access an item for a cache hit is one of the primary limiters for the clock rate of the processor. Therefore, optimization of access time for cache hits is critical for performance. There is an ongoing need for improvement in cache access time for computer processors.
Typically, a memory is organized into words (for example, 32 bits or 64 bits per word). Typically, the minimum amount of memory that can be transferred between a cache and a next lower level of the memory hierarchy is called a line, or sometimes a block. A line is typically multiple words (for example, 16 words per line). Memory may also be divided into pages (also called segments), with many lines per page. In some systems, page size may be variable.
In many modern computer memory architectures, a CPU produces virtual addresses that are translated by a combination of hardware and software to physical addresses, which access physical main memory. A group of virtual addresses may be dynamically assigned to each page. Virtal memory (paging or segmentation) requires a data structure, sometimes called a page table, that translates the virtual address to the physical address. To reduce address translation time, computers commonly use a specialized associative cache dedicated to address translation, commonly called a Translation Look-aside Buffer (TLB).
If a cache stores an entire line address along with the data and any line can be placed anywhere in the cache, the cache is said to be fully associative. However, for a large cache in which any line can be placed anywhere, the hardware required to rapidly determine if an entry is in the cache (and where) may be very large and expensive. For large caches, a faster, space saving alternative is to use a subset of an address (called an index) to designate a line position within the cache, and then store the remaining set of more significant bits of each physical address (called a tag) along with the data. In a cache with indexing, an item with a particular address can be placed only within a set of lines designated by the index. If the cache is arranged so that the index for a given address maps to exactly one line in the subset, the cache is said to be direct mapped. If the index maps to more than one line in the subset, the cache is said to be set-associative. All or part of an address is hashed to provide a set index which partitions the address space into sets.
With direct mapping, when a line is requested, only one line in the cache has matching index bits. Therefore, the data can be retrieved immediately and driven onto a data bus before the system determines whether the rest of the address matches. The data may or may not be valid, but in the usual case where it is valid, the data bits are available on a bus before the system determines validity. With set-associative caches, it is not known which line corresponds to an address until the full address is compared. That is, in set-associative caches, the result of tag comparison is used to select which line of data bits within a set of lines is presented to the processor.
In a cache with a TLB, the critical timing path for a hit requires a sequence of four operations as follows: (1) a virtual tag must be presented to a Content Addressable Memory (CAM) in the TLB to determine the location of a corresponding physical tag in RAM in the TLB; (2) the physical tag must then be retrieved from the TLB RAM; (3) the physical tag from TLB RAM must then be compared to physical tags accessed from the tag section of the cache; (4) the appropriate data line must be selected. This sequence of four operations is preferably completed in a single phase of a clock cycle, and may in fact limit how short the phase can be. Each of the four operations takes time and a great deal of design effort is dedicated to making each operation as fast as possible. There is a need for improvement in the overall time for the critical time path for accesses to caches with TLB's.