Most computer systems employ a multilevel hierarchy of memory systems, with relatively fast, expensive, limited-capacity memory at the highest level of the hierarchy and proceeding to relatively slower, lower cost, higher-capacity memory at the lowest level of the hierarchy. Typically, the hierarchy includes a small fast memory called a cache, either physically integrated within a processor integrated circuit or mounted physically close to the processor for speed. There may be separate instruction caches and data caches. There may be multiple levels of caches. An item that is fetched from a lower level in the memory hierarchy typically evicts (replaces) an item from the cache. The selection of which item to evict may be determined by a replacement algorithm. The present patent document is concerned with replacement strategies and algorithms, as explained further below.
The goal of a memory hierarchy is to reduce the average memory access time. A memory hierarchy is cost effective only if a high percentage of items requested from memory are present in the highest levels of the hierarchy (the levels with the shortest latency) when requested. If a processor requests an item from a cache and the item is present in the cache, the event is called a cache hit. If a processor requests an item from a cache and the item is not present in the cache, the event is called a cache miss. In the event of a cache miss, the requested item is retrieved from a lower level (longer latency) of the memory hierarchy. This may have a significant impact on performance. The average memory access time may be reduced by improving the cache hit/miss ratio, reducing the time penalty for a miss, and reducing the time required for a hit. The present patent document is primarily concerned with improving the hit/miss ratio of a cache.
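As a rough illustration of the three factors just named, the average access time of a single cache level is commonly modeled as the hit time plus the miss rate multiplied by the miss penalty. The sketch below uses that textbook model. The function name and the cycle counts are illustrative assumptions and do not come from the present document.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Textbook single-level model: every access pays hit_time, and a
    fraction miss_rate of accesses additionally pays miss_penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Hypothetical numbers: 1-cycle hit time, 50-cycle miss penalty.
baseline = average_access_time(1, 0.10, 50)  # 10% of accesses miss
improved = average_access_time(1, 0.05, 50)  # 5% of accesses miss
```

Under these assumed numbers, halving the miss rate lowers the modeled average access time even though the hit time and the miss penalty are unchanged.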
Ideally, an item is placed in the cache only if it is likely to be referenced again soon. Items having this property are said to have locality. Items having little or no reuse "pollute" a cache and ideally should never be placed in a cache. There are two types of locality, temporal and spatial. Temporal locality means the very same item is likely to be referenced again soon. Spatial locality means that items having addresses near the address of a recently referenced item are likely to be referenced soon. For example, sequential data streams and sequential instruction streams typically have high spatial locality and little temporal locality. Since data streams often have a mixture of temporal and spatial locality, performance may be reduced because sections of the data stream that are inherently random or sequential can flush items out of the cache that are better candidates for long term reference. Typically, the minimum amount of memory that can be transferred between a cache and a next lower level of the memory hierarchy is called a line, or sometimes a block or page. Typically, spatial locality is accommodated by increasing the size of the unit of transfer (line, block, page). In addition, if a data stream is sequential in nature, prefetching can also be used. There are practical limits to the size of cache lines, and prefetching can flush lines that may soon be reused from the cache. The present patent document is primarily concerned with strategies for ensuring that lines having the highest probability of reuse remain in a cache, and for ensuring that lines having a lower probability of reuse do not evict lines having a higher probability of reuse.
If a cache stores an entire line address along with the data and any line can be placed anywhere in the cache, the cache is said to be fully associative. However, for a large cache in which any line can be placed anywhere, the hardware required to rapidly determine whether an entry is in the cache (and where) may be very large and expensive. For large caches, a faster, space-saving alternative is to use a subset of an address (called an index) to designate a line position within the cache, and then store the remaining set of more significant bits of each physical address (called a tag) along with the data. In a cache with indexing, an item with a particular address can be placed only at the one place (set of lines) within the cache designated by the index. If the cache is arranged so that the index for a given address maps to exactly one line in the subset, the cache is said to be direct mapped. In general, large direct mapped caches can have a shorter access time for a cache hit relative to associative caches of the same size. However, direct mapped caches have a higher probability of cache misses relative to associative caches of the same size because many lines of memory map to each available space in the direct mapped cache. If the index maps to more than one line in the subset, the cache is said to be set associative. All or part of an address is hashed to provide a set index, which partitions the address space into sets. For a direct mapped cache, since each line can only be placed in one place, no algorithm is required for replacement. In general, all caches other than direct mapped caches require an algorithm for replacement. That is, when an index maps to more than one line of memory in a cache set, a choice must be made as to which line to replace. The present patent document is concerned with any cache in which a replacement algorithm is required for determining which item in the cache to evict when a new item is added to the cache. Therefore, the present patent document is primarily concerned with set associative or fully associative caches.
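The index and tag arrangement described above can be sketched as follows. This is a minimal example that assumes the simplest case, in which the index is taken directly from address bits rather than produced by a hash. The function name, the 64-byte line size, and the 256 sets are hypothetical values chosen only for illustration.

```python
def split_address(addr, line_bytes=64, num_sets=256):
    """Split an address into (tag, index, offset) for an indexed cache.

    Assumes line_bytes and num_sets are powers of two and that the
    index is selected directly from address bits (no hashing).
    """
    offset_bits = line_bytes.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    offset = addr & (line_bytes - 1)                 # byte within the line
    index = (addr >> offset_bits) & (num_sets - 1)   # which set
    tag = addr >> (offset_bits + index_bits)         # stored with the data
    return tag, index, offset


tag_a, index_a, offset_a = split_address(0x12345678)
# An address exactly num_sets * line_bytes higher lands in the same set
# with a different tag, so the two lines compete for the same place.
tag_b, index_b, offset_b = split_address(0x12345678 + 256 * 64)
```

In a direct mapped cache the second address would simply displace the first. In a set associative cache both can be resident until the set is full, at which point a replacement algorithm must pick a victim.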
Typically, a memory is organized into words (for example, 32 bits per word) and a line is typically multiple words (for example, 16 words per line). Physical main memory is also typically divided into pages (also called blocks or segments), with many lines per page. In many modern computer memory architectures, a CPU produces virtual addresses that are translated by a combination of hardware and software to physical addresses, which access physical main memory. A group of virtual addresses may be dynamically assigned to each page. Virtual memory (paging or segmentation) requires a data structure, sometimes called a page table, that translates the virtual address to the physical address. Typically, a page table entry (PTE) includes more than just an address. A PTE may include information regarding write protection, use authorization, and many other status bits and attribute bits useful to the operating system. To reduce address translation time, computers commonly use a specialized associative cache dedicated to address translation, commonly called a Translation Look-aside Buffer (TLB). A TLB entry is a cache entry, where the tag is the high order bits of the page's virtual address and the data portion is a physical page address plus the additional status bits and attribute bits stored in a PTE.
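The translation path described above can be sketched as a lookup that first consults the TLB and falls back to the page table on a TLB miss. This is a simplified model under stated assumptions: 4 KiB pages, a TLB and page table represented as dictionaries keyed by virtual page number, and hypothetical field names ("frame", "writable") for the entry contents. The TLB in this sketch is unbounded, so no replacement takes place.

```python
PAGE_BITS = 12  # assumed 4 KiB pages


def translate(vaddr, tlb, page_table):
    """Return (physical_address, tlb_hit).

    tlb and page_table both map a virtual page number to a PTE-like
    dict holding a physical frame number plus attribute bits.
    """
    vpn = vaddr >> PAGE_BITS                    # tag: high-order bits
    offset = vaddr & ((1 << PAGE_BITS) - 1)     # unchanged by translation
    hit = vpn in tlb
    if not hit:
        tlb[vpn] = page_table[vpn]              # refill; KeyError if unmapped
    pte = tlb[vpn]
    return (pte["frame"] << PAGE_BITS) | offset, hit


page_table = {0x12: {"frame": 0x7, "writable": False}}
tlb = {}
first = translate(0x12ABC, tlb, page_table)    # TLB miss, then refill
second = translate(0x12ABC, tlb, page_table)   # TLB hit
```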
In the event of a cache miss, typically one line in a cache is replaced by the newly requested line. In the case of a direct mapped cache, a new line replaces a line at one fixed place. In the case of fully associative caches, a replacement algorithm is needed to decide which line in the cache is to be replaced. In the case of set associative caches, a replacement algorithm is needed to decide which line in a set is replaced. The algorithm for deciding which lines should be replaced in a fully associative or set associative cache is typically based on run-time historical data, such as which line is least-recently-used. Alternatively, a replacement algorithm may be based on historical data regarding least-frequently-used. Still other alternatives include first-in first-out, and pseudo-random replacement. Finally, as discussed immediately below, it may be useful to have a replacement algorithm which stores certain lines once and then locks them in place (never replaces them).
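A least-recently-used policy for one set of a set associative cache can be sketched as follows. The class name and interface are hypothetical. The sketch tracks only tags, not data, and is intended as an illustration of the policy rather than a hardware design.

```python
from collections import OrderedDict


class LRUSet:
    """One set of a set associative cache with LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return (hit, evicted_tag)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # now the most recently used
            return True, None
        evicted = None
        if len(self.lines) >= self.ways:
            evicted, _ = self.lines.popitem(last=False)  # drop the LRU line
        self.lines[tag] = None
        return False, evicted


s = LRUSet(ways=2)
results = [s.access(t) for t in ["A", "B", "A", "C", "B"]]
```

In this trace the second reference to A makes B the least recently used line, so C evicts B, and the later reference to B then evicts A.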
In some computer systems, some applications may have higher priority than others, so that improving the cache hit rate or guaranteeing cache hits may be more important for some applications than for others. Likewise, certain data may be more important or more critical than other data. In particular, guaranteed consistent response time for some applications may be critical, even if other applications run slower as a result. Some time-critical control applications, for example, cannot use general caching. If an event occurs and it is absolutely necessary for software to respond within a minimum or known constant (deterministic) time, general caching cannot be used because there is always a finite probability that critical code or data is not present in the cache. In some systems, a separate memory structure or buffer is provided that has the speed of a cache but is dedicated to a specific set of lines or pages, and eviction is not permitted. In other systems, sections of a cache or specific lines in a cache may be locked to hold critical portions of code in the cache. For example, Intel Pentium Pro processors have a Page-Global-Enable (PGE) flag in page-table-entries that provides a mechanism to prevent frequently used pages from being flushed from a TLB. Cyrix MediaGX processors provide locked sections of a cache for critical graphics data and emulation routines, with an extended instruction set for transferring data in and out of locked sections.
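Locking specific lines in a cache can be sketched as a replacement policy that skips locked lines when choosing a victim. The following is a hypothetical model, not a description of the Intel or Cyrix mechanisms mentioned above. It assumes least-recently-used ordering among unlocked lines and assumes that a new line is simply not cached when every way of the set is locked.

```python
from collections import OrderedDict


class LockableLRUSet:
    """One cache set in which locked lines are never chosen for eviction."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> locked flag, LRU first

    def access(self, tag, lock=False):
        """Return (hit, evicted_tag). A line accessed with lock=True stays locked."""
        if tag in self.lines:
            self.lines[tag] = self.lines[tag] or lock
            self.lines.move_to_end(tag)
            return True, None
        evicted = None
        if len(self.lines) >= self.ways:
            # Victim: the least recently used line that is not locked.
            evicted = next(
                (t for t, locked in self.lines.items() if not locked), None
            )
            if evicted is None:
                return False, None  # all ways locked: bypass the cache
            del self.lines[evicted]
        self.lines[tag] = lock
        return False, evicted


s = LockableLRUSet(ways=2)
results = [
    s.access("crit", lock=True),
    s.access("a"),
    s.access("b"),
    s.access("c"),
    s.access("crit"),
]
```

In this trace the locked line is the least recently used line throughout, yet it is never evicted. The cost is that the unlocked lines share one fewer way.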
Fully associative caches and set associative caches with a least-recently-used replacement algorithm work well for lines having temporal locality. However, data or instruction streams having high spatial locality and low temporal locality can completely flush an associative cache with a least-recently-used replacement algorithm.
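The flushing effect can be reproduced with a small simulation. The sketch below models a fully associative cache with least-recently-used replacement. The trace, the line names, and the capacity of four lines are hypothetical values chosen to make the effect visible.

```python
from collections import OrderedDict


def run_lru(trace, capacity):
    """Simulate a fully associative LRU cache; return one hit flag per access."""
    cache = OrderedDict()  # line -> None, least recently used first
    hits = []
    for line in trace:
        if line in cache:
            cache.move_to_end(line)
            hits.append(True)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the LRU line
            cache[line] = None
            hits.append(False)
    return hits


hot = ["h0", "h1", "h2", "h3"]          # lines with temporal locality
stream = [f"s{i}" for i in range(4)]    # sequential lines, touched once each
trace = hot + hot + stream + hot
hits = run_lru(trace, capacity=4)
```

The second pass over the hot lines hits on every access. After the four streaming lines pass through, however, every hot line has been evicted, and the final pass misses on all four.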
One approach to improving the cache hit ratio for data streams having mixed locality is described in G. Kurpanek et al., "PA7200: A PA-RISC Processor with Integrated High Performance MP Bus Interface," COMPCON Digest of Papers, February 1994, pp. 375-382. In Kurpanek et al., lines requested from memory are first loaded into an auxiliary fully associative cache, called an assist cache, in a first-in first-out order. A "spatial-locality" hint can be specified in load and store instructions to indicate that data exhibits spatial locality but not temporal locality. When a data line in the assist cache is evicted, if the line carries the spatial-locality hint, the line is flushed back to main memory and is not moved to a main cache. Lines are promoted to a main cache only if the spatial-locality hint is not present.
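The flow just described can be modeled roughly as follows. This is a simplified sketch based only on the summary above and is not the PA7200 implementation. The class and method names are hypothetical, the main cache is represented by an unbounded set, and flushing a line to main memory is represented by discarding it.

```python
from collections import deque


class AssistCacheModel:
    """FIFO assist cache in front of a main cache, with a spatial-locality hint."""

    def __init__(self, assist_entries):
        self.assist_entries = assist_entries
        self.assist = deque()  # (line, spatial_only) pairs in FIFO order
        self.main = set()      # stand-in for the main cache (unbounded here)

    def load(self, line, spatial_only=False):
        """Return where the line was found: 'main', 'assist', or 'miss'."""
        if line in self.main:
            return "main"
        if any(resident == line for resident, _ in self.assist):
            return "assist"
        if len(self.assist) >= self.assist_entries:
            old_line, old_hint = self.assist.popleft()  # first in, first out
            if not old_hint:
                self.main.add(old_line)  # promote lines without the hint
            # Lines with the hint are dropped here; hardware would flush
            # them back to main memory instead of moving them to the main cache.
        self.assist.append((line, spatial_only))
        return "miss"


m = AssistCacheModel(assist_entries=2)
m.load("a")
m.load("s1", spatial_only=True)
m.load("b")   # evicts "a" from the assist cache; "a" is promoted
m.load("c")   # evicts "s1"; the hint keeps it out of the main cache
```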
An alternative approach to improving the cache hit ratio for data streams having mixed locality is described in J. A. Rivers et al., "Reducing Conflicts in Direct-Mapped Caches With A Temporality-Based Design," Proceedings of the 1996 International Conference on Parallel Processing, Vol. 1, pp. 154-163. Rivers et al. also provide a separate auxiliary cache, called a Non-Temporal (NT) buffer. Blocks requested from memory are first placed into a main cache. Each block in the main cache is then monitored during its lifetime in the main cache to see if any part of the block is referenced again. If no word within a block is reused, the block is tagged as being Non-Temporal (NT). If a request for data results in a miss in the main cache and a hit in a secondary cache, and if the NT bit is set for the block being referenced, the block is placed in the NT buffer instead of the main cache. Therefore, blocks are monitored at run time, and blocks that exhibit no temporal locality are prevented from being placed in the main cache a second time.
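The run-time monitoring just described can be modeled roughly as follows. This sketch is based only on the summary above and is not the design of Rivers et al. It makes several simplifying assumptions: reuse is tracked per block rather than per word, the main cache evicts in first-in first-out order rather than by direct mapping, and the secondary cache is reduced to a dictionary that remembers the NT bit of each evicted block. All names are hypothetical.

```python
from collections import OrderedDict, deque


class NTModel:
    """Main cache plus NT buffer, with an NT bit learned at eviction time."""

    def __init__(self, main_blocks, nt_blocks):
        self.main_blocks = main_blocks
        self.main = OrderedDict()          # block -> reused flag, oldest first
        self.nt = deque(maxlen=nt_blocks)  # NT buffer; oldest entry dropped
        self.nt_bit = {}                   # stand-in for the secondary cache

    def access(self, block):
        """Return 'main', 'nt', or 'miss' for this reference."""
        if block in self.main:
            self.main[block] = True        # referenced again while resident
            return "main"
        if block in self.nt:
            return "nt"
        if self.nt_bit.get(block, False):
            self.nt.append(block)          # tagged NT: bypass the main cache
            return "miss"
        if len(self.main) >= self.main_blocks:
            victim, reused = self.main.popitem(last=False)
            self.nt_bit[victim] = not reused  # tag the block on its way out
        self.main[block] = False
        return "miss"


m = NTModel(main_blocks=2, nt_blocks=2)
results = [m.access(b) for b in ["x", "y", "y", "z", "x", "x"]]
```

In this trace block x is never referenced again during its first stay in the main cache, so it is tagged NT when z evicts it. Its next miss places it in the NT buffer, where the following reference finds it.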
Both Kurpanek et al. and Rivers et al. reduce the probability that a line having low temporal locality will evict a line having a higher probability of reuse from a main cache. However, each requires an additional hardware cache structure. There is a need for further improvements in cache performance for data streams having mixed locality. There is also a need for further improvements in providing partial cache locking. In particular, there is a need for a cache having different replacement algorithms for different classes of items being cached.