The present invention relates in general to processor cache systems, and in particular to a shared cache that uses a client-specific replacement policy.
Most computer systems in use today include a processor and a memory device. The processor executes operations, and the memory device stores information needed by the processor, including instructions identifying the operations to execute, data to be operated on, and data resulting from operations. The instructions generally include memory access instructions for reading data from and writing data to the memory device.
Frequently, the memory is managed using virtual addressing, which enables shared memory management to be separated from program design. Systems that use virtual addressing generally include a page table that provides mapping information usable to translate virtual addresses (which are used in program instructions) to physical addresses (which designate specific locations in a memory device) during execution of instructions. The page table is usually stored in system memory at a physical address known to the processor. During execution of a memory access instruction, the processor first accesses the page table to obtain the mapping information, then translates the virtual address to a physical address and accesses the memory again using the physical address.
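By way of illustration only, the translation step described above might be sketched as follows. The 4096-byte page size, the single-level dictionary standing in for the page table, and the names `PAGE_SIZE`, `page_table`, and `translate` are assumptions made for this sketch and are not taken from any particular system.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

# Hypothetical page table: virtual page number -> physical frame number.
page_table = {0: 7, 1: 3, 2: 12}

def translate(virtual_address):
    """Translate a virtual address to a physical address via the page table."""
    # Split the address into a virtual page number and an offset in the page.
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    # First memory access: read the mapping information from the page table.
    frame = page_table[vpn]
    # The second memory access would use the physical address returned here.
    return frame * PAGE_SIZE + offset

physical = translate(1 * PAGE_SIZE + 0x10)
```

In this sketch every translation reads the page table, which is the extra memory access that a translation cache is meant to avoid.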
To reduce the average latency associated with memory instructions, the processor typically includes a translation lookaside buffer (TLB). The TLB includes a cache of previously retrieved mapping information from the page table. The cache contains a number of entries, each representing a mapping from virtual address space to physical address space. Typically, each cache entry includes the virtual address (or a portion thereof) as a tag associated with the corresponding mapping information, which might be a physical address or other information from which a physical address can be determined. When a translation of a virtual address is requested, the TLB performs an associative lookup based on the virtual address to determine whether the mapping information is present in the cache. If the information is present (a “cache hit”), the TLB uses the cached information to perform the translation without accessing the page table. If the information is not present (a “cache miss”), the TLB accesses the page table to retrieve the mapping information and adds the retrieved information to the cache for possible reuse.
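The hit and miss handling described above might be sketched, under simplifying assumptions, as follows. The class name `SimpleTLB`, the dictionary used in place of an associative lookup, and the hypothetical page table contents are illustrative only; capacity limits and replacement are deliberately left out of this sketch.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

# Hypothetical page table: virtual page number -> physical frame number.
page_table = {vpn: vpn + 100 for vpn in range(16)}

class SimpleTLB:
    """Unbounded translation cache keyed by virtual page number (the tag)."""

    def __init__(self):
        self.entries = {}  # tag (virtual page number) -> frame number
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        if vpn in self.entries:
            # Cache hit: the page table is not accessed.
            self.hits += 1
        else:
            # Cache miss: read the page table and cache the mapping for reuse.
            self.misses += 1
            self.entries[vpn] = page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset

tlb = SimpleTLB()
first = tlb.translate(5 * PAGE_SIZE + 8)    # miss: mapping fetched and cached
second = tlb.translate(5 * PAGE_SIZE + 64)  # hit: same page, cached mapping
```

The two requests fall in the same page, so only the first one reaches the page table.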
To provide high speed, the cache size is usually limited to a relatively small number of entries, and once the cache is full, a new entry can be stored only by evicting and replacing a previous entry. The choice of which entry to replace is generally made based on which entry is least likely to be used again. Commonly, entries are selected for replacement based on recency of use, with the least recently used entry being selected for replacement. To reliably identify the least recently used (LRU) entry, each cache entry typically includes (or is associated with) LRU data representing how recently that entry was accessed, relative to the other cache entries. Counters, bit masks, or the like are commonly used. Each time a cache hit occurs, the LRU data associated with various entries is updated to indicate that the entry that hit is now the most recently used. Each time a cache miss occurs, the LRU data is used to identify the least recently used entry, which is replaced with the newly retrieved mapping information, and the LRU data is updated to indicate that the new entry is now the most recently used.
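A minimal software sketch of least-recently-used replacement is given below. Hardware implementations commonly use counters or bit masks as noted above; this sketch instead assumes an ordered dictionary as a stand-in for the LRU data, and the names `LRUCache`, `lookup`, and `fetch` are hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Ordering serves as the LRU data: oldest entry first, newest last.
        self.entries = OrderedDict()

    def lookup(self, tag, fetch):
        """Return (value, hit). On a miss, fetch(tag) supplies the value."""
        if tag in self.entries:
            # Hit: mark this entry as the most recently used.
            self.entries.move_to_end(tag)
            return self.entries[tag], True
        if len(self.entries) >= self.capacity:
            # Cache full: evict the least recently used entry.
            self.entries.popitem(last=False)
        # The new entry is inserted as the most recently used.
        self.entries[tag] = fetch(tag)
        return self.entries[tag], False

def fetch(vpn):
    # Hypothetical page-table read.
    return vpn + 100

cache = LRUCache(capacity=2)
cache.lookup(1, fetch)  # miss
cache.lookup(2, fetch)  # miss, cache now full
cache.lookup(1, fetch)  # hit, entry 1 becomes most recently used
cache.lookup(3, fetch)  # miss, evicts entry 2 (least recently used)
```

After this sequence the cache holds entries 1 and 3; entry 2 was evicted because the hit on entry 1 made entry 2 the least recently used.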
Cache systems can be as large as desired and can include multiple levels. For instance, many TLB systems use a two-level cache, with a relatively small and very fast Level 1 (L1) cache backed by a larger and somewhat slower Level 2 (L2) cache. In the event of a cache miss at L1, the L2 cache is checked; the page table is accessed only if a miss occurs at L2 as well. The L1 and L2 caches each operate using separate LRU data.
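The two-level lookup order described above might be sketched as follows. The capacities, the choice to fill the L1 cache on every L1 miss, and the names `Level`, `translate_vpn`, and `page_table_reads` are assumptions made for illustration; each level keeps its own recency ordering.

```python
from collections import OrderedDict

# Hypothetical page table: virtual page number -> physical frame number.
page_table = {vpn: vpn + 100 for vpn in range(64)}
page_table_reads = 0

class Level:
    """One cache level with its own, separate LRU ordering."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, tag):
        if tag not in self.entries:
            return None
        self.entries.move_to_end(tag)  # refresh recency on a hit
        return self.entries[tag]

    def put(self, tag, value):
        if tag not in self.entries and len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[tag] = value
        self.entries.move_to_end(tag)

l1, l2 = Level(2), Level(8)  # small L1 backed by a larger L2

def translate_vpn(vpn):
    """Return (frame, source) where source names the level that answered."""
    global page_table_reads
    frame = l1.get(vpn)
    if frame is not None:
        return frame, "L1"
    frame = l2.get(vpn)
    if frame is None:
        # Miss at both levels: only now is the page table accessed.
        page_table_reads += 1
        frame = page_table[vpn]
        l2.put(vpn, frame)
        source = "page table"
    else:
        source = "L2"
    l1.put(vpn, frame)  # assumed policy: fill L1 after any L1 miss
    return frame, source

results = [translate_vpn(v)[1] for v in (1, 2, 3, 1, 3)]
```

In this sequence the fourth request misses in L1, because page 1 was evicted by page 3, but is satisfied by L2 without a page table access.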
A TLB improves performance to the extent that it reduces the need to access the page table. The improvement is generally a reflection of the “hit rate,” i.e., the fraction of translation requests that result in a cache hit. The hit rate will tend to be higher when the successive virtual addresses being accessed are localized, i.e., near each other in address space, so that one page table mapping can be retrieved once, cached, and reused from the cache to satisfy multiple translation requests. The extent to which successive requests are localized depends largely on the nature of the instruction stream and can vary considerably from one application or process to another.
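The dependence of hit rate on localization can be illustrated with a small simulation. The 4096-byte page size, the four-entry LRU cache, and the two synthetic address streams below are assumptions chosen for this sketch rather than measurements of any real workload.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes

def hit_rate(addresses, capacity=4):
    """Fraction of translation requests served from a small LRU cache."""
    cache = OrderedDict()
    hits = 0
    for address in addresses:
        vpn = address // PAGE_SIZE
        if vpn in cache:
            hits += 1
            cache.move_to_end(vpn)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)
            cache[vpn] = True
    return hits / len(addresses)

# Localized stream: 64-byte steps, so 64 consecutive requests share a page.
sequential = [i * 64 for i in range(1024)]
# Non-localized stream: every request touches a page not seen before.
scattered = [i * PAGE_SIZE for i in range(1024)]

sequential_rate = hit_rate(sequential)  # 1008 hits out of 1024 requests
scattered_rate = hit_rate(scattered)    # no request ever hits
```

The localized stream misses only once per page (16 misses in 1024 requests), while the non-localized stream gains nothing from the cache.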
TLB performance can be significantly undermined in some processor architectures that support execution of multiple instruction streams in parallel, e.g., using multiple hardware cores or multiple threads that share a core. Each instruction stream typically includes its own stream of memory access requests that has no expected correlation to activity in other streams. For instance, in a graphics processor, a stream of requests for texture data might be generated in parallel with a stream of requests for pixels to be displayed.
If the different parallel processes use the same TLB for address translations, they tend to compete with each other for space in the cache. For instance, virtual addressing is used in some graphics processors that access system memory. Such processors typically run a display (scanout) process that accesses pixel data in a highly localized manner, e.g., sequentially in the virtual address space. But such processors also run other processes whose memory access patterns exhibit far less localization, such as texture processes. Under some conditions, mapping information retrieved in response to requests from the texture process can evict cache entries that are still being used to respond to requests from the display process, which increases the cache miss rate for the display requests and also causes the same mapping information to be repeatedly retrieved from the page table rather than reused from the cache. Such thrashing degrades system performance and is generally undesirable.
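The competition described above can be illustrated with a deliberately adversarial simulation. The four-entry shared LRU cache, the ratio of four texture requests per display request, and the synthetic page numbers are assumptions chosen to provoke the effect; they do not represent any particular processor.

```python
from collections import OrderedDict

def run(requests, capacity=4):
    """Replay (client, vpn) requests through one shared LRU cache.

    Returns a dict mapping each client to [hits, total_requests].
    """
    cache = OrderedDict()
    stats = {}
    for client, vpn in requests:
        counts = stats.setdefault(client, [0, 0])
        counts[1] += 1
        if vpn in cache:
            counts[0] += 1
            cache.move_to_end(vpn)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[vpn] = client
    return stats

# Display client: highly localized, each of 16 pages requested 8 times in a row.
display = [("display", page) for page in range(16) for _ in range(8)]

# Shared stream: four texture requests, each to a new page, follow every
# display request.
shared = []
texture_page = 1000
for request in display:
    shared.append(request)
    for _ in range(4):
        shared.append(("texture", texture_page))
        texture_page += 1

alone = run(display)["display"]     # [112, 128]: one miss per page
contended = run(shared)["display"]  # [0, 128]: every display request misses
```

With the cache to itself, the display client misses once per page. When the texture requests are interleaved, each display mapping is evicted before the display client can reuse it, so the same mapping is fetched from the page table eight times.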
In some parallel processing systems, thrashing between processes is avoided by physically or logically dividing the cache and allocating different caches, or different portions of the cache, to different clients of the TLB, where “client” refers to a process, thread, execution core or the like whose memory requests are serviced by the TLB. For instance, in a graphics processor the texture client might be allocated its own cache, while display and other well-behaved clients are allocated a different cache. This arrangement prevents the texture client from evicting cache entries that are still actively being used by the other clients. However, when any client requests a mapping that is not stored in the cache allocated to that client, a cache miss occurs even if the mapping happens to be stored in a cache allocated to another client. This decreases the overall cache hit rate. In addition, as a result of such a miss, a duplicate of the mapping that is already in the other client's cache is added to the requesting client's cache, making inefficient use of the limited cache capacity and further decreasing the overall hit rate.
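The drawback of per-client partitioning noted above might be sketched as follows. The class name `PartitionedTLB`, the client names, the capacities, and the hypothetical page table are assumptions made for illustration.

```python
from collections import OrderedDict

# Hypothetical page table: virtual page number -> physical frame number.
page_table = {vpn: vpn + 100 for vpn in range(32)}

class PartitionedTLB:
    """Translation cache with a separate LRU partition for each client."""

    def __init__(self, clients, capacity_per_client):
        self.capacity = capacity_per_client
        self.caches = {client: OrderedDict() for client in clients}
        self.page_table_reads = 0

    def translate_vpn(self, client, vpn):
        """Return (frame, hit), searching only the requester's partition."""
        cache = self.caches[client]
        if vpn in cache:
            cache.move_to_end(vpn)
            return cache[vpn], True
        # Miss, even if another client's partition already holds the mapping.
        self.page_table_reads += 1
        if len(cache) >= self.capacity:
            cache.popitem(last=False)
        cache[vpn] = page_table[vpn]
        return cache[vpn], False

tlb = PartitionedTLB(["display", "texture"], capacity_per_client=2)
_, display_hit = tlb.translate_vpn("display", 5)
_, texture_hit = tlb.translate_vpn("texture", 5)

# Clients whose partitions now hold a copy of the mapping for page 5.
duplicated = [client for client, cache in tlb.caches.items() if 5 in cache]
```

Both requests miss and both read the page table, and the mapping for page 5 ends up stored twice, once in each partition.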
It would therefore be desirable to provide a cache system capable of more efficiently handling requests from multiple clients.