Today's processors are often equipped with caches that can store copies of the data and instructions stored in some high-capacity memory. A popular example today of such high-capacity memory is dynamic random access memory (DRAM). Herein, the term “memory” will be used to collectively refer to all existing and future memory implementations. Cache memories, or “caches” for short, are typically built from much smaller and much faster memory than other memory implementations and can subsequently only hold copies of a fraction of the data stored in main memories or secondary storage devices at any given time. Today, caches are often implemented using SRAM and large caches can be implemented using DRAM. The caches described herein can be implemented using any existing and future memory technology.
Often, the memory system of a computer system includes a range of caches, with larger and slower caches (here referred to as higher-level caches) close to the main memory and smaller and faster caches (here referred to as lower-level caches) closer to the processor. This configuration is typically referred to as a cache hierarchy, memory hierarchy or memory system. Each level in the cache hierarchy is referred to as a cache level. FIG. 1 depicts a computer system, where each CPU 101 is assigned its own first-level private cache 102 (L1 cache). A private cache is a cache where space for a data unit can only be in response to an activity of a CPU local to the cache (or a CPU in the same node as the cache, where the cache is private to that node). In this example, the second-level cache 103 is shared by all the CPUs 101 and may contain data and instructions that are accessed by all of the CPUs 101. Cache 103 is a global cache, since space for a data unit can be made in response to activities of any CPU 101. Furthermore, cache 103 is a shared cache, since each data unit can only exist in one shared copy (while each private cache can have its own replicated copy). Memory 105 also stores instructions and data that are accessed by all CPUs 101. The (main) memory 105 is typically accessed using a physical address, or PA for short, while the addresses generated by a CPU 101 are typically virtual addresses. Computer systems with several CPUs and several private caches, such as the system shown in FIG. 1, need efficient mechanisms for finding a requested data unit in one of the caches as well as for keeping the multiple copies of a datum stored in different nodes coherent.
Such computer systems will also typically include one or more secondary storage devices in addition to main memory and cache memory. These secondary storage devices can include one or more of hard drives, optical drives, flash drives or the like, which are collectively represented here by disk 104. The disk or secondary storage 104 may store several orders of magnitude more data than the memory 105, but is not directly accessible using a physical address. If a CPU 101 wants to access data stored on the disk 104, a virtual memory system (not shown in FIG. 1) moves a chunk of data, typically referred to as a page, from the disk 104 to the memory 105 and creates a translation from virtual addresses corresponding to that page to physical addresses. A special kind of translation cache (not shown in FIG. 1), typically called a “translation look-aside buffer” or TLB for short, caches the translation mappings from virtual pages to physical pages. The virtual memory system can be viewed as a caching system, which stores a portion of the data located in the disk 104 in the memory 105.
FIG. 2 shows a more detailed view of the cache organization from FIG. 1, where the CPU 204 is shown to contain several cache-like structures, such as the Instruction TLB (ITLB, 201) and Data TLB (DTLDB, 205), and a CPU core 202. The level 1 cache 102 is shown as including an instruction cache 200 and a data cache 206, interposed between the CPU 204 and the level 2 (and potentially higher level) caches 207. FIG. 2 also shows a high-capacity memory 208 connected to the cache hierarchy.
When a CPU 204 requests data that is present in the cache, referred to as a cache hit, that request can be serviced much faster than an access to data that is not present in the cache, referred to as a cache miss. Typically, an application running on the CPU 204 that experiences fewer cache misses will execute more quickly and consume less energy than if the same application which suffers from more cache misses, assuming that the two versions of the application otherwise have similar properties. Therefore, considerable effort has been invested in techniques for avoiding cache misses, and also to design implementations that can make cache hits and cache misses efficient with respect to both their access latency and the energy that these memory transactions consume.
In order to understand how to develop solutions to this cache hit/miss problem, some understanding of cache organization and terminology will be useful. Typically, data are installed into caches in fixed chunks that are larger than the word size of a processor, known as cachelines. Common cacheline sizes today are, for example, 32, 64 and 128 bytes, but as will be appreciated by those skilled in the art both larger and smaller cacheline sizes exist for various cache implementations. The cacheline size may also be variable for some cache implementations.
A common way to organize the data placement in memory 208 is in a manner that each data word is statically mapped to reside in one specific cacheline. Each cache typically has an index function that identifies a portion of the cache where each cacheline can reside, known as a set. The set may contain space to hold one or more cachelines at the same time. The number of cachelines which the set can hold is referred to as its associativity. Often, the associativity for all of the sets in a cache is the same. Such caches are often referred to as set-associative caches. The associativity may also vary between the sets. Each of the spaces in a set that can hold a cacheline is referred to as a way.
In order to determine the identity of each cacheline stored in each set, cachelines in a cache each have some identifier associated with them. One common example of such an identifier is an address tag. When looking for a specific cacheline in a cache, its address may be used to determine the set of the cache where it may reside. The address tags of the cachelines of the corresponding set are compared to a tag portion of the address used to identify a cacheline (see, e.g., TAG and ATAG depicted in FIG. 3 and described below) to determine if the desired cacheline resides in the cache, and if so, in which way it resides, i.e., in which of the spaces in a set that can hold a cacheline it resides.
Often, each cache has built-in strategies to determine which cacheline to keep in the set and which cacheline to evict (also referred to as replace) to make space for a new cacheline being brought into the set, referred to as its replacement policy. The replaced cacheline is known as a victim or a replaced cacheline. Popular replacement policies used in conjunction with cache memories include, but are not limited to, least-recently used (LRU), pseudo-LRU and random replacement policies.
So-called inclusive cache hierarchies require that a copy of a block of data (for example a cacheline) present in one cache level, for example a block of data in the L1 cache, also exists in the higher cache levels (here, higher refers to cache levels with higher numbers than 1), for example in the L2 and L3 cache. Exclusive cache hierarchies only have one copy of the block of data (for example a cacheline) existing in the entire cache hierarchy, while non-inclusive hierarchies can have a mixture of both strategies. In exclusive and non-inclusive cache hierarchies, it is common that a cacheline is installed in the next higher cache level upon eviction from a given cache level.
With this background on caches and cachelines in mind, this discussion now proceeds to illustrate one conventional technique for identifying specific cache locations within a cache hierarchy like that shown in FIGS. 1 and 2. FIG. 3 shows an example of the organization of a computer system node containing a CPU 301, a DTLB 303, an L1 CACHE 307 and an L2 CACHE 311. The CPU 301 generates a memory request containing a virtual address 302 (VADDR), which is used to perform an associative lookup in the DTLB 303. The virtual address 302 is in this example divided into three parts: P-OFFSET (page offset, for example consisting of the low-level bits), TAG and INDEX. The INDEX portion of the virtual address 302 is used to identify the set within the DTLB 303 where the address translation may be stored.
The illustrated DTLB 303 used in this example has a two-way associative organization, showing two entries of the identified set. Each entry consists of an address tag (ATAG) 304, and physical page frame (PPF) 305. The TAG portion of the virtual address 302 is compared with the ATAGs 304 of each entry of the identified set. The logic 318 determines if there is match for any of the entries, and, if so, controls the MUX 306 to select the corresponding PPF 305. A physical address PADDR 316 is composed by concatenating the PPF 305 selected by MUX 306 with the P-OFFSET portion of the virtual address 302. Alternatively, if neither of the entries of the identified set match the TAG portion of the virtual address 302, then a TLB fill operation is performed and the needed translation entry is brought into the DTLB 303.
The PADDR 316 is used to perform a lookup in the L1 CACHE 307. More specifically, an INDEX portion of the PADDR 316 is used to identify the set where the cacheline containing the requested data may be stored. (It should be noted that this L1 INDEX may contain different bits than the DTLB INDEX). The depicted L1 CACHE 307 has a 2-way associative organization with two entries in the identified set. Each entry consists of an address tag (ATAG) 308, and data 309 of the corresponding cacheline. A TAG portion of the PADDR 316 is compared with the ATAGs of each entry of the identified set. The logic 319 determines if there is match for any of the entries, and controls the MUX 310 to select the corresponding data. If neither of the entries matches (depicted by L1 MISS 317) a lookup in the L2 CACHE 311 is needed. While the division of PADDR 316 into INDEX and TAG portions for the L2 CACHE lookup may differ from the division done for the L1 CACHE lookup, the remaining steps for the L2 CACHE lookup are typically similar to those performed for the lookup in the L1 CACHE. If a miss in the L2 CACHE is determined, new lookups in higher-level caches or accesses to memory may be needed. It can be noted that if the INDEX portion of PADDR 316 used by the L1 cache is entirely composed by P_OFFSET bits, the access to the L1 cache 307 can be started before the PPF information coming from the MUX 306 is available. This is often referred to as a virtually indexed, physically tagged cache (VIPT).
It would be appreciated by someone skilled in the art that several “nodes”, each with a memory hierarchy similar to the one depicted in FIG. 3, can be connected together to form a coherent multiprocessor system. Another example of a multiprocessor is shown in FIG. 1, where each CPU has its own private L1 cache. This further complicates the task of locating where a requested data resides since it is no longer sufficient to always search for the data in the next higher cache level. Mechanisms for searching caches of the other nodes as well as for keeping the multiple copies of a datum stored in different nodes coherent are needed.
To summarize the discussion associated with FIG. 3, it will be appreciated by those skilled in the art that contemporary implementations of a cache hierarchy may require many associative lookups to be performed in TLBs as well as at different cache levels before the location of a requested cacheline can be determined. Furthermore, each translation entry and cached cacheline needs to be accompanied by an address tag and a plurality of these tags need to be compared with a portion of an address at each of these levels. Overall, this implementation comes at an enormous cost in terms of the dynamic energy used for each associative lookup. It also requires substantial overhead in static energy and memory capacity to store the extra tag information that is needed to perform the comparisons. Furthermore, multiprocessors containing several CPUs or several nodes, each with their own private cache hierarchies require additional costly search mechanisms for locating data residing in the private caches as well as the need for costly coherence activities to keep replicated copies of the same date in the private cache coherent.
Accordingly, it would be desirable to provide systems and methods that avoid the afore-described problems and drawbacks associated with cache lookups for data.