The typical computer has a random access memory hierarchy including one or more levels of on-processor cache memory, a main memory (located off of the processor chip), and a mass storage device (e.g., a hard disk drive). Typically, accessing the first level of cache memory (L1 cache) is fastest (i.e., has the lowest latency) and accessing the mass storage device is slowest. The latencies associated with accessing intermediate levels of the memory hierarchy fall between these two extremes. In addition to increasing in latency, the various levels of the memory hierarchy typically increase in size from the highest level of the memory hierarchy to the lowest level.
Modern high performance processors (for example, the Intel Itanium® family of processors and other EPIC (Explicitly Parallel Instruction Computing) processors) have multiple levels of on-chip cache memory. For example, the Itanium® processor includes three levels of on-chip cache. Because the operating frequency of future processors is extremely high, the first level of the cache (i.e., the L1 cache, referred to herein as “μ cache”) is typically small in storage size in order to support a one cycle load from the memory system to a register of a high speed processor. For example, a μ cache typically has the capacity to store 1 K (kilobyte) or less of data. The L1 cache may comprise a single μ cache or a set of parallel μ caches (e.g., a plurality of μ caches of varying sizes and latencies).
Proper management of the small and fast μ caches is important to the overall performance of the host processor they serve. In particular, in many instances a significant number of load instructions need to immediately retrieve data from the memory system to advance program execution without suffering a pipeline stall. Such instructions benefit if the data they require is stored in one of the μ cache(s).
In the typical case, cache memory has an inclusive nature. Thus, when data is retrieved from a given level of the memory system (e.g., the set of parallel μ caches), it is written into all lower levels of the cache (e.g., the level 2 (L2) cache, the level 3 (L3) cache, etc.). This practice maximizes the likelihood that data needed for a later instruction is present in the highest levels of the cache, thereby reducing the number of accesses to slower memory resources and the number of cache misses (i.e., failed attempts to retrieve data from a cache level that does not contain the desired data).
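The inclusive behavior described above can be illustrated with a toy model. The following is a minimal sketch, not a description of any actual processor: the class name `InclusiveHierarchy`, the least-recently-used replacement policy, the back-invalidation step, and the line capacities are all assumptions made for illustration. It models one reading of inclusion, namely that any line held in a given cache level is also held in every lower (larger, slower) level.

```python
from collections import OrderedDict


class InclusiveHierarchy:
    """Toy inclusive cache model (hypothetical, for illustration only).

    levels[0] is the smallest, fastest cache (e.g., a μ cache);
    later entries are progressively larger, slower levels (L2, L3, ...).
    """

    def __init__(self, capacities):
        # One LRU-ordered dict per cache level; capacities are in lines.
        self.capacities = list(capacities)
        self.levels = [OrderedDict() for _ in self.capacities]

    def _fill(self, level, addr):
        cache = self.levels[level]
        cache[addr] = True
        cache.move_to_end(addr)  # mark as most recently used
        if len(cache) > self.capacities[level]:
            victim, _ = cache.popitem(last=False)  # evict least recently used
            # Assumed policy: a line evicted from a lower level is also
            # removed from every higher level, so inclusion is preserved.
            for higher in range(level):
                self.levels[higher].pop(victim, None)

    def load(self, addr):
        """Return the index of the level that served the load.

        A return value equal to len(self.levels) means main memory.
        """
        served_by = len(self.levels)
        for i, cache in enumerate(self.levels):
            if addr in cache:
                served_by = i
                break
        # Install (or refresh) the line in the serving level and in every
        # higher level, starting from the lowest level involved.
        start = min(served_by, len(self.levels) - 1)
        for level in range(start, -1, -1):
            self._fill(level, addr)
        return served_by

    def is_inclusive(self):
        # Every line in a level must also be present in the next lower level.
        return all(
            addr in self.levels[i + 1]
            for i in range(len(self.levels) - 1)
            for addr in self.levels[i]
        )
```

In this sketch, a first load of an address is served by main memory and installs the line in every level; a repeated load is then served by the fastest level. Once the small top level evicts a line, a later load of that line is served by the next level down rather than by main memory, which is the benefit the inclusive arrangement is meant to provide.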