1. Technical Field
The present invention relates generally to data processing systems and specifically to prefetching data to a cache. Still more particularly, the present invention relates to an improved system and method of identifying and prefetching a cache block to a cache memory.
2. Description of the Related Art
A conventional symmetric multiprocessor (SMP) computer system, such as a server computer system, includes multiple processing units all coupled to a system interconnect, which typically comprises one or more address, data and control buses. Coupled to the system interconnect is a system memory, which represents the lowest level of volatile memory in the multiprocessor computer system and generally is accessible for read and write access by all processing units. In order to reduce access latency to instructions and data residing in the system memory, each processing unit is typically further supported by a respective multi-level cache hierarchy, the lower level(s) of which may be shared by one or more processor cores.
Because multiple processor cores may request write access to the same cache line of data and because modified cache lines are not immediately synchronized with system memory, the cache hierarchies of multiprocessor computer systems typically implement a cache coherency protocol to ensure at least a minimum level of coherence among the various processor cores' “views” of the contents of system memory. In particular, cache coherency requires, at a minimum, that after a processing unit accesses a copy of a memory block and subsequently accesses an updated copy of the memory block, the processing unit cannot again access the old copy of the memory block.
In some systems, the cache hierarchy includes multiple levels, with each lower level generally having a successively longer access latency. Thus, a level one (L1) cache generally has a lower access latency than a level two (L2) cache, which in turn has a lower access latency than a level three (L3) cache.
The L1 or upper-level cache is usually a private cache associated with a particular processor core in a multiprocessor (MP) system. Because of the low access latencies of L1 caches, a processor core first attempts to service memory access requests in its L1 cache. If the requested data is not present in the L1 cache or is not associated with a coherency state permitting the memory access request to be serviced without further communication, the processor core then transmits the memory access request to one or more lower-level caches (e.g., L2 or L3 caches) for the requested data.
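The lookup order described above can be illustrated with a minimal sketch. The class and function names, the cycle counts, and the two-state coherency model ("S" permits reads only, "M" permits reads and writes) are assumptions made for illustration and are not taken from any particular system.

```python
from typing import Dict, List, Tuple

MEMORY_LATENCY = 200  # hypothetical system-memory latency, in cycles


class CacheLevel:
    """One level of a hypothetical cache hierarchy."""

    def __init__(self, name: str, latency_cycles: int) -> None:
        self.name = name
        self.latency_cycles = latency_cycles
        # block address -> coherency state ("S" = shared, "M" = modified)
        self.lines: Dict[int, str] = {}


def service_request(
    levels: List[CacheLevel], address: int, needs_write: bool
) -> Tuple[str, int]:
    """Try each level from upper to lower; return (servicing level, cycles)."""
    cycles = 0
    for level in levels:
        cycles += level.latency_cycles
        state = level.lines.get(address)
        if state is None:
            continue  # miss: forward the request to the next lower level
        if needs_write and state != "M":
            # Simplification: the state does not permit the write without
            # further communication, so the request is passed downward.
            continue
        return level.name, cycles
    return "memory", cycles + MEMORY_LATENCY


# Illustrative latencies only; each lower level is slower than the one above.
l1 = CacheLevel("L1", 4)
l2 = CacheLevel("L2", 12)
l3 = CacheLevel("L3", 40)
hierarchy = [l1, l2, l3]

l1.lines[0x100] = "S"
l2.lines[0x200] = "M"

print(service_request(hierarchy, 0x100, False))  # read hit in L1
print(service_request(hierarchy, 0x200, True))   # write hit in L2
print(service_request(hierarchy, 0x100, True))   # L1 state forbids the write
```

In this sketch, a write to a block held only in the shared state falls through every level, which is a deliberate simplification of the coherency communication a real protocol would perform.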
Typically, when a congruence class of an upper-level cache becomes full, cache lines are removed (“evicted”) and may be written to a lower-level cache or to system memory for storage. In some cases, a lower-level cache (e.g., an L3 cache) is configured as a “victim” cache, which conventionally means that the lower-level cache is entirely populated with cache lines evicted from one or more higher-level caches in the cache hierarchy rather than by memory blocks retrieved by an associated processor.
Data is typically managed in conventional victim caches using a least recently used (LRU) cast-out mechanism: as a data structure is prefetched into the cache, the oldest data blocks of a target set are cast out. Eventually, all data that previously resided in the set is replaced by the prefetched data. Because the use of the data structure is temporally limited, casting out the oldest data is not an optimal choice, since the newest prefetched data is not used again in the near future. Furthermore, when an access pattern exhibits low spatial locality, much of the memory bus bandwidth required to prefetch an entire cache block is wasted, as a portion of the data passed over the memory bus may never be used. When a system's performance is bus-bandwidth limited, the advantage of prefetching may be negated by the latency introduced by inefficient bus utilization.
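The two drawbacks noted above can be made concrete with a small sketch. It models a single congruence class under plain LRU replacement and computes the share of a transferred block that goes unused. The names `LRUSet` and `wasted_fraction`, the 4-way associativity, and the 128-byte block size are hypothetical values chosen for illustration.

```python
from collections import OrderedDict
from typing import Optional


class LRUSet:
    """One congruence class managed with LRU cast-out (illustrative only)."""

    def __init__(self, ways: int) -> None:
        self.ways = ways
        self.lines: "OrderedDict[str, bool]" = OrderedDict()  # oldest first

    def touch(self, tag: str) -> Optional[str]:
        """Access or install a line; return the cast-out tag, if any."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return None
        victim = None
        if len(self.lines) >= self.ways:
            victim, _ = self.lines.popitem(last=False)  # cast out the oldest
        self.lines[tag] = True
        return victim


def wasted_fraction(block_bytes: int, used_bytes: int) -> float:
    """Fraction of a prefetched block's bus traffic that is never used."""
    return (block_bytes - used_bytes) / block_bytes


# A 4-way set initially holds lines A-D that may still be useful.
target_set = LRUSet(ways=4)
for tag in ["A", "B", "C", "D"]:
    target_set.touch(tag)

# Prefetching a four-block structure casts out every prior resident.
cast_outs = [target_set.touch(f"P{i}") for i in range(4)]
print(cast_outs)               # the lines that were cast out, oldest first
print(list(target_set.lines))  # only prefetched lines remain

# If only 32 bytes of a 128-byte block are ever referenced:
print(wasted_fraction(128, 32))
```

Under these assumptions the prefetched structure displaces the entire prior contents of the set, and three quarters of each transferred block carries data that is never referenced.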