Main memory access speed has not increased as fast as the clock speeds of modern microprocessors. As a result, the latency of memory access in terms of processor clock cycles has increased over time. These long latencies result in performance degradation when a processor must access main memory. Computer architects have mitigated this problem through the introduction of one or more caches associated with each processor, such as a Level 2 cache, built using SRAM whose access latencies are much lower than the latencies of main memory. Caches are small images of a portion of main memory. These caches store the data most recently touched by a processor, and when a processor touches (reads/writes/updates) this data again, its performance is improved by avoiding main memory access stalls. Because cache access is much faster than main memory access, the fraction of accesses served from the cache should be maximized. However, cache is more expensive than main memory and consequently has less capacity than main memory. The problem becomes how best to maximize cache utilization while working within a small space.
Referring now in specific detail to the drawings, and particularly FIG. 1, there is illustrated a conventional cache implementation 100. Hardware caches, as they are conventionally built by various computer manufacturers, consist of: a data array 130 containing the data stored in the cache; a tag array 120 containing metadata that identifies the set of data currently residing in the data array, together with attributes for each data item (e.g., least recently used “LRU” state, coherence permissions); a decoder 140, which is used to map an address to the set of the cache that may contain that address; and a cache controller 110, which is a finite-state machine that controls the refilling of the cache in response to processor requests that miss in the cache.
Referring to FIG. 2 there is shown a flow chart of the typical flow of a single request performed by the processor, according to the known art. The process begins at step 210 with the processor initiating a request by driving a request address in conjunction with a request signal to the cache's decoder 140. In step 220 the cache decoder 140 determines the set number to which this address corresponds, and drives a signal to that set. In step 230 this signal activates the tag-match logic corresponding to that particular set, which compares the remaining bits of the processor's address to the current contents of that set.
In step 240, in the event of a cache miss, processing proceeds to step 270, where the cache hit signal is held at zero and a cache refill is initiated by the cache controller. If, however, there is a cache hit, then in step 250 the cache hit signal is asserted, and a signal is sent to the data array for that cache block. Next, in step 260, the data from the data array is read and sent to the requesting processor. In most cache implementations, the tag metadata contains the upper-order bits of each cache block's physical address.
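By way of illustration only, the request flow of FIG. 2 can be sketched in software as follows. This is a minimal model under stated assumptions, not a description of any particular hardware: the geometry (64-byte blocks, four sets, two ways), the names Cache, CacheLine, decode_set, address_tag, backing_byte and cache_access, the age-based LRU policy, and the synthetic backing store standing in for main memory are all hypothetical and do not appear in the drawings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical geometry, chosen only to keep the example small. */
enum { BLOCK_BITS = 6, SET_BITS = 2, NUM_SETS = 1 << SET_BITS, NUM_WAYS = 2 };
enum { BLOCK_SIZE = 1 << BLOCK_BITS };

typedef struct {
    bool     valid;
    uint64_t tag;              /* upper-order bits of the block address */
    unsigned lru_age;          /* larger value means less recently used */
    uint8_t  data[BLOCK_SIZE]; /* this line's slice of the data array */
} CacheLine;

typedef struct {
    CacheLine sets[NUM_SETS][NUM_WAYS];
} Cache;

static void cache_init(Cache *c) { memset(c, 0, sizeof *c); }

/* Step 220: the decoder maps an address to a set number. */
static unsigned decode_set(uint64_t addr) {
    return (unsigned)((addr >> BLOCK_BITS) & (NUM_SETS - 1));
}

/* The remaining upper-order bits form the tag. */
static uint64_t address_tag(uint64_t addr) {
    return addr >> (BLOCK_BITS + SET_BITS);
}

/* Assumed stand-in for main memory: each byte holds its low address bits. */
static uint8_t backing_byte(uint64_t addr) { return (uint8_t)(addr & 0xFF); }

/* Mark one way as most recently used and age the others. */
static void touch(CacheLine *set, int way) {
    for (int w = 0; w < NUM_WAYS; w++) set[w].lru_age++;
    set[way].lru_age = 0;
}

/* Returns the hit signal. On a hit, *out receives the requested byte.
 * On a miss, the block is refilled and *out is left unchanged. */
static bool cache_access(Cache *c, uint64_t addr, uint8_t *out) {
    assert(c != NULL && out != NULL);
    CacheLine *set = c->sets[decode_set(addr)];
    uint64_t tag = address_tag(addr);
    unsigned offset = (unsigned)(addr & (BLOCK_SIZE - 1));

    /* Steps 230 and 240: tag match within the selected set. */
    for (int w = 0; w < NUM_WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {
            touch(set, w);
            *out = set[w].data[offset]; /* step 260: read the data array */
            return true;                /* step 250: hit signal asserted */
        }
    }

    /* Step 270: miss. Pick an invalid way if any, else the oldest way. */
    int victim = 0;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (!set[w].valid) { victim = w; break; }
        if (set[w].lru_age > set[victim].lru_age) victim = w;
    }
    uint64_t block_base = addr & ~(uint64_t)(BLOCK_SIZE - 1);
    for (unsigned i = 0; i < BLOCK_SIZE; i++) {
        set[victim].data[i] = backing_byte(block_base + i);
    }
    set[victim].valid = true;
    set[victim].tag = tag;
    touch(set, victim);
    return false;                       /* hit signal held at zero */
}
```

In this sketch, a first access to address 0x1000 misses and triggers a refill, while a subsequent access to 0x1004 falls in the same block and hits. Accessing three distinct blocks that map to the same set evicts the least recently used of them, which illustrates why the small capacity of the cache requires its contents to be managed.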
The drawback with these caches is that their capacity is small relative to the size of main memory; consequently, their contents must be carefully managed to maximize the probability that a processor's future memory accesses will be available from the cache. Rather than simply retaining the most recently touched data, many processors implement prefetching mechanisms that predict those memory locations that a processor will reference in the future, and preload this data into the cache in preparation for the processor's upcoming demands. These prefetching mechanisms can be categorized as software prefetchers or hardware prefetchers.
Software prefetching is supported by the processor through one or more special prefetch instructions that are inserted into a program's instruction sequence by a programmer, compiler, or run-time system, based on some knowledge of the application's future memory reference pattern. A prefetch instruction causes the processor to preload a memory location into the processor's cache, without stalling the processor while it is being loaded. Unlike software prefetchers, hardware prefetchers operate independently of any software control. By monitoring either the pattern of memory accesses being performed by a processor, or the pattern of cache miss requests from a cache, these prefetchers can predict a processor's future memory access pattern, and preload this data without any support from the programmer, compiler, or run-time system.
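As a non-limiting illustration of software prefetching, the following sketch inserts a prefetch hint into a gather loop whose future references are known from an index array. It assumes a GCC- or Clang-compatible compiler, which provides the `__builtin_prefetch(addr, rw, locality)` builtin; the function name gather_sum and the constant PREFETCH_DISTANCE are hypothetical, and a suitable distance would in practice depend on the memory latency of the target machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical look-ahead distance, in loop iterations. */
enum { PREFETCH_DISTANCE = 8 };

/* Sums table[index[i]] for i in [0, n). The index array tells the program
 * which table entries will be needed in the future, so a prefetch for a
 * later iteration can be issued while the current one is processed. */
static long gather_sum(const long *table, const size_t *index, size_t n) {
    assert(n == 0 || (table != NULL && index != NULL));
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Guard the look-ahead so index[] is never read out of bounds.
         * Arguments: address, 0 = prefetch for read, 3 = high locality. */
        if (i + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(&table[index[i + PREFETCH_DISTANCE]], 0, 3);
        }
#endif
        sum += table[index[i]];
    }
    return sum;
}
```

The prefetch is only a hint and does not change the value computed by the loop; with compilers that lack the builtin, the sketch falls back to the plain loop.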
In many scenarios, the benefits of known hardware and software prefetchers are limited, because a program's reference pattern may be difficult to determine or summarize in a manner amenable to either hardware or software. Also, an application's recent memory reference history may not be indicative of its future memory reference pattern. For example, many applications exhibit phase behavior, in which the application's working set of memory locations and memory access patterns are consistent within one phase of time but may vary widely across different phases. These phases may also be periodic, following a predictable sequence of phases (e.g., ABCABCABC, where each letter represents a distinct phase of execution). When phases change, the hardware prefetcher may not remember the reference pattern that occurred during a previous execution of the new phase, and must incur a training period during which its effectiveness is limited.
Therefore, there is a need for a prefetch mechanism to overcome the stated shortcomings of the known art.