1. Field of the Invention
The present invention relates generally to cache performance, and more particularly to hardware prefetching.
2. Description of Related Art
Hardware prefetching schemes attempt to reduce the number of cache-missing memory operations observed by an application by predicting the application's data requirements and using the predictions to bring that data closer to the processor. Ideally, the required data is already available in the processor's caches when the application eventually requests it. This reduces the time the application spends stalled waiting for data and improves application performance.
To function effectively, hardware prefetchers require sufficient information about the behavior of the application to predict accurately what to prefetch. The prediction mechanisms used by prior art hardware prefetch schemes can be used to classify the schemes into three basic categories: observational schemes; speculative execution (execution-driven) schemes; and application-assisted schemes.
Observational schemes function by observing the behavior of the executing applications and monitoring events such as cache misses or cache block utilization. Using this information, the hardware prefetchers attempt to predict which data the application is likely to require in the near future. This category can be further subdivided by considering the mechanisms used to derive prefetch predictions from the observational data.
For applications with structured workloads, such as high-performance and technical computing applications, there is a significant probability that there will be a regular displacement between successive missing memory operations (especially when considered at cache line granularity). Stride prefetchers exploit this phenomenon: by observing the “stride” between successive cache misses, they predict that this relative displacement between misses will continue, allowing the prefetcher to speculatively prefetch for misses further along the stream.
The simplest stride prefetcher is a unit stride prefetcher. Given misses for sequential cache lines (ascending or descending), the unit stride prefetcher predicts that the subsequent cache lines will also be required; i.e., after observing misses for cache lines L and L+1, the unit stride prefetcher issues prefetches for cache lines L+2, L+3, etc.
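The unit stride behavior described above can be sketched in a few lines. This is a minimal illustration, not a description of any particular hardware: the function name `unit_stride_prefetches` and the `degree` parameter (how many lines are prefetched per trigger) are assumptions introduced here.

```python
def unit_stride_prefetches(miss_lines, degree=2):
    """Return (trigger_line, prefetched_lines) pairs for a miss sequence.

    miss_lines: cache-line numbers of successive misses.
    degree: lines prefetched per trigger (hypothetical parameter).
    """
    issued = []
    prev = None
    for line in miss_lines:
        # Two misses to adjacent lines, ascending (+1) or descending (-1),
        # trigger prefetches that continue in the same direction.
        if prev is not None and line - prev in (1, -1):
            direction = line - prev
            ahead = [line + direction * k for k in range(1, degree + 1)]
            issued.append((line, ahead))
        prev = line
    return issued


# Misses for lines L and L+1 (L = 100) trigger prefetches for L+2 and L+3.
print(unit_stride_prefetches([100, 101]))
```

A descending pair such as lines 50 and 49 yields prefetches for lines 48 and 47, while a non-unit displacement such as 10 followed by 14 triggers nothing in this sketch.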
More sophisticated stride prefetchers can deal with non-unit strides, e.g., a miss for line L followed by a miss for line L+4. However, when an application concurrently accesses multiple streams, the misses from many independent streams may be interleaved, making the detection of strides problematic.
Stride detection is addressed in two distinct fashions. If the hardware prefetcher has access to the program counter of the cache-missing memory operation, misses can be attributed to specific instructions and streams are tracked on a per-program-counter basis. However, the program counter of the memory operation that caused the cache miss is not typically conveyed with the miss request (and it is the sequence of requests that is monitored by the hardware prefetcher). Consequently, prediction schemes that require program counter information can incur significant additional cost and complexity, because this information must be made available to the prefetcher. As a result, several schemes have been proposed to perform this disambiguation without requiring access to program counter information, including minimum delta prediction and the partitioning of memory into regions, with all misses falling within a single region attributed to a single stream.
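The region-partitioning approach can be illustrated with a short sketch that needs no program counter information. The class name `RegionStrideDetector` and the region size of 64 cache lines are assumptions chosen for illustration only.

```python
class RegionStrideDetector:
    """Attribute each miss to a stream based on its memory region."""

    def __init__(self, region_lines=64):
        self.region_lines = region_lines  # assumed region size, in cache lines
        self.last_miss = {}               # region id -> last missing line

    def observe(self, line):
        """Return the stride seen within the line's region, or None."""
        region = line // self.region_lines
        prev = self.last_miss.get(region)
        self.last_miss[region] = line
        return None if prev is None else line - prev


# Two interleaved streams: stride 4 starting at line 0, stride 2 at line 1000.
detector = RegionStrideDetector()
misses = [0, 1000, 4, 1002, 8, 1004]
print([detector.observe(m) for m in misses])
```

Taken as a single global sequence, the displacements between these misses are 1000, -996, 998, -994 and 996, which reveal no stride. Tracking the last miss per region recovers the strides of 4 and 2. Two streams that fall within the same region would still be confused in this sketch.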
Stride prefetchers support a single constant stride for each stream, the stride typically being determined during the initial training phase of the prefetcher. Techniques such as 2-delta stride prediction are frequently leveraged to prevent single “erroneous” misses from causing the stride predictor to deviate from the correct stride. Stride prefetchers also usually leverage techniques to temporarily delay and/or limit prefetching until the stride prefetcher becomes more certain that a new stream has been identified.
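One common formulation of 2-delta stride prediction keeps two values per stream: the most recently observed delta and the stride used for prediction, with the latter updated only when the same delta is seen twice in a row. The sketch below follows that formulation; the class name `TwoDeltaStride` and its fields are hypothetical, and real designs differ in detail.

```python
class TwoDeltaStride:
    """Per-stream stride predictor that ignores one-off deviations."""

    def __init__(self):
        self.last_line = None   # most recent missing line
        self.last_delta = None  # most recent observed displacement
        self.stride = None      # stride used for prediction

    def observe(self, line):
        """Record a miss and return the predicted next line, or None."""
        if self.last_line is not None:
            delta = line - self.last_line
            # Adopt a new stride only after seeing it twice in succession.
            if delta == self.last_delta:
                self.stride = delta
            self.last_delta = delta
        self.last_line = line
        return None if self.stride is None else line + self.stride


# Stride-4 stream with a single "erroneous" miss at line 9.
predictor = TwoDeltaStride()
print([predictor.observe(m) for m in [0, 4, 8, 9, 13, 17]])
```

In this trace the deltas are 4, 4, 1, 4, 4. The lone delta of 1 never repeats, so the predicting stride remains 4 throughout.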
Context-based prediction schemes are capable of handling more unstructured behavior, essentially functioning by remembering sequences of past events and prefetching if a sequence is observed to begin repeating. Consequently, these schemes require that the pattern of misses be observed at least once before prefetching can be performed. The events monitored to drive the prefetch predictions vary greatly and can range from the memory addresses of the individual missing memory operations to the eviction and replacement of cache lines or cache tags.
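A very small context-based predictor can be sketched as a table that remembers, for each missing line, the line that missed next. This is only one simple instance of the idea, keyed on miss addresses; the class name `PairCorrelationPredictor` is an assumption, and practical schemes use bounded tables and longer contexts.

```python
class PairCorrelationPredictor:
    """Remember which miss followed each miss, and replay it on repeats."""

    def __init__(self):
        self.successor = {}  # missing line -> line that missed next last time
        self.prev = None

    def observe(self, line):
        """Record a miss and return the line predicted to miss next, or None."""
        prediction = self.successor.get(line)
        if self.prev is not None:
            self.successor[self.prev] = line
        self.prev = line
        return prediction


# An irregular miss pattern that repeats once.
predictor = PairCorrelationPredictor()
pattern = [7, 93, 12, 40]
print([predictor.observe(m) for m in pattern + pattern])
```

No prediction is possible during the first pass over the pattern. Predictions begin only when the sequence starts to repeat, which matches the requirement that the miss pattern be observed at least once.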
Execution-driven schemes function by speculatively executing ahead of the main thread of execution in an attempt to prefetch cache-missing memory accesses, such that data is instantiated in the processor's caches before the main thread of execution requires it. Execution-driven schemes are frustrated by both distributed and dependent misses. Distributed misses can be problematic because the speculative execution may never progress far enough ahead to locate the next miss. The memory addresses of dependent misses depend on the data returned by previous missing loads, so the addresses cannot be determined, and the operations cannot be executed, until those earlier loads complete.
Application-assisted hardware prefetchers derive prefetch information directly from the application. During compilation, in addition to generating the standard application binary, the compiler generates a distilled version of the application, which retains only the instructions necessary to generate the memory addresses of the data to be prefetched. Portions of this distilled application are periodically supplied, by the executing application, to the hardware prefetcher, which executes the code to determine which memory addresses to prefetch.
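The notion of a distilled application can be illustrated with a hypothetical gather loop. In this sketch, `original_kernel` stands in for the full computation and `distilled_kernel` for the address-only slice a compiler might retain; the function names, the 8-byte element size, and the base address are all assumptions made for illustration.

```python
ELEM_SIZE = 8  # assumed size of one array element, in bytes


def original_kernel(values, index):
    """Full computation: gather values through an index array and sum them."""
    total = 0
    for i in index:
        total += values[i]
    return total


def distilled_kernel(base_addr, index):
    """Address-only slice: the arithmetic on the loaded data is dropped."""
    return [base_addr + i * ELEM_SIZE for i in index]


index = [3, 0, 5]
print(original_kernel([10, 20, 30, 40, 50, 60], index))
print([hex(a) for a in distilled_kernel(0x1000, index)])
```

The distilled version never reads `values`. It only reproduces the addresses the original loop will touch, which is the information a prefetcher needs.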