1. Technical Field
The present invention relates generally to data processing and more particularly to prefetching data for utilization during data processing. Still more particularly, the present invention relates to a method and system for prefetching data having multiple patterns.
2. Description of Related Art
Prefetching of data for utilization within data processing operations is well-known in the art. Conventional computer systems are designed with a memory hierarchy comprising different memory devices, with access latency increasing the further a device is from the processor. The processors in these systems typically operate at a very high speed and process data at such a fast rate that a sufficient number of cache lines of data must be prefetched from lower level caches (and/or system memory). This prefetching ensures that the data is ready and available when the processor needs it.
Data prefetching is a proven, effective way to hide increasing memory latency from the processor's execution units. Conventional systems utilize a software prefetch method (software prefetching), a hardware-implemented prefetch method (hardware prefetching), or both. Software prefetching schemes rely on the programmer/compiler to insert prefetch instructions within the program execution code. Hardware prefetching schemes, in contrast, rely on special hardware to detect patterns in data accesses and responsively generate and issue prefetch requests according to the detected pattern. Because hardware prefetching does not incur instruction overhead and is able to dynamically adapt to program behavior, many hardware prefetching techniques have been proposed over the years. Most of these hardware techniques have shown great success in detecting certain types of access patterns, in particular sequential accesses.
Conventional hardware prefetching schemes utilize history tables to detect patterns. These tables save a number of past accesses and are indexed either by instruction address (i.e., program counter (PC)) or data address. Indexing using PCs works only for streams accessed within loops. However, as compilers continue to perform aggressive optimizations such as loop unrolling, which results in a stream being accessed through multiple instructions, such indexing is becoming less and less attractive.
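By way of illustration only, the following minimal Python sketch models a history table indexed by PC and shows how loop unrolling spreads one stream over several table entries. The class and function names, the two-observation confirmation rule, the PC values, and the 8-byte element size are all assumptions made for the example and are not taken from any particular system.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    last_addr: int           # most recent address seen for this PC
    stride: int = 0          # last observed address delta
    confirmed: bool = False  # True once the same non-zero delta repeats


class PcIndexedTable:
    """Toy history table keyed by the address of the load instruction (PC)."""

    def __init__(self):
        self.entries = {}

    def access(self, pc, addr):
        """Record one access; return a predicted next address or None."""
        entry = self.entries.get(pc)
        if entry is None:
            self.entries[pc] = Entry(last_addr=addr)
            return None
        delta = addr - entry.last_addr
        entry.confirmed = (delta == entry.stride and delta != 0)
        entry.stride = delta
        entry.last_addr = addr
        return addr + delta if entry.confirmed else None


def run(trace):
    """Feed a list of (pc, addr) pairs through a fresh table."""
    table = PcIndexedTable()
    predictions = [table.access(pc, addr) for pc, addr in trace]
    return table, predictions


# Hypothetical trace: one load at PC 0x400 walking 8-byte elements.
rolled = [(0x400, 0x1000 + 8 * i) for i in range(8)]
# The same stream after unrolling by four: four loads, each at its own PC.
unrolled = [(0x400 + 4 * (i % 4), 0x1000 + 8 * i) for i in range(8)]

rolled_table, rolled_pred = run(rolled)
unrolled_table, unrolled_pred = run(unrolled)
```

In this sketch the rolled loop occupies a single table entry and yields a prediction from its third access onward. The unrolled version of the same stream occupies four entries, and none of them confirms a stride within the same eight accesses, because each PC observes only every fourth element.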
When indexed by data address, a history table is able to save either virtual addresses or physical addresses. Saving virtual addresses can predict streams across multiple pages but requires accessing the page translation hardware to translate virtual addresses to physical addresses. Because the page translation hardware is in the critical path of the instruction pipeline, significant hardware overhead is required to allow prefetch requests to access the page translation hardware without slowing down the whole pipeline. Consequently, most (and perhaps all) prefetch engines in commercial systems, such as Intel Pentium™ and Xeon™, AMD Athlon™ and Opteron™, Sun UltraSPARC III™, and IBM POWER4™ and POWER5™, are indexed by physical addresses and store physical addresses.
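The trade-off above can be pictured with a short Python sketch of a prefetcher that holds only physical addresses. Because contiguous virtual pages are not necessarily contiguous in physical memory, such a prefetcher would typically stop generating requests at the page boundary. The 4 KiB page size, the 128-byte line size, and the function name are assumptions chosen for the example.

```python
PAGE_SIZE = 4096  # assumed page size in bytes
LINE_SIZE = 128   # assumed cache-line size in bytes


def next_line_prefetches(phys_addr, depth):
    """Return up to `depth` next-line addresses that stay in the same page."""
    page = phys_addr // PAGE_SIZE
    line_base = phys_addr - (phys_addr % LINE_SIZE)
    candidates = []
    for i in range(1, depth + 1):
        target = line_base + i * LINE_SIZE
        if target // PAGE_SIZE != page:
            break  # the next physical page may hold unrelated data
        candidates.append(target)
    return candidates


# Near the start of a page, all four requested lines can be prefetched.
start_of_page = next_line_prefetches(0x2000, 4)
# One line before the end of the page, only a single line remains.
end_of_page = next_line_prefetches(0x2F00, 4)
```

A table holding virtual addresses would not need this cutoff, but each prefetch request would then have to pass through the page translation hardware, which is the overhead described above.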
In data operations, several types of stream patterns exist, and each requires a different scheme to detect the particular pattern. These patterns include sequential unit-stride patterns, non-unit-stride patterns, and pointer chasing, among others. Current systems are designed to perform well on only some of these patterns. No existing scheme works efficiently on all patterns, and systems are typically designed to track only one pattern. Also, all existing systems operate with a single history table for detecting patterns from the tracked requests.
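As a rough illustration of how these patterns differ, the Python sketch below labels a short address trace by the deltas between successive cache lines. The 128-byte line size, the label strings, and the sample addresses are assumptions made for the example. Pointer chasing is shown only as a trace with no constant stride, since detecting it in practice requires inspecting the loaded data rather than the addresses alone.

```python
LINE_SIZE = 128  # assumed cache-line size in bytes


def classify_stream(addresses, line_size=LINE_SIZE):
    """Label a short address trace by its line-to-line deltas."""
    lines = [a // line_size for a in addresses]
    deltas = [b - a for a, b in zip(lines, lines[1:])]
    if len(deltas) < 2:
        return "unknown"  # too few accesses to see a repeated delta
    if all(d == deltas[0] for d in deltas):
        if deltas[0] in (1, -1):
            return "unit-stride"
        if deltas[0] != 0:
            return "non-unit-stride"
    return "irregular"  # no constant non-zero stride


# Hypothetical traces for each pattern.
sequential = classify_stream([0, 128, 256, 384])
strided = classify_stream([0, 512, 1024, 1536])
linked_list = classify_stream([0x1000, 0x9A80, 0x2340, 0x7F00])
```

Here the three traces are labeled "unit-stride", "non-unit-stride", and "irregular" respectively.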
Thus, because of the difficulty of detecting multiple patterns using one table, the prefetch engines of all of the above-listed real systems are able to detect only sequential streams (i.e., unit-stride streams) or, in some instances, small non-unit-stride streams. Researchers in the industry have proposed utilizing a complicated history table with a complicated state machine to detect odd access patterns in physical addresses. However, the complexity of these designs prevents them from being adopted into a real system.
Thus, the prefetch engines of the conventional systems are unable to provide support for more than one of the common patterns in data prefetch operations. Further, there is no existing prefetch scheme that is able to detect both unit and non-unit stride streams without incurring a substantial hit with respect to chip area and additional cost due to the required hardware complexity.