1. Field of the Invention
The present invention relates generally to the data processing field and, more particularly, to a computer-implemented method, system, and computer program product for prefetching data in a data processing system.
2. Description of the Related Art
Hardware data prefetchers have been employed in modern data processing systems to anticipate and mitigate the substantial latency in retrieving data from lower-level caches and other data storage media. This latency, measured as the total number of processor cycles required to retrieve data from a data storage medium, has been growing rapidly because processor frequencies have increased without a corresponding improvement in the time required to access a data storage medium.
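By way of illustration only, the relationship between clock frequency and latency measured in cycles can be sketched as follows. The function name and the numeric values (a fixed 100 ns access time at 1 GHz and 4 GHz) are hypothetical and are chosen solely to show that a constant access time costs more cycles as frequency rises.

```python
def latency_cycles(access_time_ns: float, frequency_ghz: float) -> float:
    """Processor cycles spent waiting on an access of fixed wall-clock duration.

    One nanosecond at 1 GHz is one cycle, so cycles = ns * GHz.
    """
    return access_time_ns * frequency_ghz


# Hypothetical values: the storage access time stays at 100 ns
# while the processor clock rises from 1 GHz to 4 GHz.
slow_clock = latency_cycles(100.0, 1.0)  # 100.0 cycles
fast_clock = latency_cycles(100.0, 4.0)  # 400.0 cycles
```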
In response to this rapid growth in latency, stream hardware data prefetchers have been employed to detect data streams (a data stream being any sequence of storage references that reference a contiguous set of cache blocks in a monotonically increasing or decreasing manner) and to begin prefetching data up to a predetermined depth, i.e., a predetermined number of cache blocks ahead of the data which the processing system is currently loading. The detection of a stream requires at least two cache misses to consecutive lines. To avoid prefetching cache lines which may not be needed, prefetching for a newly detected stream generally begins conservatively, and the depth of prefetching is gradually increased until a specified depth is achieved.
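The detection and ramp-up behavior described above can be sketched, under simplifying assumptions, as a small software model. This is not a description of any particular hardware: the names `Stream` and `simulate`, the single tracked stream, the ramp of one line per consumed prefetch, and the assumption that a prefetched line arrives instantly are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    direction: int   # +1 for ascending lines, -1 for descending lines
    next_line: int   # next cache line the model will prefetch
    depth: int = 1   # current prefetch depth, in cache lines


def simulate(accesses, max_depth=4):
    """Return (misses, depth_trace) for a sequence of cache-line numbers.

    Assumptions: one stream at most, prefetches complete instantly,
    and the depth grows by one line per consumed prefetched line.
    """
    prefetched = set()
    stream = None
    last_miss = None
    misses = 0
    depth_trace = []
    for line in accesses:
        if line in prefetched:
            prefetched.discard(line)
            if stream is not None:
                # Ramp-up phase: deepen gradually up to the specified depth.
                stream.depth = min(stream.depth + 1, max_depth)
        else:
            misses += 1
            adjacent = last_miss is not None and abs(line - last_miss) == 1
            if stream is None and adjacent:
                # Two misses to consecutive lines: allocate a stream.
                step = line - last_miss
                stream = Stream(direction=step, next_line=line + step)
            last_miss = line
        if stream is not None:
            # Issue prefetches until the model is `depth` lines ahead.
            target = line + stream.direction * stream.depth
            while (target - stream.next_line) * stream.direction >= 0:
                prefetched.add(stream.next_line)
                stream.next_line += stream.direction
        depth_trace.append(stream.depth if stream is not None else 0)
    return misses, depth_trace


misses, depths = simulate(range(100, 110), max_depth=4)
# misses == 2; depths == [0, 1, 2, 3, 4, 4, 4, 4, 4, 4]
```

In this model a ten-line stream incurs two misses, whereas a three-line stream such as `simulate([0, 1, 2])` also incurs two misses, so a shorter stream pays proportionally more for detection.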
Modern stream prefetchers can track and prefetch multiple streams simultaneously by using a plurality of stream registers, where each stream register contains the address that identifies the state of the current data stream as well as other information such as the direction of the stream, the number of lines consumed since the start of the stream, and the generated but not yet scheduled prefetches.
Although the prefetch approach described above has worked effectively to date, the startup penalty (the two initial misses) and the ramp-up phase (where the prefetches may not be deep enough to ensure timely delivery of data to the application) have an associated performance penalty which grows as memory latency grows. Applications which tend to have short to medium stream lengths benefit only slightly from the data prefetcher, if at all, because of the startup penalty. For such situations, it is desirable to mitigate as much of the startup penalty as possible.
Data stream prefetchers hold some historical information about the data streams that are currently defined in any of their stream registers, as described earlier, i.e., the current data address, the direction, the next line to be prefetched, and pending prefetches that have been generated but not sent. When a new stream displaces a current stream, however, this information is lost because these fields are overwritten by the new stream allocation.
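As a rough illustration of the stream register contents and of the loss of state on displacement, consider the following sketch. The class names, the field names, the use of a symbolic stream identifier, and the least-recently-used replacement policy are assumptions made for the example; actual hardware would match on addresses and may use a different policy.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class StreamRegister:
    current_line: int                 # address state of the stream
    direction: int                    # +1 ascending, -1 descending
    next_prefetch_line: int           # next line to be prefetched
    lines_consumed: int = 0           # lines consumed since stream start
    pending_prefetches: list = field(default_factory=list)  # generated, not sent


class StreamRegisterFile:
    """A fixed number of stream registers with (assumed) LRU replacement."""

    def __init__(self, num_registers):
        self.num_registers = num_registers
        self.registers = OrderedDict()  # hypothetical stream id -> register

    def allocate(self, stream_id, start_line, direction):
        """Allocate a register; return the id of any displaced stream."""
        displaced = None
        if len(self.registers) >= self.num_registers:
            # The oldest register is overwritten; its state is discarded.
            displaced, _lost_state = self.registers.popitem(last=False)
        self.registers[stream_id] = StreamRegister(
            current_line=start_line,
            direction=direction,
            next_prefetch_line=start_line + direction,
        )
        return displaced

    def lookup(self, stream_id):
        reg = self.registers.get(stream_id)
        if reg is not None:
            self.registers.move_to_end(stream_id)  # mark as recently used
        return reg


regs = StreamRegisterFile(num_registers=2)
regs.allocate("a", start_line=100, direction=+1)
regs.allocate("b", start_line=500, direction=-1)
evicted = regs.allocate("c", start_line=900, direction=+1)
# evicted == "a"; regs.lookup("a") is None, so the history of "a" is gone
```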
Retaining some of this historical information (e.g., the startup data address) for use the next time the same data stream is detected in a subsequent iteration of the application code would help achieve the objective of lowering the startup penalty. For typical applications, however, this would require a huge table in order to be effective, as there tend to be enormous numbers of data streams generated in the course of a typical application. For example, one double-nested or triple-nested loop structure alone can easily generate thousands of streams.
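To make the stream count concrete, the following sketch enumerates the addresses touched by a hypothetical triple-nested loop and counts the maximal runs of consecutive cache lines. The 128-byte line size, the 8-byte element size, the array dimensions, and the padded leading dimension are all assumptions chosen for illustration; the counting rule treats each maximal monotonic run of adjacent lines as one stream.

```python
def loop_addresses(ni, nj, nk, ld, elem_size=8, base=0):
    """Byte addresses touched by a triple-nested loop over a[i][j][k].

    `ld` is the (hypothetical) leading dimension of the innermost row in
    elements; ld > nk leaves an untouched gap between successive rows.
    """
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                yield base + ((i * nj + j) * ld + k) * elem_size


def count_streams(addresses, line_size=128):
    """Count maximal runs of consecutive cache lines in access order."""
    streams = 0
    prev_line = None
    direction = 0
    for addr in addresses:
        line = addr // line_size
        if line == prev_line:
            continue  # same cache line, same stream
        step = None if prev_line is None else line - prev_line
        if step in (1, -1) and direction in (0, step):
            direction = step  # run continues in the same direction
        else:
            streams += 1      # first line, a gap, or a reversal
            direction = 0
        prev_line = line
    return streams


# Each 64-element row covers 4 lines; padding to 128 elements separates rows.
padded = count_streams(loop_addresses(64, 64, 64, ld=128))  # 4096
dense = count_streams(loop_addresses(64, 64, 64, ld=64))    # 1
```

Under these assumptions a single loop nest yields 4096 short streams of four lines each, which suggests why a per-stream history table would need to be very large to be effective.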
One solution to this problem is to allow the compiler to insert special instructions into the object code that direct the hardware prefetch engine to begin a stream at a specified address and to ramp quickly to the desired depth. This approach, however, requires applications to be recompiled and is limited by the imperfect knowledge that the compiler has at compilation time about the nature of the streams and their caching (since array sizes are often runtime variables).
It would, accordingly, be desirable to provide a mechanism for improving prefetching performance in a data processing system which is effective on existing binaries that have not been compiled using stream prefetch instructions, in order to achieve better general exploitation of the data prefetching hardware in the data processing system.