1. Technical Field
The present invention relates to a method of minimizing effective memory latency without unnecessary cost. In particular, the present invention relates to fine-grained, software-directed data prefetching using integrated high-level and low-level code analysis and optimizations.
2. Description of Related Art
In conventional computing systems, prefetching is a well-known technique for tolerating memory access latency, which can adversely affect the performance of applications on modern processors. Rather than waiting for a cache miss to initiate a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. Much of the recent work in the area of prefetching has focused on three dimensions of prefetching effectiveness: timeliness, accuracy and overhead. Timeliness is the placement of prefetches such that the latency to memory is effectively hidden. Accuracy is prefetching data that the program will actually use, before it is used, while reducing prefetches of data that will not be used and that merely pollute the caches. Overhead refers to the resources consumed by the prefetch instructions themselves, which should be kept to a minimum.
Data prefetching can be accomplished by software alone, hardware alone or a combination of the two. Software prefetching relies on compile-time analysis to insert and schedule prefetch, or touch, instructions within user programs, but the prefetch instructions themselves involve some overhead. Hardware-based prefetching employs special hardware that monitors the storage reference patterns of the application in an attempt to infer prefetching opportunities. It has no instruction overhead, but it is often less accurate than software prefetching because it speculates on future memory accesses without the benefit of compile-time information. The combination of software and hardware prefetching is designed to take advantage of compile-time program information to direct the hardware prefetcher while incurring as little software overhead as possible.
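By way of illustration only, the following is a minimal sketch of what a software-inserted prefetch, or touch, instruction might look like in a simple loop. It assumes a GCC- or Clang-compatible compiler that provides the `__builtin_prefetch` builtin; the function name `sum_with_prefetch` and the value of `PREFETCH_DISTANCE` are hypothetical and are not taken from any particular compiler or from the systems described herein.

```c
#include <stddef.h>

/* Hypothetical prefetch distance, in array elements. A real compiler would
   derive this from the memory latency and the cost of one loop iteration. */
#define PREFETCH_DISTANCE 16

/* Sum an array while touching the element PREFETCH_DISTANCE iterations
   ahead, so that its cache line may already be present when it is read. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Stay inside the array; the arguments request a read prefetch
           (0) with high temporal locality (3). */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
#endif
        sum += a[i];
    }
    return sum;
}
```

The sketch also shows the overhead in question: one additional comparison and one additional prefetch instruction are executed for nearly every iteration, even though several consecutive elements share a cache line and most of those prefetches are therefore redundant.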
The IBM Power4 and Power5 systems have storage hierarchies consisting of three levels of cache (on-chip L1 and L2 caches and an off-chip L3 cache) and the memory subsystem. They employ hardware data prefetching to identify and automatically prefetch streams without any assistance from software. Still, hardware prefetching has shortcomings. First, it does not begin immediately, because it takes several cache misses before a stream is identified. Second, hardware supports only a limited number of prefetch streams; if there are more concurrent streams than the hardware supports, a replacement algorithm is employed, and the hardware may not prefetch the most profitable streams. Third, hardware may prefetch more data than necessary, since it does not know a priori where the end of the stream is.
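The first two shortcomings can be illustrated with a toy model of a hardware stream detector. The sketch below is not the Power4 or Power5 mechanism; the table size `MAX_STREAMS`, the threshold `CONFIRM_MISSES`, the least-recently-used replacement policy and all type and function names are assumptions chosen only for illustration.

```c
#include <stdbool.h>

#define MAX_STREAMS    2  /* hypothetical number of streams tracked */
#define CONFIRM_MISSES 3  /* hypothetical misses needed to confirm a stream */

typedef struct {
    bool valid;
    long next_line;           /* cache line expected to miss next */
    int misses;               /* sequential misses seen so far */
    unsigned long last_used;  /* timestamp for LRU replacement */
} stream_t;

typedef struct {
    stream_t s[MAX_STREAMS];
    unsigned long clock;
} detector_t;

/* Record a cache miss on the given line number. Returns true only when the
   miss extends a confirmed stream, i.e. when this toy model would begin
   prefetching ahead of the program. */
bool detector_on_miss(detector_t *d, long line)
{
    d->clock++;

    /* Does the miss continue a stream that is already being tracked? */
    for (int i = 0; i < MAX_STREAMS; i++) {
        stream_t *e = &d->s[i];
        if (e->valid && e->next_line == line) {
            e->misses++;
            e->next_line = line + 1;
            e->last_used = d->clock;
            return e->misses >= CONFIRM_MISSES;
        }
    }

    /* Otherwise start a new candidate stream, using a free entry if there
       is one and replacing the least recently used entry if there is not. */
    int victim = 0;
    for (int i = 0; i < MAX_STREAMS; i++) {
        if (!d->s[i].valid) { victim = i; break; }
        if (d->s[i].last_used < d->s[victim].last_used) victim = i;
    }
    d->s[victim] = (stream_t){ true, line + 1, 1, d->clock };
    return false;
}
```

In this model, a single sequential stream is not prefetched until its third miss. When three streams are interleaved but only two entries are available, each new miss replaces an entry before that entry can be confirmed, so no stream is ever prefetched.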