1. Field of the Invention
The present invention is directed to computer systems. More particularly, it is directed to techniques to reduce latency of memory access operations within instruction streams executed at computer processors.
2. Description of the Related Art
Microprocessor speeds have been increasing dramatically as the underlying semiconductor technology has advanced. Central processing units used in single-user workstations and even in laptops today are often, at least in terms of clock rates, several times faster than the fastest processors in use just a few years ago. However, changes in processor clock rates do not always result in similar relative improvements in application performance as perceived by end users. A number of factors may affect overall application performance in addition to processor clock rates, among which one of the more important factors may be latency to memory: that is, the time it may take to transfer data and instructions between the memory hierarchy and the processor at which the instructions manipulating the data are executed. Improvements in memory access times have in general not kept pace with improvements in processor speeds. If processors frequently have to wait for data or instructions to be received from memory, many processor cycles may be wasted instead of being used for doing “useful” work, thus reducing the impact of faster clock rates on application performance.
A variety of techniques have been developed in attempts to address the mismatch between memory latency and processor speeds. For example, a number of hierarchical cache architectures have been developed to store frequently accessed data and instructions closer to the processors than main memory. However, cache implementations usually involve tradeoffs between cache size and proximity to the processors; it may not always be feasible to implement large enough caches sufficiently close (in terms of access latency) to the processors to overcome the memory latency problems. In many processor architectures, for example, relatively small and fast Level-1 (L1) caches may be employed, together with larger but not as fast Level-2 (L2) caches. Lookups for data and/or instructions may be performed hierarchically. First, the L1 cache may be examined; if the data/instructions are not found in the L1 cache, the L2 cache may be examined; and if the data/instructions are not found in the L2 cache, the data/instructions may be fetched from main memory. More than two layers of caches may be implemented in some processor architectures. While accesses to L1 caches may be fast (e.g., a few processor cycles), the latency to the L2 caches may still be sufficiently large with respect to processor clock rate (e.g., tens or hundreds of processor cycles) that for many applications, the cache latency (as well as the latency to main memory) may still have a significant impact on overall application throughput.
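The hierarchical lookup described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from any particular processor: the `CacheLevel` and `lookup` names, the cycle counts, the serial accumulation of latency across levels, and the absence of any eviction policy.

```python
from dataclasses import dataclass, field


@dataclass
class CacheLevel:
    """Hypothetical cache level: a name, an access cost, and resident lines."""
    name: str
    latency_cycles: int
    lines: set = field(default_factory=set)


def lookup(address, levels, memory_latency_cycles):
    """Search each level in order; fall back to main memory on a full miss.

    Returns (where_found, total_cycles). Latencies are summed serially,
    which is a simplifying assumption of this sketch.
    """
    cycles = 0
    for depth, level in enumerate(levels):
        cycles += level.latency_cycles
        if address in level.lines:
            # Hit: copy the line into the faster levels above this one.
            for upper in levels[:depth]:
                upper.lines.add(address)
            return level.name, cycles
    # Missed in every cache level: fetch from main memory and fill all levels.
    cycles += memory_latency_cycles
    for level in levels:
        level.lines.add(address)
    return "memory", cycles


# Illustrative (assumed) latencies: L1 = 3 cycles, L2 = 20, memory = 200.
l1 = CacheLevel("L1", 3)
l2 = CacheLevel("L2", 20, {0x40})
print(lookup(0x40, [l1, l2], 200))  # found in L2 after checking L1
print(lookup(0x40, [l1, l2], 200))  # now resident in L1
print(lookup(0x80, [l1, l2], 200))  # misses both levels, goes to memory
```

Under these assumed costs, an access that misses in every level pays the sum of all lookup latencies plus the main memory latency, which is the gap that additional cache levels and prefetching attempt to narrow.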
In another, complementary approach to the memory latency problem, some processor architectures may support prefetch instructions that allow data to be fetched from memory before it is needed, thus masking at least some of the effects of long memory access times. Such prefetch instructions may typically be inserted into applications at code generation time (e.g., at compile time). However, determining exactly which memory references are the best candidates for prefetch may be hard, especially when only binary or compiled versions of the application code are available for instrumentation. In addition, when more than one technique for inserting prefetch instructions into program code is available, it may be difficult to determine the relative efficiency of the various techniques.
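The masking effect of prefetching can be sketched with a toy timing model. The model is an illustration under stated assumptions rather than a description of any real processor: the `run_loop` function and its parameters are hypothetical, the loop is executed in order, every element misses in the cache unless prefetched, prefetches never block, and any number of prefetches may be outstanding at once.

```python
def run_loop(n, mem_latency, work_cycles, prefetch_distance=0):
    """Return total cycles for a toy in-order loop over n elements.

    Each element needs one memory fetch (mem_latency cycles) followed by
    work_cycles of computation. With prefetch_distance > 0, the loop body
    for element i first issues a non-blocking prefetch for element
    i + prefetch_distance.
    """
    ready_at = {}  # element index -> cycle at which its data arrives
    now = 0
    for i in range(n):
        if prefetch_distance > 0:
            target = i + prefetch_distance
            if target < n and target not in ready_at:
                ready_at[target] = now + mem_latency
        # Demand access: if the element was not prefetched, its fetch starts now.
        arrival = ready_at.get(i, now + mem_latency)
        now = max(now, arrival)  # stall until the data is available
        now += work_cycles       # then do the useful work
    return now


# Assumed costs: 100-cycle memory latency, 10 cycles of work per element.
baseline = run_loop(100, mem_latency=100, work_cycles=10)
prefetched = run_loop(100, mem_latency=100, work_cycles=10, prefetch_distance=10)
print(baseline, prefetched)
```

In this model the benefit depends on the prefetch distance relative to the ratio of memory latency to work per element, which hints at why selecting prefetch candidates, and comparing insertion techniques, may be difficult in practice.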