1. Field of the Invention
The present invention relates to latency hiding in computer programs and, in particular, to techniques for scheduling code that includes pre-executable operations, such as prefetches and/or speculative loads, to improve execution performance.
2. Description of the Related Art
Computer systems typically include, amongst other things, a memory system and one or more processors and/or execution units. The memory system serves as a repository of information, while a processor reads information from the memory system, operates on it, and stores it back. As processor speeds and memory system sizes have increased, the mismatch between the processor's ability to address arbitrary stored information and the memory system's ability to supply it has grown. To address this mismatch, memory systems are typically organized as a hierarchy using caching techniques that are well understood in the art.
In general, caches can be used to reduce the average latency of accesses (e.g., reads or writes) to main memory. A cache is typically a small, specially configured, high-speed memory that holds a small portion of the information represented in main memory. By placing the cache (small, relatively fast, expensive memory) between main memory (large, relatively slow memory) and the processor, the memory system as a whole is able to satisfy a substantial number of requests from the processor at the speed of the cache, thereby reducing the overall latency of the system. Some systems may define multiple levels of cache.
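As a rough illustration of why a cache lowers average latency, consider a simple model (an assumption for illustration, not part of the original description) in which a fraction h of requests hit in the cache and are served in time t_c, while the remaining requests are served from main memory in time t_m (including the cache lookup). The values h = 0.95, t_c = 2 cycles and t_m = 200 cycles are hypothetical.

```latex
\begin{aligned}
t_{\text{avg}} &= h\,t_c + (1-h)\,t_m \\
               &= 0.95 \times 2 + 0.05 \times 200 \\
               &= 1.9 + 10 = 11.9\ \text{cycles}
\end{aligned}
```

Under these assumptions, the 5% of requests that miss account for 10 of the 11.9 cycles, which is why the frequency of misses, and whether their latency can be hidden, dominates performance.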
When the data requested by the processor is in the cache (known as a “hit”), the request is satisfied at the speed of the cache. However, when the data requested by the processor is not in the cache (known as a “miss”), the processor must wait until the data is provided from the slower main memory, resulting in greater latency. Typically, useful work is stalled while data is supplied from main memory. As is well known in the art, the frequency of cache misses is much higher in some applications or execution runs than in others. In particular, accesses for some database systems tend to miss in the cache with higher frequency than accesses for some scientific or engineering applications. In general, such variation in cache miss frequencies can be traced to differing spatial and temporal locality characteristics of the memory access sequences. In some scientific or engineering applications, particularly those characterized by array accesses, hardware techniques can be employed to predict subsequent accesses. However, in many applications, it is difficult for hardware to discern and predict memory access sequences.
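The effect of locality on miss frequency can be seen in a toy simulation. The Python sketch below models a hypothetical direct-mapped cache; the line size, set count and word size are illustrative assumptions, not values from any particular processor. It compares an array-style sequential walk with a linked-list-style pointer chase over the same 32 KiB of data.

```python
import random

LINE_BYTES = 64   # hypothetical cache line size
NUM_SETS = 64     # hypothetical direct-mapped cache: 64 lines x 64 B = 4 KiB
WORD_BYTES = 8    # hypothetical word / node size


def hit_rate(addresses):
    """Fraction of accesses that hit in a direct-mapped cache that starts empty."""
    tags = [None] * NUM_SETS
    hits = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        index, tag = line % NUM_SETS, line // NUM_SETS
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag  # miss: fill this set with the line from main memory
    return hits / len(addresses)


def sequential_walk(n):
    """Array-style access: consecutive words, strong spatial locality."""
    return [i * WORD_BYTES for i in range(n)]


def pointer_chase(n, seed=0):
    """Linked-list-style access: each node's successor is a randomly chosen node."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    # Link the shuffled nodes into a single cycle so every node is visited once.
    successor = {order[i]: order[(i + 1) % n] for i in range(n)}
    node, addrs = order[0], []
    for _ in range(n):
        addrs.append(node * WORD_BYTES)
        node = successor[node]
    return addrs


if __name__ == "__main__":
    n = 4096  # 4096 words x 8 B = 32 KiB, eight times the hypothetical cache size
    print("sequential:", hit_rate(sequential_walk(n)))  # sequential: 0.875
    print("pointer chase:", hit_rate(pointer_chase(n)))
```

The sequential walk misses once per 64-byte line and hits on the other seven words, giving a hit rate of exactly 7/8. The pointer chase touches the same words, but because successors are scattered, the other words of a fetched line are rarely the next ones used, and its hit rate is far lower.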
To increase the likelihood of cache hits and thereby reduce apparent memory access latency, some computer systems define instructions for prefetching data from memory to cache. The premise is that software (e.g., the programmer or a compiler) may be in a better position than hardware to identify prefetch opportunities. To this end, some instruction set architectures, such as the SPARC® V9 instruction set architecture, support software prefetch instructions. SPARC architecture based processors are available from Sun Microsystems, Inc., Palo Alto, Calif. SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems.
Effective use of prefetch instructions is often difficult. Indeed, access patterns for many applications, including database applications, often include chains of successive dependent accesses where, in general, no spatial locality can be presumed. For example, consider the following instruction sequence:
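LD [R21], R22
LD [R22], R23
LD [R23], R24
in which each load depends on the address value loaded by the preceding instruction. Such chains of successive dependent accesses are commonly known as address chains. Because the address for each load in a chain is not known until the preceding load completes, a later load in the chain cannot simply be prefetched early. These and other sources of dependency tend to complicate the use of prefetch techniques.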
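The serializing effect of an address chain can be sketched with a small timing model. In the Python below, the memory contents (addresses 100, 200, 300 and 400), the miss latency, and the one-load-per-cycle issue rate are hypothetical assumptions; the point is only that each load in the chain needs the value returned by its predecessor, so miss latencies accumulate rather than overlap.

```python
# Hypothetical memory image: each word holds the address of the next word,
# like the values flowing through R21 -> R22 -> R23 -> R24 in the sequence above.
MEMORY = {100: 200, 200: 300, 300: 400}


def run_chain(start_addr, memory, length):
    """Emulate `length` loads of the form LD [Rn], Rn+1 and return the register values."""
    regs = [start_addr]
    for _ in range(length):
        # The next load's address is unknown until this load returns its value.
        regs.append(memory[regs[-1]])
    return regs


def chain_completion_time(n_loads, miss_latency):
    """Dependent loads that all miss: each waits for its predecessor, so latencies add."""
    return n_loads * miss_latency


def independent_completion_time(n_loads, miss_latency, issue_interval=1):
    """Independent loads that all miss: issued back to back and overlapped,
    assuming (hypothetically) the memory system can keep that many misses outstanding."""
    return (n_loads - 1) * issue_interval + miss_latency


if __name__ == "__main__":
    print(run_chain(100, MEMORY, 3))            # [100, 200, 300, 400]
    print(chain_completion_time(3, 200))        # 600
    print(independent_completion_time(3, 200))  # 202
```

Under these assumptions, three dependent misses cost 600 cycles, versus 202 cycles if the three addresses had been known in advance and the loads could overlap.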
As a result, prefetch instructions are often not used at all, or are used with little or no intelligence and add little to performance. Because effective use typically requires substantial knowledge of the processor and its memory system, the use of prefetch instructions is generally left to compilers. For compilers or other code preparation facilities to use prefetch instructions effectively, techniques are needed for placing prefetches so as to improve overall memory access latency. Techniques that hide the memory access latency of address chains are particularly desirable. Further, while memory access latencies and the placement of prefetch instructions provide a useful context for developing latency hiding techniques, more generally, techniques are desired whereby pre-executable portions of operations (including prefetch instructions) may be placed to improve overall latency in instruction sequences that include operations that are likely to stall. In short, load instructions and prefetch operations are but one example of a more general problem for which solutions are desired.
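For the simpler case of loads whose addresses are known in advance (for example, a[i] in a loop), one common heuristic from the prefetching literature, and not the technique of the present invention, is to issue the prefetch for element i + d during iteration i, with a prefetch distance d = ceil(L / s), where L is the memory latency and s is the number of cycles per loop iteration. The Python sketch below uses a simplified timing model with hypothetical parameters to count stall cycles with and without such prefetches.

```python
import math


def prefetch_distance(latency, cycles_per_iter):
    """Iterations ahead a prefetch must be issued so that its latency is covered."""
    return math.ceil(latency / cycles_per_iter)


def total_stall(n_iters, latency, cycles_per_iter, distance):
    """Total stall cycles for a loop that loads element j in iteration j.

    distance == 0 models no prefetching. Otherwise, element j + distance is
    prefetched at the start of iteration j, and elements 0..distance-1 are
    prefetched together in a prologue at cycle 0 (a simplifying assumption).
    """
    ready = {}  # element index -> cycle at which its data arrives in the cache
    if distance > 0:
        for j in range(min(distance, n_iters)):
            ready[j] = latency  # prologue prefetches issued at cycle 0
    clock = 0
    stall = 0
    for j in range(n_iters):
        if distance > 0 and j + distance < n_iters:
            ready[j + distance] = clock + latency  # issue the prefetch early
        arrival = ready.get(j, clock + latency)    # demand miss if never prefetched
        wait = max(0, arrival - clock)
        stall += wait
        clock += wait + cycles_per_iter
    return stall


if __name__ == "__main__":
    L, s, n = 200, 10, 100           # hypothetical latency, cycles/iteration, trip count
    d = prefetch_distance(L, s)      # 20
    print(total_stall(n, L, s, 0))   # 20000 (no prefetching)
    print(total_stall(n, L, s, d))   # 200 (only the first iteration stalls)
```

With these hypothetical values, d = 20 and only the first iteration stalls (200 cycles), instead of every iteration (20,000 cycles over 100 iterations).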