1. Field of the Invention
The present invention relates to compiler-based techniques for optimizing the performance of computer programs within computer systems. More specifically, the present invention relates to a method and an apparatus that uses value speculation to break constraining dependencies in iterative control flow structures, such as loops.
2. Related Art
Advances in semiconductor fabrication technology have given rise to dramatic increases in microprocessor clock speeds. This increase in microprocessor clock speeds has not been matched by a corresponding increase in memory access speeds. Hence, the disparity between microprocessor clock speeds and memory access speeds continues to grow, and is beginning to create significant performance problems. Execution profiles for fast microprocessor systems show that a large fraction of execution time is spent not within the microprocessor core, but within memory structures outside of the microprocessor core. This means that the microprocessor systems spend a large fraction of time waiting for memory references to complete instead of performing computational operations.
Efficient caching schemes can help reduce the number of memory accesses that are performed. However, when a memory reference, such as a load operation, generates a cache miss, the subsequent access to memory can take hundreds of clock cycles to complete, during which time the processor is typically idle, performing no useful work.
The majority of cache misses occur in iterative control flow structures (or simply, loops). Existing hardware and software prefetching techniques can effectively prefetch data and/or instructions for simple counted loops and for regular strided data streams. However, many commercial applications, such as databases, execute more complicated loops that derive little benefit from conventional prefetching techniques. Inside these more complicated loops, the values of missing loads are often used to determine branch conditions (which creates a control dependence) or to perform other computations (which creates a data dependence). As a result, each iteration of the loop must wait until the constraining control and/or data dependences (from the missing loads to their uses) are resolved before the next iteration can proceed. These circular dependence chains therefore limit how many iterations (and consequently how many cache misses) can be executed in parallel.
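The contrast between the two kinds of loops can be illustrated with a minimal sketch in C. The `node` type and the functions `sum_array` and `find_key` are hypothetical and appear only for illustration; they are not taken from any particular application. In the first loop the address of each load is known from the loop counter, so a conventional prefetcher can run ahead. In the second loop the loaded values feed both the exit branch (a control dependence) and the address of the next load (a data dependence), so a cache miss in one iteration stalls every later iteration.

```c
#include <stddef.h>

/* Hypothetical list node, used only for illustration. */
struct node {
    struct node *next;
    long key;
};

/* Simple counted loop over a regular strided stream: the address of
 * a[i] depends only on i, so misses in different iterations can be
 * prefetched and overlapped. */
long sum_array(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* More complicated loop: each iteration depends on loads performed by
 * the previous one. */
const struct node *find_key(const struct node *p, long key)
{
    while (p != NULL) {
        if (p->key == key)   /* loaded value decides the branch: control dependence */
            break;
        p = p->next;         /* loaded value is the next address: data dependence */
    }
    return p;
}
```

If `p->next` misses in the cache, the address needed by the following iteration is unknown until that miss completes, so at most one such miss is outstanding at a time.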
Hence, what is needed is a method and an apparatus for prefetching load values (and for eliminating other constraining control and/or data dependencies) for more complicated loops.