1. Field of the Invention
The present invention relates to the field of computers. More specifically, the present invention relates to computer architecture.
2. Description of the Related Art
The performance of general-purpose processors is increasingly burdened by the latency of accessing high-latency memory, such as main memory and off-chip cache. Access latency is incurred whenever a load or store instruction misses all of the processor's low-latency caches. This latency, measured in processor cycles, is continually increasing because processor speed is increasing faster than the speed of main memory and of off-chip caches. Because of the large latency of accessing main memory or off-chip caches, a load instruction that requires such an access will likely cause the processor to stall until the load data returns. This stalling causes severe performance degradation. The processor stalls because it cannot find and execute enough instructions that are independent of the stalling load instruction to effectively conceal the access latency of the off-chip cache or main memory.
Generally, two approaches have been applied to the problem of processor performance degradation arising from accesses to high-latency memory. The first approach utilizes prefetching, which requires address computation and address prediction. Data is prefetched from the high-latency memory into low-latency memory. To prefetch the data, a compiler or hardware must predict the addresses of the values to be prefetched. However, address prediction can be difficult, and address computation consumes valuable resources.
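As an illustration of the address prediction that prefetching depends on, the following is a minimal sketch of a per-instruction stride predictor. It is a generic textbook scheme and is not taken from any particular prior-art system; the class name StridePrefetcher, the method observe, and the example addresses are all hypothetical.

```python
class StridePrefetcher:
    """Toy stride-based address predictor (hypothetical names throughout)."""

    def __init__(self):
        # Maps a load's program counter to (last_address, last_stride).
        self.table = {}

    def observe(self, pc, addr):
        """Record the address touched by the load at `pc`.

        Returns the address to prefetch next, or None when no stable,
        nonzero stride has been seen twice in a row for this load.
        """
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, None)
            return None
        last_addr, last_stride = entry
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        if stride != 0 and stride == last_stride:
            return addr + stride
        return None


prefetcher = StridePrefetcher()
regular = [prefetcher.observe(0x400, a) for a in (1000, 1064, 1128)]
irregular = prefetcher.observe(0x400, 5000)  # stride breaks, no prediction
```

The predictor only succeeds on regular access patterns such as array traversals. An irregular access, such as the last call above, yields no prediction, which reflects the difficulty of address prediction noted above.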
The second approach utilizes multithreading. If a thread stalls while waiting for data to arrive from high-latency memory, then the processor switches to a different thread. Two-way and four-way multithreading can be utilized to effectively hide memory latency in applications with sufficient thread-level parallelism. However, four-way multithreading may be inadequate for some applications, and the scalability of multithreading is limited by the resources of the processor. Going beyond four-way multithreading may require additional chip resources and/or increased design complexity.
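The limits of multithreading as a latency-hiding technique can be sketched with a simple idealized model. The function below assumes that every thread alternates between a fixed number of compute cycles and a fixed-length memory stall, and that thread switching is free; the function name and the cycle counts are hypothetical and chosen only for illustration.

```python
def utilization(num_threads, compute_cycles, stall_cycles):
    """Idealized fraction of cycles in which the processor does useful work.

    Assumes each thread computes for `compute_cycles`, then stalls for
    `stall_cycles`, and that switching between threads costs nothing.
    """
    period = compute_cycles + stall_cycles
    return min(1.0, num_threads * compute_cycles / period)


# Hypothetical workload: 20 compute cycles per 80-cycle memory stall.
one_way = utilization(1, 20, 80)
four_way = utilization(4, 20, 80)
eight_way = utilization(8, 20, 80)
```

Under these assumed numbers, four threads are not enough to cover the stall completely, and full utilization is reached only with more hardware threads than four-way multithreading provides.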
In addition to the two generally applied approaches, a relatively new technique, value prediction, has been proposed to increase instruction-level parallelism by breaking true dependence chains. Value prediction techniques predict the resulting value for an instruction, and speculatively execute dependent instructions with the predicted value. Value prediction has been applied to all instruction types and to all load-type instructions. In "Selective Value Prediction," by Brad Calder et al., Proceedings of the 26th International Symposium on Computer Architecture (May 1999), a theoretical value prediction technique is investigated that filters both producer instructions (instructions that produce value predictions) and consumer instructions (dependent instructions that use predicted values as input operands). Based on the instruction filtering, values of particular instructions are installed in a value prediction table. The filtering is based on instruction type, as well as instruction priority. Priority is given to those instructions belonging to the longest data dependence path and the processor's active instruction window. Unfortunately, these value prediction techniques suffer when applied to real-world applications.
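The basic idea of value prediction can be sketched with a last-value predictor guarded by a confidence counter. This is a generic illustration only; it is not the filtering technique of Selective Value Prediction, and the class name, method names, and threshold are hypothetical.

```python
class LastValuePredictor:
    """Toy last-value predictor with a confidence counter (hypothetical)."""

    def __init__(self, threshold=2):
        # Maps an instruction's program counter to [last_value, confidence].
        self.table = {}
        self.threshold = threshold

    def predict(self, pc):
        """Return a predicted result for the instruction at `pc`, or None."""
        entry = self.table.get(pc)
        if entry is not None and entry[1] >= self.threshold:
            return entry[0]
        return None

    def update(self, pc, actual):
        """Train the predictor with the instruction's actual result."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [actual, 0]
        elif entry[0] == actual:
            entry[1] += 1
        else:
            entry[0] = actual
            entry[1] = 0


predictor = LastValuePredictor()
for _ in range(3):
    predictor.update(0x500, 7)       # the same value is produced three times
confident = predictor.predict(0x500)
predictor.update(0x500, 9)           # a different value resets confidence
after_change = predictor.predict(0x500)
```

When a prediction is available, dependent instructions could execute speculatively with it instead of waiting for the producer. In this sketch the table holds one entry per predicted instruction, so it grows with the number of instructions covered.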
These value prediction techniques require very large value prediction tables, which must be accommodated by a host processor and thus increase processor complexity. In addition to these large tables, value prediction techniques such as that proposed in "Selective Value Prediction" require complex filtering and prioritization mechanisms, which further complicate value prediction in an attempt to utilize the large value prediction table more efficiently. Accordingly, a technique that effectively conceals latency incurred from data requests is desirable.