The computer industry constantly strives to provide hardware and software which perform according to the increasing expectations of consumers. One aspect of improved computer performance involves the speed at which instructions and data can be accessed. The more quickly instructions and data can be retrieved from main memory, and the more quickly data can be stored to main memory, the more efficiently a computer operates.
Data caches serve to maintain frequently-accessed data values, so that a processor can load and store them much more quickly than would be possible by interacting directly with the main memory. Instruction caches serve a similar purpose, maintaining one or more series of instructions which are frequently used during the execution of a given program.
While the use of cache storage speeds up execution significantly, it is not without problems. All caches are limited in size, and eventually, a data value or instruction must be retrieved directly from main memory because it is not located in the cache store. This is known as a cache “miss”, and results in the time-consuming operation of waiting for the cache to interact directly with main memory.
Performance can also be improved by pipelining the execution of instructions, so that several instructions are executed at approximately the same time. However, when branch instructions are processed, pipeline performance can be severely degraded, as the pipeline must wait for determination of the branch instruction result. In the search for ever higher levels of performance, research has been conducted in the area of predicting branch results instead of waiting for the branch instruction to be completed, for example by using a branch predictor to store predicted values. However, whenever a branch result is predicted incorrectly, even more performance is lost.
For the purposes of this document, branch mispredictions and cache misses will be known as “long latency events”, and the instructions which cause them to occur will be known as “performance degrading long latency instructions”. To reduce the impact of long latency events, the idea of executing such instructions ahead of main program execution has emerged. Thus, executing a run-ahead thread along with the main program thread allows long latency events to be resolved ahead of the main program execution.
A fundamental problem then arises: the run-ahead thread itself should not be slowed by the existence of long latency events. Otherwise, execution of the main program thread may quickly catch up to the run-ahead thread. An analytical model may be used to explain the significance of the problem.
Assume that resolving long latency events (i.e., branch mispredictions, and data and instruction cache misses) takes a portion K of the total number of execution cycles. Further, assume that some portion S of the total number of execution cycles is required to ensure proper execution of performance degrading long latency instructions. If the run-ahead thread must wait for cache misses, the run-ahead thread will take the portion K+S of the total execution cycles to do its job. The main program thread, relieved of long latency events, still requires the remaining portion 1−K, and it cannot finish before the run-ahead thread does. Thus, the performance increase which might be achieved is about 1/max(1−K, K+S). Assuming that S=20% and K=40% (those skilled in the art will realize that K may easily change based on the type of program and execution environment), the increase is limited to a factor of 1.67, or 67%. If K is increased to 50%, which should leave more room to improve performance, the theoretical performance increase in fact degrades to a factor of 1.43, or 43%. Since the value of K is often greater than 40% in actual program execution, it appears that a single run-ahead execution thread which must wait for the resolution of long latency events has a limited performance potential.
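The arithmetic of the analytical model above can be checked with a short sketch. The function name run_ahead_speedup_bound and the sample values of K other than 40% and 50% are illustrative assumptions and do not appear in the model as stated; the code simply evaluates 1/max(1−K, K+S) for a fixed S.

```python
def run_ahead_speedup_bound(k: float, s: float) -> float:
    """Evaluate the bound 1 / max(1 - K, K + S) from the analytical model.

    k: portion of total execution cycles spent resolving long latency events.
    s: portion of total execution cycles needed to execute the performance
       degrading long latency instructions in the run-ahead thread.
    (Function and parameter names are hypothetical, chosen for illustration.)
    """
    if not (0.0 <= k <= 1.0 and 0.0 <= s <= 1.0):
        raise ValueError("k and s must be fractions between 0 and 1")
    main_thread_portion = 1.0 - k   # main thread with long latency events removed
    run_ahead_portion = k + s       # run-ahead thread that still waits on events
    return 1.0 / max(main_thread_portion, run_ahead_portion)


if __name__ == "__main__":
    s = 0.20  # S = 20%, as assumed in the text
    # 0.30 and 0.60 are extra sample points added only for illustration.
    for k in (0.30, 0.40, 0.50, 0.60):
        bound = run_ahead_speedup_bound(k, s)
        print(f"K = {k:.0%}, S = {s:.0%}: bound = {bound:.2f}")
```

Under this model the bound is largest where 1−K equals K+S, that is, at K = (1−S)/2, which is K=40% for S=20%. For larger K the run-ahead thread itself becomes the limiting factor, which is why the bound falls as K grows past that point.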
To resolve this problem, multiple run-ahead execution threads can be implemented. Examples of this solution are given by J. Collins, et al., in “Speculative Precomputation: Long-range Prefetching of Delinquent Loads,” published in the proceedings of the International Symposium on Computer Architecture, 2001; by C. Zilles and G. Sohi in “Execution-based Prediction Using Speculative Slices,” published in the proceedings of the International Symposium on Computer Architecture, 2001; and by C. Luk in “Tolerating Memory Latency Through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors,” also published in the proceedings of the International Symposium on Computer Architecture, 2001. However, multiple threads incur performance penalties of their own, including those of thread management and context switching. In addition, the portions of program execution required for different long latency events often overlap, so that executing them simultaneously in separate threads is redundant.
Therefore, there is a need in the art for an apparatus, an article including a machine-accessible medium, a computer, and a method of processing data which reduce the effect of long latency events on data processing using run-ahead execution. This need is especially acute with regard to pipelined data processing as it is affected by branch instruction misprediction, and instruction/data cache misses.