1. Field of the Invention
The present invention relates to techniques for improving computer system performance. More specifically, the present invention relates to a method and an apparatus for performing value speculation on an assist processor in order to prefetch data and/or instructions from memory for subsequent use by a primary processor.
2. Related Art
As increasing semiconductor integration densities allow more transistors to be integrated onto a microprocessor chip, computer designers are investigating different methods of using these transistors to increase computer system performance. To this end, computer designers are beginning to incorporate multiple central processing units (CPUs) into a single semiconductor chip. This can result in performance gains for computational tasks that can be parallelized (divided) into separate pieces that can be concurrently executed.
Unfortunately, performance gains from parallelization can be limited for many applications that contain inherently serial portions of code. For these inherently serial portions of code, performance is further limited by memory latency problems.
Memory latency problems are growing progressively worse as processor clock speeds continue to improve at an exponential rate. At today's processor clock speeds, it can take as many as 200 processor clock cycles to pull a cache line in from main memory.
Computer designers presently use a number of techniques to decrease memory latency delays. (1) Out-of-order execution can be used to schedule loads and stores so that memory latency is hidden as much as possible. Unfortunately, out-of-order execution is typically limited to hiding a few clock cycles of memory latency. (2) A non-faulting load instruction can be used to speculatively load a data value, without causing a fault when the address is not valid. (3) A steering load instruction can be used to speculatively load a data value into L2 cache, but not L1 cache, so that L1 cache is not polluted by unused data values. Unfortunately, using non-faulting loads and steering loads can result in unnecessary loads. This wastes instruction cache space and ties up registers. (4) Some researchers have investigated using hardware prefetch engines, but these hardware prefetch engines are typically ineffective for irregular memory access patterns.
What is needed is a method and an apparatus that reduce memory latency delays in fast processor systems without the limitations and costs involved in using the above-discussed techniques.
One embodiment of the present invention provides a system that prefetches from memory by using an assist processor that performs data speculation and that executes in advance of a primary processor. The system operates by executing executable code on the primary processor while simultaneously executing a reduced version of the executable code on the assist processor. This allows the assist processor to generate the same pattern of memory references as the primary processor, but before the primary processor generates those memory references. While executing the reduced version of the executable code, the system predicts a data value returned by a long latency operation within the executable code. The system subsequently uses the predicted data value to continue executing the reduced version of the executable code without having to wait for the long latency operation to complete. The system also stores results of memory references generated by the assist processor into a store that is shared with the primary processor so that the primary processor is able to access the results of the memory references.
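By way of illustration only, the effect of the shared store can be sketched with a toy model. The function name `run`, the 64-byte line size, and the use of a set of line numbers to stand in for the shared store are all assumptions made for this sketch. Timing is collapsed as well: the assist processor is simply assumed to have finished its references before the primary processor starts.

```python
LINE = 64  # assumed cache line size in bytes (not specified in the text)

def run(primary_addrs, assist_addrs=()):
    """Count primary-processor hits and misses in a shared store.

    The shared store is modeled as a set of cache line numbers. This is a
    simplification: capacity, eviction, and real timing are ignored.
    """
    shared = set()
    # The assist processor runs ahead and pulls lines into the shared store.
    for addr in assist_addrs:
        shared.add(addr // LINE)
    hits = misses = 0
    # The primary processor then issues the same pattern of references.
    for addr in primary_addrs:
        line = addr // LINE
        if line in shared:
            hits += 1
        else:
            misses += 1
            shared.add(line)
    return hits, misses

refs = [0, 8, 64, 4096]
without_assist = run(refs)
with_assist = run(refs, assist_addrs=refs)
```

In this model the primary processor misses on every new line when it runs alone, and hits on every reference when the assist processor has already generated the same pattern.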
In one embodiment of the present invention, the system additionally executes the long latency operation while the assist processor continues executing the reduced version of the executable code. Next, the system compares a result of the long latency operation with the predicted data value. If the result of the long latency operation does not match the predicted data value, the system suspends further execution by the assist processor. In a variation on this embodiment, after suspending further execution by the assist processor, the system recommences execution by the assist processor using the result of the long latency operation, wherein execution is recommenced at a point prior to use of the predicted data value.
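The check-and-restart behavior described above might be sketched as follows. This is a minimal sequential model, not the hardware mechanism: the names `assist_execute`, `predict`, `long_latency_op`, and `addresses_from` are hypothetical, and the long latency operation is evaluated after the speculative work rather than concurrently with it.

```python
def assist_execute(predict, long_latency_op, addresses_from):
    """Run ahead on a predicted value, then verify the prediction.

    Returns (list of prefetch addresses issued, True if a restart occurred).
    """
    predicted = predict()
    # Continue past the long latency operation using the predicted value.
    issued = list(addresses_from(predicted))
    # In hardware the operation would complete in the background; this
    # sequential sketch evaluates it after the speculative work.
    actual = long_latency_op()
    if actual == predicted:
        return issued, False
    # Mismatch: suspend, then recommence from the point prior to the first
    # use of the predicted value, this time using the real result.
    issued.extend(addresses_from(actual))
    return issued, True

# Hypothetical dependent code: walk three 8-byte fields from a base pointer.
walk = lambda base: [base + 8 * i for i in range(3)]
correct = assist_execute(lambda: 0x1000, lambda: 0x1000, walk)
mispredicted = assist_execute(lambda: 0x1000, lambda: 0x2000, walk)
```

On a misprediction the wrong-path prefetches have already been issued, so the sketch keeps them in the returned list alongside the references generated after the restart.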
In one embodiment of the present invention, executing the reduced version of the executable code involves selecting loads to predict based upon: how far the assist processor is ahead of the primary processor; predictability of the data value returned by the long latency operation; or a likelihood of a cache miss while performing the long latency operation.
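One possible selection policy is sketched below. The text lists the three criteria as alternatives and does not say how they are weighed, so requiring all three at once is an assumption of this sketch, as are the names `LoadInfo` and `should_predict` and every threshold value. The sketch also assumes that a small lead favors prediction, because a stall would let the primary processor catch up.

```python
from dataclasses import dataclass

@dataclass
class LoadInfo:
    lead: int               # instructions the assist processor is ahead of the primary
    predictability: float   # fraction of past predictions for this load that were correct
    miss_likelihood: float  # estimated probability that this load misses the cache

def should_predict(info, min_lead=200, min_predictability=0.9, min_miss=0.5):
    """Decide whether to speculate on a load's value (hypothetical policy)."""
    # Waiting is affordable when the assist processor is far enough ahead.
    running_short = info.lead < min_lead
    # Speculating on an erratic value mostly produces restarts.
    predictable = info.predictability >= min_predictability
    # A load that hits the cache is not a long latency operation.
    likely_miss = info.miss_likelihood >= min_miss
    return running_short and predictable and likely_miss

decision = should_predict(LoadInfo(lead=50, predictability=0.95, miss_likelihood=0.8))
```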
In one embodiment of the present invention, the long latency operation is either a load operation from memory or a computational operation requiring multiple clock cycles.
In one embodiment of the present invention, predicting the data value returned by the long latency operation involves: predicting a value previously returned by the long latency operation; predicting a function of the value previously returned by the long latency operation; or predicting a default value if the long latency operation is being performed for a first time.
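The three prediction options can be illustrated with a small table-driven predictor. The class name `ValuePredictor`, the per-operation identifiers, and the choice of a stride (previous value plus the last observed difference) as the "function of the value previously returned" are assumptions of this sketch. The text does not name a particular function.

```python
class ValuePredictor:
    """Hypothetical value predictor keyed by an operation identifier."""

    def __init__(self, default=0):
        self.default = default
        self.last = {}    # op id -> value most recently returned
        self.stride = {}  # op id -> difference between the last two values

    def predict(self, op_id, use_stride=False):
        if op_id not in self.last:
            # First execution of this operation: fall back to a default.
            return self.default
        if use_stride:
            # A function of the previous value (assumed here: add the stride).
            return self.last[op_id] + self.stride.get(op_id, 0)
        # The value previously returned by the operation.
        return self.last[op_id]

    def update(self, op_id, actual):
        """Record the real result once the long latency operation completes."""
        if op_id in self.last:
            self.stride[op_id] = actual - self.last[op_id]
        self.last[op_id] = actual

p = ValuePredictor()
first = p.predict("ld1")                     # no history yet
p.update("ld1", 100)
p.update("ld1", 108)
same = p.predict("ld1")                      # last-value prediction
strided = p.predict("ld1", use_stride=True)  # last value plus stride
```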
In one embodiment of the present invention, prior to executing the executable code, the system compiles source code into the executable code for the primary processor. The system also produces the reduced version of the executable code for the assist processor from the executable code for the primary processor by eliminating instructions from the executable code that have no effect on a pattern of memory references generated by the executable code.
In one embodiment of the present invention, producing the reduced version of the executable code involves: converting store instructions into corresponding load instructions; eliminating redundant load instructions directed to previously loaded cache lines; and eliminating code that is used to calculate store values that are not subsequently used in determining address reference patterns for the executable code for the primary processor.
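The three transformations above can be sketched on a toy straight-line instruction list. The tuple-based instruction format, the function name `reduce_code`, and the 64-byte line size are all hypothetical. The sketch handles neither branches nor redundant loads through register addresses, and it is not presented as the compiler pass itself.

```python
LINE = 64  # assumed cache line size in bytes

# Toy instruction forms (hypothetical):
#   ("load", dst, addr)        dst = mem[addr]; addr is a register name or an int
#   ("store", src, addr)       mem[addr] = src
#   ("alu", dst, src1, src2)   dst = src1 op src2; sources are registers or ints

def reduce_code(code):
    # Pass 1: a store only needs its cache line fetched, so convert it into
    # a load whose result is discarded (dst is None).
    code = [("load", None, ins[2]) if ins[0] == "store" else ins for ins in code]

    # Pass 2 (backward): keep only computation that feeds a later address.
    needed, kept = set(), []
    for ins in reversed(code):
        if ins[0] == "load":
            _, dst, addr = ins
            if dst not in needed:
                dst = None            # loaded value never reaches an address
            needed.discard(dst)
            if isinstance(addr, str):
                needed.add(addr)      # the address register must be computed
            kept.append(("load", dst, addr))
        else:
            _, dst, *srcs = ins
            if dst in needed:
                needed.discard(dst)
                needed.update(s for s in srcs if isinstance(s, str))
                kept.append(ins)      # otherwise the instruction is dropped
    kept.reverse()

    # Pass 3 (forward): drop discarded-result loads to lines already fetched.
    seen, out = set(), []
    for ins in kept:
        if ins[0] == "load" and isinstance(ins[2], int):
            line = ins[2] // LINE
            if ins[1] is None and line in seen:
                continue
            seen.add(line)
        out.append(ins)
    return out

original = [
    ("load", "r1", 0x1000),   # pointer, later used to form an address
    ("alu", "r2", "r1", 8),   # address computation
    ("load", "r3", "r2"),     # value used only to compute a store value
    ("alu", "r4", "r3", 1),   # store value computation
    ("store", "r4", 0x1008),  # same cache line as 0x1000
    ("load", "r5", 0x2000),   # result never used
]
reduced = reduce_code(original)
```

In this example the store and the arithmetic that produces its value disappear, while every load that touches a new cache line or feeds an address survives.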
In one embodiment of the present invention, the reduced version of the executable code is modified to speculatively execute code down a branch path that is more frequently taken if the reduced version of the executable code is determined to not be faster than the executable code.