1. Field of the Invention
The present invention relates to techniques for improving computer system performance. More specifically, the present invention relates to a method and an apparatus for prefetching instructions from memory by using an assist processor to perform prefetch operations in advance of a primary processor.
2. Related Art
As increasing semiconductor integration densities allow more transistors to be integrated onto a microprocessor chip, computer designers are investigating different methods of using these transistors to increase computer system performance. To this end, computer designers are beginning to incorporate multiple central processing units (CPUs) into a single semiconductor chip. This can result in performance gains for computational tasks that can be parallelized (divided) into separate pieces that can be concurrently executed.
Unfortunately, performance gains from parallelization can be limited for many applications that contain inherently serial portions of code. For these inherently serial portions of code, performance is further limited by memory latency problems.
Memory latency problems are growing progressively worse as processor clock speeds continue to improve at an exponential rate. At today's processor clock speeds, it can take as many as 200 processor clock cycles to pull a cache line in from main memory.
Computer designers presently use a number of techniques to decrease memory latency delays. (1) Out-of-order execution can be used to schedule loads and stores so that memory latency is hidden as much as possible. Unfortunately, out-of-order execution is typically limited to hiding a few clock cycles of memory latency. (2) A non-faulting load instruction can be used to speculatively load a data value without causing a fault when the address is not valid. (3) A steering load instruction can be used to speculatively load a data value into L2 cache, but not L1 cache, so that L1 cache is not polluted by unused data values. Unfortunately, using non-faulting loads and steering loads can result in unnecessary loads. This wastes instruction cache space and ties up registers. (4) Some researchers have investigated using hardware prefetch engines, but these hardware prefetch engines are typically ineffective for irregular memory access patterns.
Memory latency delays can also be a problem during instruction fetch operations. Note that an instruction cache miss can cause as much of a delay as a data cache miss. Also note that it is very hard to predict which instructions are likely to be executed next because of the numerous branches and function calls that are commonly interspersed into program code written in modern programming languages.
What is needed is a method and an apparatus that reduce memory latency delays during instruction fetch operations.
One embodiment of the present invention provides a system that prefetches instructions by using an assist processor to perform prefetch operations in advance of a primary processor. The system operates by executing executable code on the primary processor, and simultaneously executing a reduced version of the executable code on the assist processor. This reduced version of the executable code executes more quickly than the executable code, and performs prefetch operations for the primary processor in advance of when the primary processor requires the instructions. The system also stores the prefetched instructions into a cache that is accessible by the primary processor so that the primary processor is able to access the prefetched instructions without having to retrieve the prefetched instructions from a main memory.
In one embodiment of the present invention, prior to executing the executable code, the system compiles source code into executable code for the primary processor. Next, the system profiles the executable code to create instruction traces for frequently referenced portions of the executable code. The system then produces the reduced version of the executable code for the assist processor by producing prefetch instructions to prefetch portions of the instruction traces into a cache that is accessible by the primary processor. The system also inserts communication instructions into the executable code for the primary processor and into the reduced version of the executable code for the assist processor to transfer progress information from the primary processor to the assist processor. This progress information triggers the assist processor to perform the prefetch operations.
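The profiling and prefetch-generation steps described above can be illustrated with a minimal sketch. It assumes a profile is a list of recorded runs, each run being a sequence of instruction block addresses, and that a trace counts as "frequently referenced" once it appears at least `min_count` times. The names `hot_traces`, `emit_prefetches`, and `min_count`, and the textual `prefetch` form, are hypothetical and are not taken from the description.

```python
from collections import Counter

def hot_traces(profile_runs, min_count=2):
    """Return traces seen at least min_count times, most frequent first."""
    counts = Counter(tuple(run) for run in profile_runs)
    return [list(trace) for trace, n in counts.most_common() if n >= min_count]

def emit_prefetches(trace):
    """Produce one (hypothetical) prefetch instruction per address in a trace."""
    return [f"prefetch {addr:#x}" for addr in trace]

# Assumed profile data: three runs follow the same trace, one run differs.
runs = [[0x100, 0x140], [0x100, 0x140], [0x200], [0x100, 0x140]]
reduced_code = [ins for trace in hot_traces(runs) for ins in emit_prefetches(trace)]
print(reduced_code)  # ['prefetch 0x100', 'prefetch 0x140']
```

In this sketch the infrequent trace `[0x200]` contributes nothing to the reduced code, which mirrors the idea that only frequently referenced portions are worth prefetching.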
In one embodiment of the present invention, the process of compiling the source code and the process of producing the reduced version of the executable code are carried out by a compiler.
In one embodiment of the present invention, if the progress information indicates to the assist processor that the assist processor has prefetched instructions down the wrong path, the reduced version of the executable code causes the assist processor to discontinue prefetching.
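One way to picture this behavior is the following sketch, which assumes the assist processor holds a predicted path of block names and receives progress reports naming the block the primary processor has reached. The function name `prefetch_until_mismatch` and the `lookahead` parameter are illustrative assumptions only.

```python
def prefetch_until_mismatch(predicted_path, progress_reports, lookahead=2):
    """Prefetch ahead of each reported position; stop on a wrong-path report."""
    issued = []
    cursor = 0  # index of the next block on the predicted path to prefetch
    for reported in progress_reports:
        if reported not in predicted_path:
            break  # primary left the predicted path: discontinue prefetching
        position = predicted_path.index(reported)
        cursor = max(cursor, position + 1)  # only prefetch strictly ahead
        target = min(position + lookahead, len(predicted_path) - 1)
        while cursor <= target:
            issued.append(predicted_path[cursor])
            cursor += 1
    return issued

predicted = ["A", "B", "C", "D", "E"]
reports = ["A", "B", "X", "D"]  # "X" is off the predicted path
print(prefetch_until_mismatch(predicted, reports))  # ['B', 'C', 'D']
```

Block `E` is never prefetched here because the report `X` arrives first and ends the loop.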
In one embodiment of the present invention, the reduced version of the executable code is configured to read control flow history information from special-purpose hardware that records branch history information and call history information. Next, the reduced version of the executable code constructs a predicted path through the executable code based on the control flow history information, and then performs prefetch operations down the predicted path in order to prefetch instructions for the primary processor.
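As a rough illustration of constructing a predicted path, the sketch below assumes the control flow history has already been read from the hardware and summarized as a count per control flow edge. That representation, and the names `predict_path`, `history`, and `max_blocks`, are assumptions made for the example; the description does not specify the hardware format.

```python
def predict_path(cfg, history, entry, max_blocks=8):
    """Follow the most frequently taken successor from entry.

    cfg maps a block to its successor blocks; history maps an edge
    (block, successor) to how often it was taken. Both are assumed formats.
    """
    path = [entry]
    visited = {entry}
    node = entry
    while len(path) < max_blocks:
        successors = cfg.get(node, [])
        if not successors:
            break
        # Ties go to the first listed successor, since max() keeps the first maximum.
        nxt = max(successors, key=lambda s: history.get((node, s), 0))
        if nxt in visited:
            break  # do not walk around a loop
        path.append(nxt)
        visited.add(nxt)
        node = nxt
    return path

cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
history = {("A", "B"): 3, ("A", "C"): 9, ("C", "D"): 9}
print(predict_path(cfg, history, "A"))  # ['A', 'C', 'D']
```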
In one embodiment of the present invention, producing the reduced version of the executable code involves constructing a control flow graph for the executable code. In doing so, the system removes loops from the control flow graph, and removes executable code instructions unrelated to the control flow graph. The system also inserts the prefetch instructions into the reduced version of the executable code to prefetch instructions from the executable code for the primary processor.
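The reduction can be sketched as follows, under stated assumptions: the control flow graph is a mapping from block names to successor lists, loops are removed by dropping the back edges found during a depth-first search, and instructions are plain strings whose first word is the opcode. The opcode names in `CONTROL_FLOW_OPS` and the textual `prefetch` form are hypothetical.

```python
CONTROL_FLOW_OPS = {"branch", "call", "return"}  # assumed mnemonics

def remove_back_edges(cfg, entry):
    """Return a copy of cfg with depth-first back edges dropped (no loops)."""
    acyclic = {node: [] for node in cfg}
    state = {}  # "active" while a node is on the DFS stack, "done" afterwards

    def visit(node):
        state[node] = "active"
        for succ in cfg[node]:
            if state.get(succ) == "active":
                continue  # back edge closes a loop, so it is removed
            acyclic[node].append(succ)
            if succ not in state:
                visit(succ)
        state[node] = "done"

    visit(entry)
    return acyclic

def reduce_block(name, instructions):
    """Keep only control-flow instructions and prepend a prefetch for the block."""
    kept = [ins for ins in instructions if ins.split()[0] in CONTROL_FLOW_OPS]
    return [f"prefetch {name}"] + kept

cfg = {
    "entry": ["loop_head"],
    "loop_head": ["loop_body", "exit"],
    "loop_body": ["loop_head"],  # back edge forming the loop
    "exit": [],
}
print(remove_back_edges(cfg, "entry")["loop_body"])  # []
print(reduce_block("loop_body", ["add r1, r2", "load r3", "branch loop_head"]))
```

Dropping the arithmetic and load instructions is what lets the reduced version run ahead of the full executable code in this sketch.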
In one embodiment of the present invention, performing the prefetch operations involves prefetching cache blocks containing multiple instructions for the primary processor.
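Because a cache block holds several instructions, one prefetch per block is enough to cover every instruction inside it. The sketch below shows the address arithmetic, assuming a 64-byte block size; the text does not specify a size, and `blocks_for_trace` is a hypothetical helper name.

```python
CACHE_BLOCK_BYTES = 64  # assumed block size; not given in the text

def blocks_for_trace(instruction_addresses, block_bytes=CACHE_BLOCK_BYTES):
    """Return distinct cache-block base addresses for a trace, in first-use order."""
    seen = set()
    blocks = []
    for addr in instruction_addresses:
        base = addr - (addr % block_bytes)  # round down to the block boundary
        if base not in seen:
            seen.add(base)
            blocks.append(base)
    return blocks

trace = [0x1000, 0x1004, 0x1008, 0x103C, 0x1040, 0x2000, 0x1004]
print([hex(b) for b in blocks_for_trace(trace)])  # ['0x1000', '0x1040', '0x2000']
```

Seven instruction addresses collapse to three block prefetches in this example.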
In one embodiment of the present invention, the system periodically sends the progress information from the primary processor to the assist processor through a one-way communication channel.
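A software analogy for the one-way channel is a queue that only the primary side writes and only the assist side reads. The sketch below assumes progress is reported once every `report_every` blocks; the class `OneWayChannel`, the function `run_primary`, and that reporting interval are illustrative assumptions, not details from the description.

```python
from collections import deque

class OneWayChannel:
    """Primary-to-assist channel: the sender only appends, the receiver only drains."""

    def __init__(self):
        self._queue = deque()

    def send(self, item):
        self._queue.append(item)

    def receive_all(self):
        items = list(self._queue)
        self._queue.clear()
        return items

def run_primary(blocks, channel, report_every=3):
    """Simulate the primary processor, reporting progress every few blocks."""
    for count, block in enumerate(blocks, start=1):
        # (the primary processor would execute the block here)
        if count % report_every == 0:
            channel.send(block)

channel = OneWayChannel()
run_primary([f"b{i}" for i in range(7)], channel)
received = channel.receive_all()
print(received)  # ['b2', 'b5']
```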
In one embodiment of the present invention, the primary processor and the assist processor reside on the same semiconductor chip.
In one embodiment of the present invention, the primary processor and the assist processor reside on distinct semiconductor chips.
In one embodiment of the present invention, the assist processor is a simplified version of the primary processor.