1. Technical Field
The present invention relates generally to an improved data processing system and, in particular, to a method, apparatus, and computer program product for optimizing performance in a data processing system. Still more particularly, the present invention provides a method, apparatus and computer program product for enhancing performance of an in-order microprocessor with long stalls.
2. Description of Related Art
A microprocessor is a silicon chip that contains a central processing unit (CPU) which controls all the other parts of a digital device. Designs vary widely but, in general, the CPU consists of the control unit, the arithmetic and logic unit (ALU) and memory (registers, cache, RAM and ROM) as well as various temporary buffers and other logic. The control unit fetches instructions from memory and decodes them to produce signals which control the other parts of the computer. This may cause the control unit to transfer data between memory and the ALU or to activate peripherals to perform input or output. A parallel computer has several CPUs which may share other resources such as memory and peripherals. In addition to being characterized by bandwidth (the number of bits processed in a single instruction) and clock speed (how many instructions per second the microprocessor can execute), microprocessors are classified as being either RISC (reduced instruction set computer) or CISC (complex instruction set computer).
Pipelining is a technique used in advanced microprocessors in which the microprocessor begins executing a second instruction before the first has been completed. That is, several instructions are in the pipeline simultaneously, each at a different processing stage. The pipeline is divided into segments, and each segment can execute its operation concurrently with the other segments. When a segment completes an operation, the segment passes the result to the next segment in the pipeline and fetches the next operation from the preceding segment. The final results of each instruction emerge at the end of the pipeline in rapid succession. This arrangement allows all the segments to work in parallel, thus giving greater throughput than if each input had to pass through the whole pipeline before the next input could enter. The costs are greater latency and complexity, due to the need to synchronize the segments in some way so that different inputs do not interfere. The pipeline works at full efficiency only if it can be filled and emptied at the same rate at which it can process.
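The throughput benefit described above can be illustrated with a simple cycle count. The following is a minimal sketch, not part of the described system: it assumes an idealized pipeline in which every segment takes exactly one cycle and no stalls occur, and the function names are hypothetical.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Cycles needed when each instruction must pass through every
    stage before the next instruction may enter (assumed 1 cycle/stage)."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """Cycles needed in an idealized pipeline with no stalls.

    The first instruction needs n_stages cycles to fill the pipeline;
    after that, one instruction completes every cycle.
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    n, k = 100, 5
    print("unpipelined:", unpipelined_cycles(n, k))
    print("pipelined:  ", pipelined_cycles(n, k))
```

Under these assumptions, the pipelined count approaches one cycle per instruction as the number of instructions grows, which is why any stall that empties the pipeline is costly.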
In a pipelined in-order processor with long latencies, cache misses and translation misses create long stalls which can hinder performance significantly. Out-of-order machines reduce the penalty incurred when an instruction is unable to execute by allowing other, subsequent instructions to execute independently. The drawback of an out-of-order machine is the tremendous complexity required to find independent instructions and resolve dependency hazards. As processor speed increases, supporting such complexity becomes impractical. The use of touch instructions can reduce the likelihood of a cache miss because touch instructions allow a program to request a cache block fetch before the block is actually needed by the program. But touch instructions require foreknowledge at compile time and occupy instruction slots that could otherwise hold other instructions. Prefetch mechanisms can also reduce cache misses by anticipating which instructions are likely to be executed in the future, but their predictions are inexact.
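The trade-off between stalls and touch instructions can be sketched with a toy cycle model. This is an illustrative simulation under stated assumptions, not a model of any particular processor: every instruction occupies one issue slot, a load that misses blocks all later instructions until its cache block arrives, and a touch starts the block fetch without blocking. The function name, the tuple-based instruction format, and the latency values are all hypothetical.

```python
def in_order_cycles(program, miss_latency):
    """Count cycles for a toy in-order machine.

    program: list of tuples, one of
        ("op",)          an instruction that never stalls
        ("load", addr)   stalls until the block for addr is in the cache
        ("touch", addr)  requests the block for addr, never stalls
    miss_latency: assumed cycles between a request and the block's arrival.
    """
    cycle = 0
    ready_at = {}  # addr -> cycle at which its block is present in the cache
    for instr in program:
        cycle += 1  # every instruction, including a touch, uses an issue slot
        kind = instr[0]
        if kind == "touch":
            # Start the fetch early; keep the earlier time if already requested.
            ready_at.setdefault(instr[1], cycle + miss_latency)
        elif kind == "load":
            arrival = ready_at.setdefault(instr[1], cycle + miss_latency)
            if arrival > cycle:
                # In-order execution: everything behind the load waits too.
                cycle = arrival
    return cycle


if __name__ == "__main__":
    base = [("op",)] * 10 + [("load", "A"), ("op",)]
    touched = [("touch", "A")] + base
    for latency in (5, 20):
        print(latency, in_order_cycles(base, latency),
              in_order_cycles(touched, latency))
```

In this sketch the touch hides part or all of the miss latency, but it always costs one issue slot, and it helps only when the address is known far enough ahead of the load, which mirrors the compile-time foreknowledge drawback noted above.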
Therefore, it would be advantageous to have an improved method, apparatus, and computer program product for reducing time lost to stalls. It would further be advantageous to have a mechanism for enhancing Load/Store performance of an in-order processor that has long stalls.