To achieve higher performance levels, processor and system designers attempt to increase processor and system clock rates and increase the amount of work done per clock period. Among other influences, striving for higher clock rates drives toward de-coupled designs and semi-autonomous units with minimal synchronization between units. Increased work per clock period is often achieved using additional functional units and attempting to fully exploit the available instruction-level parallelism.
While compilers can attempt to expose the instruction-level parallelism that exists in a program, the need to minimize path length, combined with a finite number of architected registers, often artificially inhibits a compiler from fully exposing the inherent parallelism of a program. In many situations (such as the instruction sequence below), limited register resources prevent a better sequencing of instructions.
FM  FPR5 <- FPR4, FPR4
FMA FPR2 <- FPR3, FPR4, FPR5
FMA FPR4 <- FPR6, FPR7, FPR8
Here, given that most processors have multi-cycle floating point pipelines, the second instruction cannot execute until several cycles after the first instruction starts to execute, because it uses the FPR5 result of the first instruction. In this case, although the source registers of the third instruction might be expected to be available, so that the third instruction would be ready to execute before the second, the compiler cannot interchange the two instructions without selecting a different register allocation (since the third instruction currently overwrites the FPR4 value used by the second instruction). Often, a register allocation that would be better for this pair of instructions would conflict with the best register allocation for another instruction pair in the program.
The dynamic behavior of cache misses provides another example where out-of-order execution can exploit more instruction-level parallelism than possible in an in-order machine.
Loop: Load GPR4, 8(GPR5)
      Add  GPR6, GPR6, GPR4
      Load GPR7, 8(GPR3)
      Add  GPR8, GPR8, GPR7
      Load GPR9, 0(GPR6)
      Load GPR2, 0(GPR8)
      ...
      branch conditional Loop
In this example, on some iterations there will be a cache miss for the first load; on other iterations there will be a cache miss for the second load. While there are logically two independent streams of computation, in an in-order processor, processing will halt shortly after a cache miss and it will not resume until the cache miss has been resolved.
This example also shows a cascading effect of out-of-order execution; by allowing progress beyond a stalled instruction (in this example an instruction which is dependent on a load with a cache miss), subsequent cache misses can be detected and the associated miss penalty can be overlapped (at least partially) with the original miss. The likelihood of overlapping cache miss penalties for multiple misses grows with the ability to support out-of-order load/store execution.
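The benefit of overlapping miss penalties can be illustrated with a toy cycle count. The sketch below is not a model of any particular processor: the latencies `HIT_CYCLES` and `MISS_CYCLES`, the function names, and the per-iteration miss pattern are all assumptions chosen for illustration. It compares a machine that serializes the two load streams with one that lets them proceed concurrently within an iteration.

```python
HIT_CYCLES = 1    # assumed latency of a load that hits in the cache
MISS_CYCLES = 20  # assumed latency of a load that misses in the cache


def latency(is_miss):
    """Cycles taken by one load under the assumed latencies."""
    return MISS_CYCLES if is_miss else HIT_CYCLES


def serialized_cycles(iterations):
    """Toy in-order model: the second stream waits for the first stream's load."""
    return sum(latency(a) + latency(b) for a, b in iterations)


def overlapped_cycles(iterations):
    """Toy out-of-order model: both streams' loads are in flight together."""
    return sum(max(latency(a), latency(b)) for a, b in iterations)


# One tuple per loop iteration: (first load misses?, second load misses?)
iterations = [(True, False), (False, True), (True, True), (False, False)]

print(serialized_cycles(iterations))  # 84
print(overlapped_cycles(iterations))  # 61
```

Under these assumed numbers, the saving comes entirely from iterations in which at least one load misses; the larger the miss latency relative to the hit latency, the larger the gap between the two counts.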
As clock rates continue to rise, being able to overlap the cache miss penalties with useful computation and other cache misses will be of growing importance.
Many current processors extract much of the available instruction-level parallelism by allowing out-of-order execution for all units except for the load/store unit. Mechanisms to support out-of-order execution for non-load/non-store units are well understood; all potential conflicts between two instructions can be detected by simply comparing the register fields specified statically in the instruction.
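A minimal sketch of such a register-field comparison follows, applied to the three-instruction floating point sequence given earlier. The `Instr` type and the `register_conflicts` function are hypothetical names introduced here, and the labels RAW, WAR, and WAW are the conventional terms for read-after-write, write-after-read, and write-after-write conflicts rather than terminology taken from this text.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical decoded instruction: one target register plus its sources."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def register_conflicts(earlier: Instr, later: Instr) -> Set[str]:
    """Return the conflicts that forbid executing `later` ahead of `earlier`."""
    hazards = set()
    if earlier.dest in later.srcs:
        hazards.add("RAW")  # later reads the register that earlier writes
    if later.dest in earlier.srcs:
        hazards.add("WAR")  # later overwrites a register that earlier reads
    if later.dest == earlier.dest:
        hazards.add("WAW")  # both instructions write the same register
    return hazards


i1 = Instr("FM", "FPR5", ("FPR4", "FPR4"))
i2 = Instr("FMA", "FPR2", ("FPR3", "FPR4", "FPR5"))
i3 = Instr("FMA", "FPR4", ("FPR6", "FPR7", "FPR8"))

print(register_conflicts(i1, i2))  # {'RAW'}: i2 needs the FPR5 result of i1
print(register_conflicts(i2, i3))  # {'WAR'}: i3 overwrites FPR4, read by i2
```

Every input to the check is a register name encoded in the instruction itself, so the comparison can be made at decode time, before either instruction executes.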
Out-of-order execution of storage reference instructions is a considerably more difficult problem, as conflicts can arise through storage locations, and these conflicts cannot be detected without knowledge of the addresses being referenced. The generation of the effective/virtual address and its translation to a real address are normally performed as part of the execution of a storage reference instruction. Therefore, when a storage reference instruction is executed before a logically earlier instruction is executed, the address for the logically earlier instruction is not available for comparison during the execution of the current instruction.
When performing load and store instructions in a machine with out-of-order and overlapping execution, if it is determined that a load instruction in execute has an overlapping address with a prior store which has not completed, it is usually necessary to either stall the load instruction until the store has completed or cancel the load and any subsequent instructions.
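The address comparison that triggers this stall or cancel decision can be sketched as a byte-range overlap test. The sketch is illustrative only: `MemOp`, `overlaps`, and `youngest_conflicting_store` are hypothetical names, and it assumes that every pending store's real address and operand length are already known, which is exactly what is not guaranteed when instructions execute out of order.

```python
from typing import List, NamedTuple, Optional


class MemOp(NamedTuple):
    """Hypothetical storage reference: starting real address and length in bytes."""
    addr: int
    size: int


def overlaps(a: MemOp, b: MemOp) -> bool:
    """True if the byte ranges [addr, addr + size) of a and b intersect."""
    return a.addr < b.addr + b.size and b.addr < a.addr + a.size


def youngest_conflicting_store(load: MemOp,
                               pending_stores: List[MemOp]) -> Optional[int]:
    """Return the index of the youngest uncompleted store that overlaps the load.

    `pending_stores` is assumed to be ordered from oldest to youngest.
    Returns None when the load conflicts with no pending store.
    """
    for idx in range(len(pending_stores) - 1, -1, -1):
        if overlaps(load, pending_stores[idx]):
            return idx
    return None


pending = [MemOp(0x100, 8), MemOp(0x200, 4), MemOp(0x104, 4)]

print(youngest_conflicting_store(MemOp(0x106, 4), pending))  # 2
print(youngest_conflicting_store(MemOp(0x300, 8), pending))  # None
```

A non-`None` result corresponds to the case described above, in which the load must either wait for the store to complete or be cancelled together with any subsequent instructions.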
Therefore, there is a need in the art for a system and method for forwarding stored data to a load instruction requiring the data without the need to either stall the load instruction until the store has completed or cancel the load and any subsequent instructions.