To achieve higher performance levels, processor and system designers attempt to increase processor and system clock rates and increase the amount of work done per clock period. Among other influences, striving for higher clock rates drives toward de-coupled designs and semi-autonomous units with minimal synchronization between units. Increased work per clock period is often achieved using additional functional units and attempting to fully exploit the available instruction-level parallelism.
While compilers can attempt to expose the instruction-level parallelism which exists in a program, the combination of attempting to minimize path length and a finite number of architected registers often artificially inhibits a compiler from fully exposing the inherent parallelism of a program. There are many situations (such as the instruction sequence below) where register resources prevent a more optimal sequencing of instructions.
FM  FPR5 ← FPR4, FPR4
FMA FPR2 ← FPR3, FPR4, FPR5
FMA FPR4 ← FPR6, FPR7, FPR8
Here, given that most processors have multi-cycle floating point pipelines, the second instruction cannot execute until several cycles after the first instruction starts to execute. In this case, although the source registers of the third instruction might be expected to be available and the third instruction is expected to be ready to execute before the second, the compiler cannot interchange the two instructions without selecting a different register allocation (since the third instruction currently overwrites the FPR4 value used by instruction 2). Often, selecting a register allocation which would be more optimal for this pair of instructions would be in conflict with the optimal register allocation for another instruction pair in the program.
The dynamic behavior of cache misses provides another example where out-of-order execution can exploit more instruction-level parallelism than possible in an in-order machine.
______________________________________
Loop: Load GPR4, 8(GPR5)
      Add  GPR6, GPR6, GPR4
      Load GPR7, 8(GPR3)
      Add  GPR8, GPR8, GPR7
      Load GPR9, 0(GPR6)
      Load GPR2, 0(GPR8)
      . . .
      branch conditional Loop
______________________________________
In this example, on some iterations there will be a cache miss for the first load; on other iterations there will be a cache miss for the second load. While there are logically two independent streams of computation, in an in-order processor, processing will halt shortly after a cache miss and it will not resume until the cache miss has been resolved.
This example also shows a cascading effect of out-of-order execution; by allowing progress beyond a stalled instruction (in this example an instruction which is dependent on a load with a cache miss), subsequent cache misses can be detected and the associated miss penalty can be overlapped (at least partially) with the original miss. The likelihood of overlapping cache miss penalties for multiple misses grows with the ability to support out-of-order load/store execution.
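The benefit of overlapping miss penalties can be illustrated with a toy timing model. The sketch below is only an assumption-laden illustration, not a description of any particular processor: it assumes every miss costs the same fixed penalty, and the function names and the cycle numbers in the usage lines are hypothetical. It compares the cycles spent waiting when each miss is resolved before the next one is detected against the cycles during which at least one miss is outstanding when misses may overlap.

```python
def serialized_miss_cycles(num_misses, penalty):
    """Cycles spent on misses when each miss must finish before the next starts."""
    return num_misses * penalty


def overlapped_miss_cycles(issue_cycles, penalty):
    """Cycles during which at least one miss is outstanding.

    issue_cycles holds the cycle at which each miss is detected; misses
    detected while an earlier miss is still pending overlap with it.
    """
    total = 0
    cur_start = cur_end = None
    for start in sorted(issue_cycles):
        end = start + penalty
        if cur_end is None or start > cur_end:
            # No miss pending: close the previous busy interval, open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # A miss is still pending: extend the current busy interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Hypothetical numbers: two misses detected 3 cycles apart, 20-cycle penalty.
in_order = serialized_miss_cycles(2, 20)
out_of_order = overlapped_miss_cycles([0, 3], 20)
```

With these assumed numbers the second miss is detected while the first is still pending, so most of its penalty is hidden behind the first miss rather than added to it.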
As clock rates continue to rise, the ability to overlap cache miss penalties with useful computation and with other cache misses will be of growing importance.
Many current processors extract much of the available instruction-level parallelism by allowing out-of-order execution for all units except the load/store unit. Mechanisms to support out-of-order execution for non-load/non-store units are well understood; all potential conflicts between two instructions can be detected by simply comparing the register fields specified statically in the instructions.
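The register-field comparison can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the mechanism of any specific processor: the `Instr` record and the `register_conflicts` helper are hypothetical names, and each instruction is reduced to one destination register and a tuple of source registers. The usage lines encode the three floating-point instructions from the earlier example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Instr:
    dest: str               # destination register named in the instruction
    srcs: Tuple[str, ...]   # source registers named in the instruction


def register_conflicts(earlier, later):
    """Return the register conflicts between two instructions in program order."""
    hazards = set()
    if earlier.dest in later.srcs:
        hazards.add("RAW")  # later reads what earlier writes
    if later.dest in earlier.srcs:
        hazards.add("WAR")  # later overwrites what earlier reads
    if earlier.dest == later.dest:
        hazards.add("WAW")  # both write the same register
    return hazards


# The three-instruction floating-point sequence from the text.
i1 = Instr("FPR5", ("FPR4", "FPR4"))
i2 = Instr("FPR2", ("FPR3", "FPR4", "FPR5"))
i3 = Instr("FPR4", ("FPR6", "FPR7", "FPR8"))

first_vs_second = register_conflicts(i1, i2)
second_vs_third = register_conflicts(i2, i3)
```

Under this sketch the second instruction depends on the result of the first, and the third cannot simply be moved ahead of the second because it overwrites FPR4, which the second still reads. Everything needed for the check is present in the instruction encoding, which is why no address information is required.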
Out-of-order execution of storage reference instructions is a considerably more difficult problem, because conflicts can arise through storage locations and cannot be detected without knowledge of the addresses being referenced. The generation of the effective/virtual address and its translation to a real address are normally performed as part of the execution of a storage reference instruction. Therefore, when a storage reference instruction is executed before a logically earlier instruction, the address for the logically earlier instruction is not available for comparison during the execution of the current instruction.
To support loads that execute out of order with respect to stores, a mechanism is required to detect (and correct) each occurrence in which a load executed before a logically earlier store, obtained the data for the location before the store modified it, and should instead have received data that included bytes written by the store.
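The detection condition can be sketched as a byte-range comparison performed when the store's address finally becomes known. This is an illustrative sketch only: the `MemOp` record, its `seq` program-order tag, and the `loads_to_reexecute` helper are hypothetical names introduced here, and the addresses in the usage lines are made up. It does not model how the correction itself is carried out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    seq: int    # position in program order (hypothetical tag)
    addr: int   # real address of the first byte accessed
    size: int   # number of bytes accessed


def bytes_overlap(a, b):
    """True if the two accesses share at least one byte."""
    return a.addr < b.addr + b.size and b.addr < a.addr + a.size


def loads_to_reexecute(store, executed_loads):
    """Loads that already executed, are logically later than the store,
    and read at least one byte the store writes."""
    return [ld for ld in executed_loads
            if ld.seq > store.seq and bytes_overlap(store, ld)]


# Hypothetical case: a 4-byte store executes after three loads have executed.
store = MemOp(seq=1, addr=0x100, size=4)
loads = [
    MemOp(seq=2, addr=0x102, size=4),  # logically later, shares bytes 0x102-0x103
    MemOp(seq=3, addr=0x104, size=4),  # logically later, no shared bytes
    MemOp(seq=0, addr=0x100, size=4),  # logically earlier than the store
]
violations = loads_to_reexecute(store, loads)
```

Only the load that is both logically later than the store and overlaps it by at least one byte is flagged; a partial overlap is enough, since the correct load data would have included bytes from the store.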
Similarly, to execute stores out of order with respect to loads, a mechanism is required to keep a store from destroying data which will be used by a logically earlier load.
Finally, to support loads that execute out of order with respect to each other, a mechanism is required to ensure that any pair of loads (which access at least one byte in common) return data consistent with executing the loads in order. This is an architectural requirement enforced by most, if not all, multiprocessor ("MP") systems.