1. Field of the Invention
This invention relates generally to processors and computers, and, more particularly, to a method and apparatus for predicting when a load instruction can be executed out of order before a prior store instruction.
2. Description of the Related Art
Some processors are referred to as out-of-order processors, because these processors do not execute all instructions in the order of the original instruction sequence. A first type of out-of-order processor employs out-of-order execution to improve the performance of a single execution unit, for example to keep the execution unit operating during delays associated with caches retrieving data from the main memory. A second type of processor has multiple execution units and uses out-of-order methods to enhance performance by keeping these execution units operating concurrently much of the time. Flexible instruction execution in out-of-order processors may improve the overall performance.
FIG. 1 illustrates a simple out-of-order processor 2. An instruction fetcher and decoder 4 continually fetches and decodes instructions from a memory 6. The decoded instructions are transferred to a reorder buffer 8 in an order that preserves the original instruction sequence. The reorder buffer 8 is a memory device that is organized like a table. Each occupied entry 9, 10, 11, 12, 13, 14 of the reorder buffer 8 contains an instruction label 16, a memory address or instruction pointer (IP) 18 for the instruction, one or more status bits 20, and a result location 21 for execution results. The reorder buffer 8 has a circular organization with pointers 22, 24 directed to the oldest and newest instructions 14, 9 transferred therein, respectively. A controller 26 sends instructions from the reorder buffer 8 to execution units 28, 30, 32. The controller 26 checks the status bits 20 of the entries 9, 10, 11, 12, 13, 14 to determine which instructions are unexecuted, i.e., which instructions have status bits indicating "undone." Then, the controller 26 sends unexecuted instructions 9, 10, 12 to the execution units 28, 30, 32. Preferably, the fetcher and decoder 4 tries to keep the reorder buffer 8 full, and preferably, the controller 26 tries to execute instructions in the reorder buffer 8 as fast as possible by executing instructions out of the order of the original instruction sequence where possible. After an instruction is executed, the controller 26 stores the execution results to the result location 21 associated with the executed instruction 11, 14 and changes the status bit 20 to "done." A retirement unit 34 periodically reads the status bits 20 of the entries 9, 10, 11, 12, 13, 14 of the reorder buffer 8 and retires executed or "done" instructions 14 by sending the results to memory or registers 36. The retirement unit 34 retires the executed instructions 11, 14 in an order that strictly follows the original instruction sequence.
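The behavior described above, in which instructions may execute in any order but retire strictly in program order, can be illustrated with a minimal software sketch. The class and method names below (`Entry`, `ReorderBuffer`, `push`, `execute`, `retire`) are hypothetical and chosen only for illustration; the sketch models a circular table with a pointer to the oldest entry and is not a description of any particular hardware implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    label: str                    # instruction label (16 in FIG. 1)
    ip: int                       # instruction pointer (18)
    done: bool = False            # status bit (20): False means "undone"
    result: Optional[int] = None  # result location (21)


class ReorderBuffer:
    """Hypothetical circular table; entries are stored in program order."""

    def __init__(self, capacity: int) -> None:
        self.slots: List[Optional[Entry]] = [None] * capacity
        self.oldest = 0  # index of the oldest entry
        self.count = 0   # number of occupied entries

    def push(self, entry: Entry) -> bool:
        """Add a decoded instruction at the newest position, if space remains."""
        if self.count == len(self.slots):
            return False
        newest = (self.oldest + self.count) % len(self.slots)
        self.slots[newest] = entry
        self.count += 1
        return True

    def execute(self, ip: int, result: int) -> bool:
        """Record a result for any undone entry, regardless of its position."""
        for i in range(self.count):
            entry = self.slots[(self.oldest + i) % len(self.slots)]
            if entry.ip == ip and not entry.done:
                entry.result, entry.done = result, True
                return True
        return False

    def retire(self) -> List[Entry]:
        """Retire done entries from the oldest end only, stopping at the first undone entry."""
        retired = []
        while self.count and self.slots[self.oldest].done:
            retired.append(self.slots[self.oldest])
            self.slots[self.oldest] = None
            self.oldest = (self.oldest + 1) % len(self.slots)
            self.count -= 1
        return retired


rob = ReorderBuffer(4)
for ip, label in enumerate(["add", "load", "store"]):
    rob.push(Entry(label, ip))

rob.execute(2, 7)                         # the newest instruction executes first
first = [e.label for e in rob.retire()]   # nothing retires: the oldest is undone
rob.execute(0, 1)
second = [e.label for e in rob.retire()]  # only the oldest retires
rob.execute(1, 5)
third = [e.label for e in rob.retire()]   # the remaining two retire in order
```

In this sketch the "store" entry finishes first but cannot retire until the older "add" and "load" entries are done, which mirrors the in-order retirement performed by the retirement unit 34.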
Referring to FIG. 1, the processor 2 preferably still executes a dependent instruction, e.g., the load entry 11, only after completing the execution of the instruction on which the dependent instruction depends, e.g., the store entry 12. Load instructions, such as the load instruction 9, that load data from an address different from every address to which an earlier store instruction concurrently in the reorder buffer 8 stores data are independent and may be executed in advance of their position in the instruction order. In the prior art, the retirement unit 34 checked additional status bits 20, indicating dependencies, before retiring executed instructions. If the status bits 20 indicated that the dependent instruction was executed before the instruction on which it depended, as illustrated by the load and store entries 11, 12 of FIG. 1, the retirement unit 34 flushed the execution results of the dependent instruction, i.e., the entry 11 for the load instruction in FIG. 1, from the reorder buffer 8 and reset the status bits 20 of the dependent instruction for re-execution. These flushes and re-executions slowed processor operation. Of course, not advancing the execution of any instructions out of order may also slow the processor by introducing clock periods during which instructions are not executed.
Referring to FIG. 1, the entry 11 for the load instruction cannot be safely executed before the entry 12 for the store instruction, because the store instruction is earlier in the instruction sequence and stores data to the address from which the load instruction loads data. Such store and load instructions are generically known as colliding load and store instructions. Though the entries 11, 12 are clearly colliding at the time illustrated in FIG. 1, the processor 2 often does not know that a generic store instruction will collide with a subsequent load instruction when the controller 26 wants to send the load instruction for execution. In such cases, executing the load instruction first is risky, because the processor may subsequently learn that the load is colliding and have to make a costly flush of the reorder buffer 8. Techniques that predict colliding store instructions may help avoid costly flushes while still allowing the advancement of the execution of load instructions.
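The distinction drawn above between colliding loads, independent loads, and loads whose status is not yet known can be sketched as follows. The function name `classify_load` and the use of `None` to represent a store address that has not yet been computed are assumptions made for illustration only.

```python
from typing import List, Optional


def classify_load(load_addr: int, earlier_store_addrs: List[Optional[int]]) -> str:
    """Classify a load against the earlier stores still in the reorder buffer.

    A store address of None means the address is not yet known
    (an assumption of this sketch, not a convention of the source).
    """
    # A known earlier store to the same address makes the load colliding.
    if any(addr == load_addr for addr in earlier_store_addrs if addr is not None):
        return "colliding"
    # An unresolved store address means a collision cannot be ruled out.
    if any(addr is None for addr in earlier_store_addrs):
        return "unknown"
    # Every earlier store is known to write a different address.
    return "independent"
```

Under these assumptions, only the "unknown" case poses the risk described above: executing the load early may turn out to be safe, or may force a flush once the store address is resolved.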
The prior art has used deterministic techniques to determine when a store instruction will store data to the same address as a subsequent load instruction, i.e. collide with the load. Software and hardware techniques have enabled advancing the execution of loads before earlier stores when the associated data addresses could be determined with certainty. Nevertheless, experience indicates many other load instructions may be safely executed before earlier stores in the instruction sequence even though the absence of a collision cannot be guaranteed with certainty.
The prior art also contains other techniques for speculative advancement of the execution of load instructions. In speculative advancement, the load instructions are provisionally executed before earlier store instructions when it is probable that the load and store instructions involve different data addresses, i.e., are probably not colliding. In one prior art method, the load instruction is marked provisional after being executed before an earlier store instruction. The method stores a verification instruction at the original data address of the provisionally executed load instruction. The data address is checked for modification before retirement of the load instruction from the reorder buffer 8. Purely hardware devices have also been used to speculatively predict colliding store and load pairs. A load predicted to be colliding is either not advanced before the store member of the pair, or data from the store instruction is forwarded so that the load instruction can be safely executed in advance. The prediction of colliding store and load instruction pairs is complicated and may involve substantial processor resources. Furthermore, these prior art techniques miss some load instructions that can be safely executed earlier, and opportunities for faster processor operation are lost.
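The general idea of a provisionally executed load that is checked before retirement can be illustrated with a simplified sketch. It does not reproduce the verification-instruction mechanism of the prior art method described above; instead, it assumes that each earlier store, when it finally executes, marks any provisional load that read the same address. All names (`ProvisionalLoad`, `execute_earlier_store`, `verify_before_retirement`) are hypothetical.

```python
from typing import Dict, List, Tuple


class ProvisionalLoad:
    """A load executed ahead of earlier stores, pending verification."""

    def __init__(self, addr: int, memory: Dict[int, int]) -> None:
        self.addr = addr
        self.value = memory.get(addr, 0)  # speculative read
        self.modified = False             # set if an earlier store hits addr


def execute_earlier_store(memory: Dict[int, int], addr: int, value: int,
                          provisional_loads: List[ProvisionalLoad]) -> None:
    """Execute a store that precedes the provisional loads in program order."""
    memory[addr] = value
    for load in provisional_loads:
        if load.addr == addr:
            load.modified = True  # the speculative value is now stale


def verify_before_retirement(load: ProvisionalLoad,
                             memory: Dict[int, int]) -> Tuple[int, bool]:
    """Return (value, re_executed); re-execute the load if its address was modified."""
    if load.modified:
        load.value = memory.get(load.addr, 0)
        load.modified = False
        return load.value, True
    return load.value, False


memory = {0x10: 1, 0x20: 2}
load = ProvisionalLoad(0x10, memory)               # reads 1 speculatively

execute_earlier_store(memory, 0x20, 9, [load])     # different address
safe = verify_before_retirement(load, memory)      # speculation held

execute_earlier_store(memory, 0x10, 5, [load])     # same address: collision
redone = verify_before_retirement(load, memory)    # load is re-executed
```

The re-execution in the second case corresponds to the cost that collision prediction aims to avoid: the speculative result is discarded and the load is performed again after the store.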
The present invention is directed to overcoming, or at least reducing the effects of, one or more of the problems set forth above.