This invention relates to computers and computer system central processing units, and in particular to methods and apparatus for forwarding operands between instructions and for the prediction and out-of-order execution of instructions.
1. Description of the Prior Art
A typical computer includes a register file, a memory and cache unit, an input-output unit and multiple instruction execution units. A number of strategies have been used to increase the speed of program execution in computer systems. Instruction execution is usually carried out in a pipeline wherein a portion of the work of each instruction is accomplished in each of the stages. Within the past few years, many computer systems have employed multiple pipelines (superscalar) so that more than one instruction can be started at each basic cycle of the processing unit.
Because conditional branch instructions are encountered with great frequency in typical computer programs, much design effort has concentrated on predicting the directions that branching instructions will take prior to their actual execution so that a longer stream of instructions can be readied and fed to the multiple execution units. A prediction mechanism that guesses wrong can cause a stall in the feeding of instructions to the execution pipelines and a consequent fall-off in performance.
A problem in designing superscalar computers is that the instructions to be executed in parallel may depend on each other. When one instruction needs the result of another instruction that has not yet completed, the computer system must detect this dependency and arrange for the dependent instruction to wait for that result and then start or resume execution. This operation increases the complexity of the logic used to govern instruction flow.
Some computer systems can execute instructions out-of-order. If an instruction is blocked because a needed operand is not yet available, a later instruction in program order that is not blocked can be started instead. Such out-of-order execution can also mask the effects of having to wait for slow actions such as accessing the main memory, but it further increases the complexity of the control logic.
Out-of-order execution may be accompanied by means for retiring instructions (materializing their results) in program order as is done in the Intel P6.
FIG. 1A shows the organization of a modern superscalar processor from the prior art. In that processor, instructions are fetched by the instruction fetch unit 103 through the bus interface 102 and are decoded by the instruction decode unit 104. The fetch 103 and decode 104 units operate in program order. Decoded instructions are sent to the register alias table 106 where logical register references are converted to physical register references. The resulting instructions form a pool stored in the reorder buffer 105, which is implemented as a content-addressable memory. The reservation station 107 selects operations from the instruction pool in the reorder buffer 105 depending on their status. If an operation has all its operands ready and an execution (pipeline) unit of the correct type is available, then the operation is removed from the reorder buffer and sent to the available execution unit 112, 113. Operation of the reservation station and execution pipelines is an out-of-order process based on dataflow (operand availability) constraints and resource (execution unit) availability, without regard to the original ordering of the program. This means that the processor can make forward progress and do useful work even while some instructions are awaiting the delivery of operands from memory or from execution units. The reservation station unit 107 will look 20 to 30 instructions ahead in the reorder buffer 105 to find instructions that can be executed. Results of instruction execution are sent to the reorder and retirement unit 114 where they await retirement of their predecessor instructions in program order.
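The selection rule described above (start any operation whose operands are ready when an execution unit is free) can be illustrated with a minimal sketch. This is not the circuitry of FIG. 1A: the names `Op` and `issue_cycles`, the latencies, and the number of execution units are hypothetical, all units are assumed interchangeable, and registers are assumed to be already renamed so that each destination is written only once.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    dest: str          # register written by this operation (assumed renamed)
    srcs: tuple        # registers read by this operation
    latency: int = 1   # hypothetical cycles until the result is available

def issue_cycles(ops, ready_regs, num_units=2):
    """Return {op name: cycle in which it was started}.

    `ops` is given in program order; an op is started in the first cycle in
    which all of its source registers are ready and a unit is free.
    """
    ready = set(ready_regs)
    pending = list(ops)
    in_flight = []               # (finish_cycle, dest) of started ops
    started = {}
    cycle = 0
    while pending or in_flight:
        # Results that have finished become available as operands.
        still_running = []
        for finish, dest in in_flight:
            if finish <= cycle:
                ready.add(dest)
            else:
                still_running.append((finish, dest))
        in_flight = still_running

        # Scan the pool in program order, skipping over blocked ops.
        issued_now = 0
        for op in list(pending):
            if issued_now == num_units:
                break
            if all(src in ready for src in op.srcs):
                pending.remove(op)
                in_flight.append((cycle + op.latency, op.dest))
                started[op.name] = cycle
                issued_now += 1

        if pending and issued_now == 0 and not in_flight:
            raise ValueError("remaining ops wait on operands nobody produces")
        cycle += 1
    return started

program = [
    Op("load", "r1", ("r0",), latency=3),   # slow memory access
    Op("add",  "r2", ("r1", "r1")),         # blocked until the load finishes
    Op("mul",  "r3", ("r0", "r0")),         # independent, later in program order
]
print(issue_cycles(program, {"r0"}))  # {'load': 0, 'mul': 0, 'add': 3}
```

In this sketch the `mul` starts in cycle 0, ahead of the `add` that precedes it in program order, because the `add` is still waiting for the result of the load.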
To effectively process instructions well ahead of the current program (retirement) counter, superscalar processors such as shown in FIG. 1A must guess the directions of many future conditional branch instructions. Each such branch prediction can be made with a high degree of accuracy, but the cumulative accuracy falls off markedly as the number of consecutive branches that must be correctly guessed increases. If the directions of two consecutive branches are each guessed with 90% accuracy, then the probability that both are guessed correctly is only 81%.
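The fall-off in cumulative accuracy can be made concrete with a short calculation. The sketch below assumes that every branch is predicted independently with the same accuracy, which is the simplification implied by the 90% and 81% figures above; the function name `path_accuracy` is hypothetical.

```python
def path_accuracy(per_branch_accuracy: float, num_branches: int) -> float:
    """Probability that num_branches consecutive predictions are all correct,
    assuming independent predictions of equal accuracy."""
    return per_branch_accuracy ** num_branches

for n in (1, 2, 5, 10):
    print(n, f"{path_accuracy(0.9, n):.1%}")
# 1 90.0%
# 2 81.0%
# 5 59.0%
# 10 34.9%
```

Under this assumption, a processor that must look past ten branches at 90% per-branch accuracy would be on the correct path only about a third of the time.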
Another performance improvement strategy has been to pack multiple instructions into long packets of instructions and then to fetch and execute these instructions in parallel. This "very long instruction word" (VLIW) approach leaves the determination of which instructions will be executed together to a specialized compiler program. The necessity of recompiling means that old programs cannot easily exploit very-long-instruction-word architectures. VLIW machines are also limited to executing a single instruction stream.
A third approach allows operands to flow among instructions. Any instruction which has all of its input operands available is started as soon as the operands become ready and an execution unit becomes available. This data flow technique can lead to high degrees of parallelism, as a large number of operands could be flowing through a system, triggering large numbers of instruction executions at nearly the same time. Data flow computers have required the use of special computer languages that can express intended data flows. The need for special data flow languages and compilers severely limits the usefulness of data flow computers. In the MOVE architecture, Corporaal & Mulder describe a data-triggered scheme where operands trigger execution of receiving instructions. Their architecture should provide increased parallelism but requires special compiler programs which must discover the operand flows when programs are compiled. These programs will encode at least some of the data flows without requiring programmers to use special data flow languages. However, existing programs for prior architectures cannot be run on MOVE machines.
A problem in prior art computer microarchitectures is to discover those instructions that are independent of each other and that can be executed together. Determining the independence of pairs of instructions in the prior art requires a number of comparisons that grows much faster than the number of instructions that might be considered for concurrent execution. Increasing the size of the "window" of instructions being considered for execution therefore requires more than a linear increase in hardware resources when using prior art computer architectures. For the most part, such expensive comparisons are not needed when the method of the present invention is used to map operand flows. Instructions will be started as soon as their operands are ready.
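The growth in comparisons for the prior art approach can be sketched as follows. The example is an illustration only and does not depict the method of the present invention: it assumes that each source register of an instruction is compared against the destination register of every earlier instruction in the window, that only true (read-after-write) dependencies matter, and that each instruction has two sources. The names `find_dependencies` and `comparisons_for_window` are hypothetical.

```python
def find_dependencies(window):
    """window: list of (dest, srcs) tuples in program order.

    Returns (sorted list of (producer_index, consumer_index) pairs,
             number of register comparisons performed).
    """
    deps = set()
    comparisons = 0
    for j, (_, srcs) in enumerate(window):
        for i in range(j):                 # every earlier instruction
            dest_i = window[i][0]
            for src in srcs:               # every source operand
                comparisons += 1
                if src == dest_i:
                    deps.add((i, j))
    return sorted(deps), comparisons

def comparisons_for_window(n, srcs_per_instruction=2):
    """Closed form for the count above: srcs * n * (n - 1) / 2."""
    return srcs_per_instruction * n * (n - 1) // 2

window = [
    ("r1", ("r0",)),        # 0: r1 <- load [r0]
    ("r2", ("r1", "r1")),   # 1: r2 <- r1 + r1  (depends on 0)
    ("r3", ("r0", "r0")),   # 2: r3 <- r0 * r0  (independent)
]
print(find_dependencies(window))        # ([(0, 1)], 6)
print(comparisons_for_window(8))        # 56
print(comparisons_for_window(16))       # 240
```

Under these assumptions, doubling the window from 8 to 16 instructions raises the comparison count from 56 to 240, which is more than four times as many.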
The foregoing and other objects, features and advantages of the invention will be apparent from the following detailed description in conjunction with the drawings.