1. Field of the Invention
The present invention relates to pipelined microprocessors, and more particularly to achieving maximum throughput of dependent operations in a pipelined processor.
2. Art Background
Simple microprocessors generally process instructions one at a time. Each instruction is processed using four sequential stages: instruction fetch, instruction decode, execute, and result write back to the register file or memory. Within such microprocessors, different dedicated logic blocks perform each processing stage. Each logic block waits until all the previous logic blocks complete operations before beginning its operation.
To improve microprocessor efficiency, microprocessor designers overlapped the operations of the fetch, decode, execute, and write back stages such that the microprocessor operates on several instructions simultaneously. In operation, the fetch, decode, execute, and write back stages concurrently process different instructions. At each clock cycle the results of each processing stage are passed to the following processing stage. Microprocessors that use the technique of overlapping the fetch, decode, execute, and write back stages are known as "pipelined" microprocessors.
In order for pipelined microprocessors to operate efficiently, an instruction fetch unit at the head of the pipeline must continually provide the pipeline with a stream of instructions. However, conditional branch instructions within an instruction stream prevent an instruction fetch unit at the head of a pipeline from fetching the correct instructions until the condition is resolved. Since the condition will not be resolved until further down the pipeline, the instruction fetch unit cannot necessarily fetch the proper instructions.
To alleviate this problem, some newer pipelined microprocessors use branch prediction mechanisms that predict the outcome of branches, and then fetch subsequent instructions according to the branch prediction. Branch prediction is achieved using a branch target buffer (BTB) to store the history of a branch instruction based only upon the instruction pointer or address of that instruction. Every time a branch instruction is fetched, the BTB predicts the target address of the branch using the branch history. For a more detailed discussion of branch prediction, please refer to Tse Yu Yeh and Yale N. Patt, Two-Level Adaptive Branch Prediction, the 24th ACM/IEEE International Symposium and Workshop on MicroArchitecture, November 1991, and Tse Yu Yeh and Yale N. Patt, Alternative Implementations of Two-Level Adaptive Branch Prediction, Proceedings of the Nineteenth International Symposium on Computer Architecture, May 1992.
In combination with speculative execution, out-of-order dispatch of instructions to the execution units results in a substantial increase in instruction throughput. With out-of-order completion, any number of instructions are allowed to be in execution in the execution units, up to the total number of pipeline stages in all the functional units. Instructions may complete out of order because instruction dispatch is not stalled when a functional unit takes more than one cycle to compute a result. Consequently, a functional unit may complete an instruction after subsequent instructions have already completed. For a detailed explanation of speculative out-of-order execution, please refer to M. Johnson, Superscalar Microprocessor Design, Prentice Hall, 1991, Chapters 2, 3, 4, and 7.
In a processor using out-of-order completion, instruction dispatch is stalled when there is a conflict for a functional unit or when an issued instruction depends on a result that is not yet computed. In order to prevent or mitigate stalls in decoding, the prior art provides for a buffer (known as a reservation station (RS)) between the decode and execute stages. The processor decodes instructions and places them into the reservation station as long as there is room in the buffer, and at the same time, examines instructions in the reservation station to find those that can be dispatched to the execution units (that is, instructions for which all source operands and the appropriate execution units are available).
Instructions are dispatched from the reservation station with little regard for their original program order. However, the capability to issue instructions out-of-order introduces a constraint on register usage. To understand this problem, consider the following pseudo-microcode sequence: ##STR1##
The micro-instructions and registers shown above are those of the well known Intel microprocessor architecture. For further information, reference may be made to the i486.TM. Microprocessor Programmers Reference Manual, published by Osborne-McGraw-Hill, 1990, which is also available directly from Intel Corporation of Santa Clara, Calif.
In an out-of-order machine executing these instructions, it is likely that the machine would complete execution of the fourth instruction before the second instruction, because the third ADD instruction may require only one clock cycle, while the load instruction and the immediately following ADD instruction may require a total of four clock cycles, for example. However, if the fourth instruction is executed before the second instruction, then the fourth instruction would probably incorrectly overwrite the first operand of the second instruction, leading to an incorrect result. Instead of the second instruction producing a value that the third instruction would use, the third instruction produces a value that would destroy a value that the second one uses.
This type of dependency is called a storage conflict, because the reuse of storage locations (including registers) causes instructions to interfere with one another, even though the conflicting instructions are otherwise independent. Such storage conflicts constrain instruction dispatch and reduce performance.
It is known in the art that storage conflicts can be avoided by providing additional registers that are used to reestablish the correspondence between registers and values. Using register renaming, these additional "physical" registers are associated with the original "logical" registers and values needed by the program. To implement register renaming, the processor typically allocates a new register for every new value produced, i.e., for every instruction that writes a register. An instruction identifying the original logical register for the purpose of reading its value obtains instead the value in the newly allocated register. Thus, the hardware renames the original register identifier in the instruction to identify the new register and the correct value. The same register identifier in several different instructions may access different hardware registers depending on the locations of register references with respect to the register assignments.
With renaming, the example instruction sequence depicted above becomes: ##STR2##
In this sequence, each assignment to a register creates a new instance of the register, denoted by an alphabetic subscript. The creation of a renamed register for eax in the fourth instruction avoids the resource dependency on the second and third instructions, and does not interfere with correctly supplying an operand to the fifth instruction. Renaming allows the fourth instruction to be dispatched immediately, whereas, without renaming, the instruction must be delayed until execution of the second and third instructions. When an instruction is decoded, its result value is assigned a location in a functional unit called a reorder buffer (ROB), and its destination register number is associated with this location. This renames the destination register to the reorder buffer location. When a subsequent instruction refers to the renamed destination register, in order to obtain the value considered to be stored in the register the instruction may instead obtain the value stored in the reorder buffer if that value has already been computed.
The use of register renaming in the ROB not only avoids register resource dependencies to permit out-of-order execution, but also plays a key role in speculative execution. If the instruction sequence given above is considered to be part of a predicted branch, then one can see that execution of those instructions using the renamed registers in the ROB has no effect on the actual registers denoted by instruction. Thus, if it is determined that the branch was mispredicted, the results calculated and stored in the ROB may be erased and the pipeline flushed without affecting the actual registers found in the processor's register file (RF). If the predicted branch affected the values in the RF, then it would be difficult to recover from branch misprediction because it would be difficult to determine the values stored in the registers before the predicted branch was taken without the use of redundant registers in the ROB.
When a result is produced, it is written to the ROB. The result may provide an input operand to one or more waiting instructions buffered in the reservation station, indicating that the source operand is ready for dispatch to one or more execution units along with the instructions using the operand. When dependent instructions are pipelined, the process of waiting for the result data to be written back from an execution unit in order to determine the availability of a source operand adds latency to the system, thereby limiting instruction throughput. Thus, it is desired to find a means for increasing the throughput of dependent instructions in a pipelined processor.