1. Field of the Invention
The present invention relates to pipelined microprocessors, and more particularly to achieving maximum throughput of dependent operations in a pipelined processor.
2. Art Background
Simple microprocessors generally process instructions one at a time. Each instruction is processed using four sequential stages: instruction fetch, instruction decode, execute, and result write back to the register file or memory. Within such microprocessors, different dedicated logic blocks perform each processing stage. Each logic block waits until all the previous logic blocks complete operations before beginning its operation.
To improve microprocessor efficiency, microprocessor designers overlapped the operations of the fetch, decode, execute, and write back stages such that the microprocessor operates on several instructions simultaneously. In operation, the fetch, decode, execute, and write back stages concurrently process different instructions. At each clock cycle the results of each processing stage are passed to the following processing stage. Microprocessors that use the technique of overlapping the fetch, decode, execute, and write back stages are known as “pipelined” microprocessors.
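The benefit of overlapping the stages can be illustrated with simple arithmetic. The following Python sketch assumes an idealized four-stage pipeline in which every stage takes exactly one clock cycle and nothing stalls; the function names are hypothetical and are not taken from any particular design.

```python
STAGES = ("fetch", "decode", "execute", "write back")


def sequential_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles needed when each instruction completes every stage before the next begins."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles needed when a new instruction enters the pipeline on every clock."""
    if num_instructions == 0:
        return 0
    # The first instruction fills the pipeline; each later one finishes one cycle later.
    return num_stages + (num_instructions - 1)


def pipeline_occupancy(num_instructions, num_stages=len(STAGES)):
    """For each cycle, list the instruction index held by each stage (None if empty)."""
    rows = []
    for cycle in range(pipelined_cycles(num_instructions, num_stages)):
        row = []
        for stage in range(num_stages):
            index = cycle - stage  # instruction i occupies stage s during cycle i + s
            row.append(index if 0 <= index < num_instructions else None)
        rows.append(row)
    return rows


if __name__ == "__main__":
    print(sequential_cycles(5), pipelined_cycles(5))
    for row in pipeline_occupancy(3):
        print(row)
```

Under these assumptions, five instructions need 20 cycles when processed one at a time and 8 cycles when pipelined.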
In order for pipelined microprocessors to operate efficiently, an instruction fetch unit at the head of the pipeline must continually provide the pipeline with a stream of instructions. However, conditional branch instructions within an instruction stream prevent an instruction fetch unit at the head of a pipeline from fetching the correct instructions until the condition is resolved. Since the condition will not be resolved until further down the pipeline, the instruction fetch unit cannot necessarily fetch the proper instructions.
To alleviate this problem, some newer pipelined microprocessors use branch prediction mechanisms that predict the outcome of branches, and then fetch subsequent instructions according to the branch prediction. Branch prediction is achieved using a branch target buffer (BTB) to store the history of a branch instruction based only upon the instruction pointer or address of that instruction. Every time a branch instruction is fetched, the BTB predicts the target address of the branch using the branch history. For a more detailed discussion of branch prediction, please refer to Tse Yu Yeh and Yale N. Patt, Two-Level Adaptive Branch Prediction, the 24th ACM/IEEE International Symposium and Workshop on MicroArchitecture, November 1991, and Tse Yu Yeh and Yale N. Patt, Alternative Implementations of Two-Level Adaptive Branch Prediction, Proceedings of the Nineteenth International Symposium on Computer Architecture, May 1992.
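As a rough illustration of a history table indexed only by the branch address, the Python sketch below models a branch target buffer. It is an assumption-laden simplification: it uses a two-bit saturating counter per branch, which is a simpler scheme than the two-level adaptive predictors described in the cited papers, and the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Toy BTB: per-branch two-bit saturating counter plus the last taken target."""

    def __init__(self):
        # branch address -> (counter in 0..3, last observed taken target)
        self._entries = {}

    def predict(self, branch_addr, fall_through_addr):
        """Return the address the fetch unit should fetch from next."""
        entry = self._entries.get(branch_addr)
        if entry is None:
            return fall_through_addr  # assumed policy: no history means not taken
        counter, target = entry
        return target if counter >= 2 else fall_through_addr

    def update(self, branch_addr, taken, target):
        """Record the resolved outcome of a branch once its condition is known."""
        # Assumed initial state for a new branch: counter 1 ("weakly not taken").
        counter, old_target = self._entries.get(branch_addr, (1, target))
        if taken:
            counter = min(counter + 1, 3)
            old_target = target
        else:
            counter = max(counter - 1, 0)
        self._entries[branch_addr] = (counter, old_target)


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(hex(btb.predict(0x100, 0x104)))
    btb.update(0x100, taken=True, target=0x200)
    print(hex(btb.predict(0x100, 0x104)))
```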
In combination with speculative execution, out-of-order dispatch of instructions to the execution units results in a substantial increase in instruction throughput. With out-of-order completion, any number of instructions are allowed to be in execution in the execution units, up to the total number of pipeline stages in all the functional units. Instructions may complete out of order because instruction dispatch is not stalled when a functional unit takes more than one cycle to compute a result. Consequently, a functional unit may complete an instruction after subsequent instructions have already completed. For a detailed explanation of speculative out-of-order execution, refer to M. Johnson, Superscalar Microprocessor Design, Prentice Hall, 1991, Chapters 2, 3, 4, and 7.
In a processor using out-of-order completion, instruction dispatch is stalled when there is a conflict for a functional unit or when an issued instruction depends on a result that is not yet computed. In order to prevent or mitigate stalls in decoding, a buffer known as a reservation station (RS) may be provided between the decode and execute stages. The processor decodes instructions and places them into the reservation station as long as there is room in the buffer, and at the same time, examines instructions in the reservation station to find those that can be dispatched to the execution units (that is, instructions for which all source operands and the appropriate execution units are available).
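The buffering behavior described above might be sketched in Python as follows. This is only a minimal model under stated assumptions: entries are scanned oldest first, each execution unit accepts one instruction per cycle, and the names Entry, ReservationStation, allocate, and dispatch are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str            # label for the buffered instruction
    unit: str            # kind of execution unit the instruction needs
    sources_ready: bool  # True once all source operands are available


class ReservationStation:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def allocate(self, entry):
        """Place a decoded instruction in the buffer; False means decode must stall."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(entry)
        return True

    def dispatch(self, free_units):
        """Remove and return the entries whose operands and execution unit are available."""
        free = set(free_units)
        dispatched, waiting = [], []
        for entry in self.entries:  # assumed policy: oldest entry first
            if entry.sources_ready and entry.unit in free:
                free.remove(entry.unit)  # a unit takes one instruction per cycle
                dispatched.append(entry)
            else:
                waiting.append(entry)
        self.entries = waiting
        return dispatched


if __name__ == "__main__":
    rs = ReservationStation(capacity=2)
    rs.allocate(Entry("add", "alu", sources_ready=False))
    rs.allocate(Entry("mov", "alu", sources_ready=True))
    print([e.name for e in rs.dispatch({"alu"})])
```

In this sketch the younger mov is dispatched ahead of the older add, whose operands are not yet available.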
Instructions are dispatched from the reservation station with little regard for their original program order. However, the capability to issue instructions out-of-order introduces a constraint on register usage. To understand this problem, consider the following pseudo-microcode sequence:
1. t←load (memory)
2. eax←add (eax, t)
3. ebx←add (ebx, eax)
4. eax←mov (2)
5. edx←add (eax, 3)
The micro-instructions and registers shown above are those of the well-known Intel Microprocessor Architecture. For further information, reference may be made to the i486™ Microprocessor Programmer's Reference Manual, published by Osborne-McGraw-Hill, 1990, which is also available directly from Intel Corporation of Santa Clara, Calif.
In an out-of-order machine executing these instructions, it is likely that the machine would complete execution of the fourth instruction before the second instruction, because the fourth instruction (a MOV of an immediate value) may require only one clock cycle, while the load instruction and the immediately following ADD instruction may require a total of four clock cycles, for example. However, if the fourth instruction is executed before the second instruction, then the fourth instruction would overwrite eax, the first operand of the second instruction, leading to an incorrect result. Instead of the second instruction producing a value that the third instruction would use, the fourth instruction produces a value that destroys a value that the second one uses.
This type of dependency is called a storage conflict, because the reuse of storage locations (including registers) causes instructions to interfere with one another, even though the conflicting instructions are otherwise independent. Such storage conflicts constrain instruction dispatch and reduce performance.
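The storage conflict in the five-instruction sequence above can be reproduced with a small Python model. The initial register contents and the loaded memory value are arbitrary assumptions chosen for illustration, and the helper name run is hypothetical.

```python
# Each instruction number maps to (destination register, function computing its result).
PROGRAM = {
    1: ("t", lambda r, mem: mem),                   # t   <- load (memory)
    2: ("eax", lambda r, mem: r["eax"] + r["t"]),   # eax <- add (eax, t)
    3: ("ebx", lambda r, mem: r["ebx"] + r["eax"]), # ebx <- add (ebx, eax)
    4: ("eax", lambda r, mem: 2),                   # eax <- mov (2)
    5: ("edx", lambda r, mem: r["eax"] + 3),        # edx <- add (eax, 3)
}


def run(order, mem=10):
    """Execute the instructions in the given order against one shared register set."""
    regs = {"t": 0, "eax": 1, "ebx": 100, "edx": 0}  # assumed initial values
    for number in order:
        dest, compute = PROGRAM[number]
        regs[dest] = compute(regs, mem)
    return regs


if __name__ == "__main__":
    in_order = run([1, 2, 3, 4, 5])
    mov_first = run([1, 4, 2, 3, 5])  # instruction 4 completes before instruction 2
    print(in_order["ebx"], mov_first["ebx"])
    print(in_order["edx"], mov_first["edx"])
```

With these assumed values, program order yields ebx = 111 and edx = 5, whereas completing the fourth instruction before the second yields ebx = 112 and edx = 15, because both eax values share a single storage location.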
Storage conflicts may be avoided by providing additional registers that are used to reestablish the correspondence between registers and values. Using register renaming, these additional “physical” registers are associated with the original “logical” registers and values needed by the program. To implement register renaming, the processor may allocate a new register for every new value produced, i.e., for every instruction that writes a register. An instruction identifying the original logical register for the purpose of reading its value obtains instead the value in the newly allocated register. Thus, the hardware renames the original register identifier in the instruction to identify the new register and the correct value. The same register identifier in several different instructions may access different hardware registers depending on the locations of register references with respect to the register assignments.
With renaming, the example instruction sequence depicted above becomes:
1. t_a←load (mem)
2. eax_b←add (eax_a, t_a)
3. ebx_b←add (ebx_a, eax_b)
4. eax_c←mov (2)
5. edx_a←add (eax_c, 3)
In this sequence, each assignment to a register creates a new instance of the register, denoted by an alphabetic subscript. The creation of a renamed register for eax in the fourth instruction avoids the resource dependency on the second and third instructions, and does not interfere with correctly supplying an operand to the fifth instruction. Renaming allows the fourth instruction to be dispatched immediately, whereas, without renaming, the instruction must be delayed until the second and third instructions have executed.
When an instruction is decoded, its result value is assigned a location in a functional unit called a reorder buffer (ROB), and its destination register number is associated with this location. This renames the destination register to the reorder buffer location. When a subsequent instruction refers to the renamed destination register, it may obtain the value from the reorder buffer location instead of from the register itself, provided that value has already been computed.
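The renaming of the example sequence can be reproduced mechanically. The Python sketch below is illustrative only: it writes each register instance as the register name followed by an underscore and a letter in place of the subscript, treats the load as having no register sources, and uses a hypothetical helper named rename. A register that is read before it is ever written is taken to be in its first instance.

```python
import string

PROGRAM = [
    ("t", "load", ()),
    ("eax", "add", ("eax", "t")),
    ("ebx", "add", ("ebx", "eax")),
    ("eax", "mov", (2,)),
    ("edx", "add", ("eax", 3)),
]


def rename(program):
    """Give every register write a fresh instance and point each read at the latest one."""
    instances = {}  # logical register -> number of instances created so far

    def current(reg):
        return f"{reg}_{string.ascii_lowercase[instances[reg] - 1]}"

    renamed = []
    for dest, op, sources in program:
        new_sources = []
        for src in sources:
            if isinstance(src, str):          # a register name
                instances.setdefault(src, 1)  # value that existed before the sequence
                new_sources.append(current(src))
            else:                             # an immediate value is left unchanged
                new_sources.append(src)
        instances[dest] = instances.get(dest, 0) + 1  # allocate a new instance
        renamed.append((current(dest), op, tuple(new_sources)))
    return renamed


if __name__ == "__main__":
    for dest, op, sources in rename(PROGRAM):
        print(dest, "<-", op, sources)
```

Because the fourth instruction writes a new instance of eax, it no longer shares storage with the instance read by the second instruction or written for the third.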
The use of register renaming in the ROB not only avoids register resource dependencies to permit out-of-order execution, but also plays a key role in speculative execution. If the instruction sequence given above is considered to be part of a predicted branch, then one can see that execution of those instructions using the renamed registers in the ROB has no effect on the actual registers denoted by the instructions. Thus, if it is determined that the branch was mispredicted, the results calculated and stored in the ROB may be erased and the pipeline flushed without affecting the actual registers found in the processor's register file (RF). If the predicted branch affected the values in the RF, then it would be difficult to recover from branch misprediction because it would be difficult to determine the values stored in the registers before the mispredicted branch was taken without the use of redundant registers in the ROB.
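The separation between speculative results and the register file can be sketched as follows. This Python model is a deliberate simplification under stated assumptions: the buffer is an ordered list, all entries retire together, and the class and method names are hypothetical.

```python
class SpeculativeBuffer:
    """Toy reorder buffer holding results that have not yet been retired."""

    def __init__(self, register_file):
        self.rf = register_file  # committed, architecturally visible registers
        self.entries = []        # (logical destination, value) in program order

    def write(self, dest, value):
        """Record a speculative result without touching the register file."""
        self.entries.append((dest, value))

    def read(self, reg):
        """Prefer the youngest speculative value; otherwise use the committed one."""
        for dest, value in reversed(self.entries):
            if dest == reg:
                return value
        return self.rf[reg]

    def retire_all(self):
        """Branch resolved as predicted: commit the results in program order."""
        for dest, value in self.entries:
            self.rf[dest] = value
        self.entries.clear()

    def flush(self):
        """Branch mispredicted: discard speculative results; the RF is unchanged."""
        self.entries.clear()


if __name__ == "__main__":
    rf = {"eax": 1}
    rob = SpeculativeBuffer(rf)
    rob.write("eax", 2)
    print(rob.read("eax"), rf["eax"])
    rob.flush()
    print(rob.read("eax"), rf["eax"])
```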
When a result is produced, it is written to the ROB. The result may provide an input operand to one or more waiting instructions buffered in the reservation station, indicating that the source operand is ready for dispatch to one or more execution units along with the instructions using the operand. When dependent instructions are pipelined, the process of waiting for the result data to be written back from an execution unit in order to determine the availability of a source operand adds latency to the system, thereby limiting instruction throughput. Further, for source operands that are immediate values or for source operands that are already retired to architecturally visible registers, waiting for a write back as a result of a ROB read further delays the scheduling of operations that might otherwise be scheduled. Thus, it is desired to find a means for increasing the throughput of dependent instructions in a pipelined processor.
The present invention provides a method and apparatus for maximum throughput scheduling of dependent instructions in a pipelined processor. Each instruction is buffered in a reservation station awaiting dispatch to an execution unit. Dispatch occurs when all of an instruction's source operands are available and the appropriate execution unit is available. Each instruction entry in the reservation station includes at least one source data field for storing a source operand of the instruction and an associated source data valid bit. Maximum throughput or “back-to-back” scheduling is achieved by maximizing the efficiency with which the processor determines the availability of the source operands of a dependent instruction and with which the processor provides those operands to the execution unit executing the dependent instruction. These two operations are implemented through a number of mechanisms.
One mechanism for determining the availability of source operands, and hence the readiness of a dependent instruction for dispatch to an available execution unit, relies on the prospective determination of the availability of a source operand before the operand itself is actually computed as a result of the execution of another instruction. Storage addresses of the source operands of an instruction are stored in a content addressable memory (CAM). Before an instruction is executed and its result data written back, the storage location address of the result is provided to the CAM and associatively compared with the source operand addresses stored therein. A CAM match and its accompanying match bit indicate that the result of the instruction to be executed will provide a source operand to the dependent instruction waiting in the reservation station.
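The associative comparison described above might be modeled as in the Python sketch below. A hardware CAM compares all stored addresses in parallel; this software loop only mimics the result. The class name, the use of strings as storage location addresses, and the (entry, source) indexing are assumptions made for illustration.

```python
class SourceAddressCAM:
    """Toy content addressable memory of source operand storage addresses."""

    def __init__(self):
        self.rows = []  # (reservation station entry index, source index, address)

    def store(self, entry, source, address):
        """Record where a waiting instruction expects one of its operands to be produced."""
        self.rows.append((entry, source, address))

    def match(self, result_address):
        """Return the (entry, source) pairs whose match bit is set for this result address."""
        return {(entry, source)
                for entry, source, address in self.rows
                if address == result_address}


if __name__ == "__main__":
    cam = SourceAddressCAM()
    cam.store(entry=0, source=0, address="rob5")
    cam.store(entry=0, source=1, address="rob7")
    cam.store(entry=1, source=0, address="rob5")
    # The address of a result is presented before the result itself is written back.
    print(sorted(cam.match("rob5")))
```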
Readiness of a source operand may also be determined according to the state of the source data valid bit. Upon allocation of a dependent instruction containing an immediate operand to the reservation station, the source data valid bit associated with the immediate operand is set. Additionally, for allocation of a dependent instruction containing an operand which has already been retired to the processor's real register file (RRF), the source data valid bit associated with the retired operand is set. Also, the valid bit may be set and used to determine the availability of an operand if the result has been computed by a previous instruction that has already been executed.
Based upon the match bits and/or the source valid bits, a ready logic circuit determines whether all source operands of a dependent instruction are available and thus whether an instruction is ready for dispatch to an available execution unit.
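In Boolean terms, the ready determination can be read as requiring, for every source operand, either a set valid bit or a set match bit. The following Python sketch expresses that reading; the function names and the list representation of the bits are assumptions, and selection of an available execution unit is left out.

```python
def source_available(valid_bit, match_bit):
    """A source is usable if its data is already valid or a CAM match says it is about to be."""
    return valid_bit or match_bit


def entry_ready(valid_bits, match_bits):
    """An instruction is ready when every one of its source operands is available."""
    return all(source_available(v, m) for v, m in zip(valid_bits, match_bits))


if __name__ == "__main__":
    # Source 1 is an immediate (valid bit set); source 2 matches a result about to be produced.
    print(entry_ready([True, False], [False, True]))
    # Source 2 has neither a valid bit nor a match, so the instruction must wait.
    print(entry_ready([True, False], [False, False]))
```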
An execution unit receiving a dispatched instruction obtains the source operands by a number of mechanisms. If the operand is an immediate value, then the execution unit receives that value from the source data field of the reservation station entry storing the dispatched instruction. If the operand was already computed through execution of a previous instruction before allocation of the dispatched dependent instruction to the reservation station, then the operand is written to a register buffer. The register buffer comprises a reorder buffer storing speculative result data and a real register file holding retired result data. Upon allocation of the dependent instruction to the reservation station, the operand is written from the register buffer to the appropriate source data field of the instruction in the reservation station. If the operand is computed after allocation, but before dispatch of the dependent instruction, then the operand is written directly to the appropriate source data field of the reservation station entry storing the instruction. Finally, using a bypass mechanism of the present invention, if the operand is computed after dispatch of the dependent instruction, then the source operand is provided directly from the execution unit computing the source operand to a source operand input of the execution unit executing the dependent instruction. In the case of source operands which are immediate values or values which have already retired to the real register file, the source valid bits for these sources may be set early in the pipeline, thus providing for even earlier scheduling of dependent operations.
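The four ways of supplying an operand can be summarized as a decision on when the operand is computed relative to allocation and dispatch of the dependent instruction. The Python sketch below is one possible reading; the cycle-number parameters, the enum names, and the treatment of events falling in the same cycle are assumptions not specified in the text.

```python
from enum import Enum


class OperandPath(Enum):
    IMMEDIATE = "source data field holding the immediate value"
    REGISTER_BUFFER = "copied from the register buffer at allocation"
    RS_WRITE_BACK = "written directly into the reservation station entry"
    BYPASS = "forwarded from the producing execution unit"


def operand_path(is_immediate, result_cycle, allocation_cycle, dispatch_cycle):
    """Pick the path by which a source operand reaches the consuming execution unit."""
    if is_immediate:
        return OperandPath.IMMEDIATE
    if result_cycle < allocation_cycle:   # computed before allocation
        return OperandPath.REGISTER_BUFFER
    if result_cycle < dispatch_cycle:     # computed after allocation, before dispatch
        return OperandPath.RS_WRITE_BACK
    return OperandPath.BYPASS             # computed after dispatch


if __name__ == "__main__":
    for cycle in (1, 4, 6):
        path = operand_path(False, cycle, allocation_cycle=3, dispatch_cycle=5)
        print(cycle, path.name)
```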
Through these mechanisms, the combination of efficiently determining the readiness of an instruction for dispatch and efficiently providing source operands to an execution unit results in maximum instruction execution throughput.