1. Field of the Invention
This invention relates to programming operations in a microprocessor, and more specifically, to the use of a conditional move microinstruction for implementing processor state dependent operations. The invention is particularly pertinent to speculative, out-of-order processors which predict program flow and execute instructions out-of-order, but may also be used in conventional pipelined and non-pipelined processors.
2. Art Background
I. State Dependent Operations in Pipelined, In-Order Microprocessors
Simple microprocessors generally process instructions one at a time. Each instruction can be considered as being processed in five sequential stages: instruction fetch, instruction decode, operand fetch, execute and writeback. During instruction fetch, an instruction pointer from a program counter is sent to an instruction memory, such as an instruction cache, to retrieve a macroinstruction. The macroinstruction is decoded into microinstructions or micro-operations (uops) which specify an opcode in addition to source and destination register addresses. During operand fetch, a register file is addressed with the source register addresses to return the source operand values. In the execution stage, the uop and the source operand values are sent to an execution unit for execution. During writeback, the result value of the microinstruction execution is written to the register file at the destination register address encoded in the microinstruction.
Within simple microprocessors, different dedicated logic blocks perform each processing stage. Each logic block waits until all the previous logic blocks complete operations before beginning its operation. Without pipelining, the microprocessor processes the uops sequentially one after another. However, to improve microprocessor efficiency, microprocessor architectures are now designed with overlapped pipeline stages so that the microprocessor can operate on several uops simultaneously.
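As a rough illustration of the benefit of overlapped pipeline stages, the following Python sketch counts clock cycles for a stream of uops. It assumes, purely for illustration, that each of the five stages takes exactly one clock cycle and that no stalls occur; the function names are hypothetical and do not describe any particular processor.

```python
def cycles_unpipelined(num_uops, num_stages=5):
    """Each uop passes through every stage before the next uop starts."""
    return num_uops * num_stages


def cycles_pipelined(num_uops, num_stages=5):
    """Stages overlap: after the pipeline fills, one uop completes per cycle.

    Assumes one cycle per stage and no stalls (an idealization).
    """
    if num_uops == 0:
        return 0
    return num_stages + (num_uops - 1)


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, cycles_unpipelined(n), cycles_pipelined(n))
```

Under these assumptions the pipelined cycle count approaches one cycle per uop as the stream grows, which is why a stall that empties the pipeline is costly.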
In the processing of state dependent instructions, the results derived from execution of these instructions depend upon the current state of the microprocessor. But since the state of the processor may be changed by certain control instructions which may be fetched and decoded but not executed before the fetching of the state dependent instructions, it is possible that some state dependent instructions will be erroneously fetched. This is because the fetching of a state dependent instruction is based upon a processor state that may subsequently be modified by a previously fetched control instruction. In this case, the processor would have to detect the change in state and the fact that a particular state dependent instruction was erroneously fetched so as to stop the execution of the state dependent instruction and cause a fault to occur indicating that the state dependent instruction should not be executed and that another flow of uops should be fetched.
In order to prevent such a situation from occurring, conventional pipelined processors are designed to detect the existence of a control instruction at the decode stage and stall the pipeline by issuing fake uops (no-ops) to the execution unit until the result of the control instruction (i.e. a possible change in state) is determined during its execution. Once the control instruction reaches the execution unit and its execution is complete, the decoder is informed of any change in state and can resume the normal fetching of instructions. Obviously, processors which utilize this method incur a performance penalty due to the number of clock cycles that are wasted during the pipeline stall.
Additionally, in the execution of state dependent instructions, many clock cycles are required to access the state information needed and to resolve their dependencies. For example, in execution of a privileged instruction, the processor would have to read the proper control registers, place the information in the proper format, compare the proper values and perform a select (i.e. a conditional move operation) based upon the comparison. For example, consider the relatively complex pseudo-instruction shown below:
    IF [(CPL=0) & (IOPL=3) & (VME)], THEN SELECT A (INSTRUCTION EXECUTION), OR ELSE B (FAULT)
In order to resolve the above condition, the following pseudo-uops would be required:
    A: T0 := compare (CPL, 0)
       T1 := select_Equal (A, B)
    B: T0 := compare (IOPL, 3)
       T1 := select_Equal (T1, B)
    C: T0 := compare (VME, TRUE)
       T1 := select_Equal (T1, B)
The value within register T1 can then be checked by microcode to determine whether execution of the instruction can proceed (T1=A) or whether a fault must be posted (T1=B). In calculation of these operations, however, the processor would require many clock cycles (i.e. approximately 5) in order to (A) read the CPL control register, do a mask to get the lower 2 bits of the CPL register, and compare CPL to 0; (B) read the IOPL value, mask and shift the value, and compare IOPL to 3; and (C) read the processor mode, mask the mode value, and check to see if the mode is enabled. Nonetheless, even after all this has been done, the result of these calculations may indicate that sufficient privilege does not exist, thereby requiring microcode to signal a fault to the writeback logic of the execution unit so that a fault can be posted instead of executing the privileged instruction.
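The chain of compare and select operations described above may be sketched in Python as follows. This is only a behavioral model: the helper `select_equal` takes the comparison result as an explicit argument, whereas the pseudo-uops read it implicitly from T0, and the function names and the string values "A" and "B" are assumptions made for illustration.

```python
def select_equal(was_equal, if_equal, otherwise):
    """Model of a select: pick if_equal when the prior compare matched."""
    return if_equal if was_equal else otherwise


def check_privilege(cpl, iopl, vme, a="A", b="B"):
    """Return a (execute) only if CPL == 0, IOPL == 3 and VME is set."""
    t0 = (cpl == 0)               # step A: compare CPL with 0
    t1 = select_equal(t0, a, b)
    t0 = (iopl == 3)              # step B: compare IOPL with 3
    t1 = select_equal(t0, t1, b)  # keep the earlier result or fall to b
    t0 = (bool(vme) is True)      # step C: check the mode bit
    t1 = select_equal(t0, t1, b)
    return t1


if __name__ == "__main__":
    print(check_privilege(0, 3, True))   # all three conditions hold
    print(check_privilege(3, 3, True))   # CPL check fails
```

Because each select passes the previous T1 forward only on a match, a single failed comparison anywhere in the chain leaves B in T1.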
Furthermore, the performance of privilege or mode sensitive algorithms and updates based on processor state (i.e. instructions that modify the control flags based upon processor mode) also gives rise to similar performance penalties. In the case where instructions which modify the control flags are executed, for example STI, CLI and IRET in the Intel architecture, the execution unit will take several cycles to determine the current processor mode. Thereafter, based on the current mode, a jump (or branch) will or will not be taken to an algorithm or routine which determines whether a particular control flag will be modified. Yet, for processors which predict the flow of instructions instead of stalling the pipeline, if the branch is speculatively taken and later found to be mispredicted, more cycles will be lost due to the instructions that were speculatively fetched which must now be canceled or flushed from the pipeline.
Hence, the performance of the above state dependent operations in conventional in-order, pipelined processors significantly reduces the efficiency of the processor due to the wasted cycles needed to stall the processor upon detection of control instructions and those required to resolve the conditions of state dependent instructions or recover from mispredicted branches.
II. Speculative, Out-of-Order Processors
For pipelined microprocessors to operate more efficiently, an instruction fetch unit at the head of the pipeline must continually provide the pipeline with a stream of instructions. However, conditional branch instructions within an instruction stream prevent the instruction fetch unit from fetching subsequent instructions that are known to be correct since the conditions for such instructions are not resolved until execution.
To alleviate this problem, some newer pipelined microprocessors use branch prediction mechanisms that predict the outcome of branches, and then fetch subsequent instructions according to the branch prediction. Branch prediction is achieved using a branch target buffer to store the history of a branch instruction based only upon the instruction pointer or address of that instruction. Every time a branch instruction is fetched, the branch target buffer predicts the target address of the branch using the branch history. For a more detailed discussion of branch prediction, please refer to Tse Yu Yeh and Yale N. Patt, Two-Level Adaptive Branch Prediction, the 24th ACM/IEEE International Symposium and Workshop on MicroArchitecture, November 1991, and Tse Yu Yeh and Yale N. Patt, Alternative Implementations of Two-Level Adaptive Branch Prediction, Proceedings of the Nineteenth International Symposium on Computer Architecture, May 1992.
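A minimal Python sketch of a history table indexed by instruction pointer is given below. The two-bit saturating counter used here is a common textbook scheme chosen only for illustration; it is not the two-level adaptive scheme of the cited references, and the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Toy predictor keyed only by the branch's instruction pointer."""

    def __init__(self):
        # ip -> [two-bit counter (0..3), last known target address]
        self.entries = {}

    def predict(self, ip):
        """Return the predicted target, or None for 'not taken'."""
        entry = self.entries.get(ip)
        if entry is None or entry[0] < 2:
            return None
        return entry[1]

    def update(self, ip, taken, target):
        """Record the actual outcome once the branch has executed."""
        counter, old_target = self.entries.get(ip, [1, target])
        if taken:
            counter = min(counter + 1, 3)
            old_target = target
        else:
            counter = max(counter - 1, 0)
        self.entries[ip] = [counter, old_target]


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(btb.predict(0x40))       # no history yet
    btb.update(0x40, True, 0x80)
    print(btb.predict(0x40))
```

The fetch unit would consult `predict` when a branch is fetched and call `update` only after the branch resolves at execution.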
In combination with speculative execution, out-of-order dispatch of instructions to the execution units results in a substantial increase in instruction throughput. With out-of-order completion, any number of instructions are allowed to be in execution in the execution units, up to the total number of pipeline stages in all the functional units. Instructions may complete out of order because instruction dispatch is not stalled when a functional unit takes more than one cycle to compute a result. Consequently, a functional unit may complete an instruction after subsequent instructions have already completed. For a detailed explanation of speculative out-of-order execution, please refer to M. Johnson, Superscalar Microprocessor Design, Prentice Hall, 1991, Chapters 2, 3, 4, and 7.
In a processor using out-of-order execution, instruction dispatch is stalled when there is a conflict for a functional unit or when an issued instruction depends on a result that is not yet computed. In order to prevent or mitigate stalls in decoding, the prior art provides for a temporary storage buffer (referred to herein as a dispatch buffer) between the decode and execute stages. The processor decodes instructions and places (or "issues") them into the dispatch buffer as long as there is room in the buffer, and at the same time, examines instructions in the dispatch buffer to find those that can be dispatched to the execution units (i.e. those instructions for which all source operands and the appropriate execution units are available).
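The selection of dispatchable instructions from the dispatch buffer may be modeled by the following Python sketch. It assumes, for illustration only, that each execution unit accepts one uop per cycle and that older buffer entries are given priority; the dictionary layout and all names are hypothetical.

```python
def select_dispatchable(buffer, ready_values, free_units):
    """Return names of buffered uops that can be dispatched this cycle.

    A uop is dispatchable when all of its source operands are ready and
    the execution unit it needs has not already been claimed.
    """
    free = set(free_units)
    chosen = []
    for uop in buffer:  # oldest entry first
        sources_ready = all(src in ready_values for src in uop["srcs"])
        if sources_ready and uop["unit"] in free:
            free.remove(uop["unit"])  # one uop per unit per cycle
            chosen.append(uop["name"])
    return chosen


if __name__ == "__main__":
    buffer = [
        {"name": "u1", "srcs": [], "unit": "mem"},
        {"name": "u2", "srcs": ["r1", "t"], "unit": "alu"},  # waits for t
        {"name": "u3", "srcs": [], "unit": "alu"},
    ]
    print(select_dispatchable(buffer, {"r1", "r2"}, ["mem", "alu"]))
```

In the usage example, u3 is dispatched ahead of the older u2 because u2 is still waiting on an operand, which is the out-of-order behavior described above.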
Instructions are dispatched from the dispatch buffer to the execution units with little regard for their original program order. However, the capability to issue instructions out-of-order introduces a constraint on register usage. To understand this problem, consider the following pseudo-microcode sequence:
    1. t ← load (memory)
    2. eax ← add (eax, t)
    3. ebx ← add (ebx, eax)
    4. eax ← mov (2)
    5. edx ← add (eax, 3)
The micro-instructions and registers shown above are generic and will be recognized by those familiar with the art as those of the well-known Intel microprocessor architecture.
In an out-of-order machine executing these instructions, it is likely that the machine would complete execution of the fourth instruction before the second instruction, because the fourth instruction (a move of an immediate value) may require only one clock cycle, while the load instruction and the immediately following ADD instruction may require a total of four clock cycles, for example. However, if the fourth instruction is executed before the second instruction, then the fourth instruction would probably incorrectly overwrite the first operand of the second instruction, leading to an incorrect result. Instead of the second instruction producing a value that the third instruction would use, the fourth instruction produces a value that would destroy a value that the second one uses.
This type of dependency is called a storage conflict, because the reuse of storage locations (including registers) causes instructions to interfere with one another, even though the conflicting instructions are otherwise independent. Such storage conflicts constrain instruction dispatch and reduce performance.
It is known in the art that storage conflicts can be avoided by using register renaming where additional registers are used to reestablish the correspondence between registers and values. Using register renaming, the additional "physical" registers are associated with the original "logical" registers and values needed by the program. To implement this technique, the processor typically allocates a new register for every new value produced (i.e., for every instruction that writes a register). An instruction identifying the original logical register for the purpose of reading its value obtains instead the value in the newly allocated register. Thus, the hardware renames the original register identifier in the instruction to identify the new register and the correct value. The same register identifier in several different instructions may access different hardware registers depending on the locations of register references with respect to the register assignments.
With renaming, the example instruction sequence depicted above becomes:
    1. t_a ← load (mem)
    2. eax_b ← add (eax_a, t_a)
    3. ebx_b ← add (ebx_a, eax_b)
    4. eax_c ← mov (2)
    5. edx_a ← add (eax_c, 3)
In this sequence, each assignment to a register creates a new instance of the register, denoted by an alphabetic subscript. The creation of a renamed register for eax in the fourth instruction avoids the resource dependency on the second and third instructions, and does not interfere with correctly supplying an operand to the fifth instruction. Renaming allows the fourth instruction to be dispatched immediately, whereas, without renaming, the instruction must be delayed until execution of the second and third instructions. When an instruction is decoded, its result value is assigned a location in a functional storage unit (referred to herein as a reorder buffer), and its destination register number is associated with this location. This renames the destination register to the reorder buffer location. When a subsequent instruction refers to the renamed destination register, in order to obtain the value considered to be stored in the register, the instruction may instead obtain the value stored in the reorder buffer if that value has already been computed.
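The renaming rule just described, in which every write to a register creates a new instance, may be sketched in Python as follows. The tuple encoding of the uops, the set `REGISTERS` and the function name are assumptions made for illustration; a register that is read before it is ever written is taken to hold an initial instance with subscript a.

```python
import string

# Hypothetical set of logical register names used by the example.
REGISTERS = {"t", "eax", "ebx", "edx"}


def rename(program):
    """Give each register write a fresh instance: eax_a, eax_b, ..."""
    latest = {}   # logical register -> index of its current instance
    renamed = []
    for dest, op, sources in program:
        new_sources = []
        for src in sources:
            if src in REGISTERS:
                latest.setdefault(src, 0)  # value present before the sequence
                suffix = string.ascii_lowercase[latest[src]]
                new_sources.append(f"{src}_{suffix}")
            else:
                new_sources.append(str(src))  # immediate or memory operand
        # Sources are read first, then the destination gets a new instance.
        latest[dest] = latest.get(dest, -1) + 1
        suffix = string.ascii_lowercase[latest[dest]]
        renamed.append(f"{dest}_{suffix} <- {op}({', '.join(new_sources)})")
    return renamed


PROGRAM = [
    ("t", "load", ["mem"]),
    ("eax", "add", ["eax", "t"]),
    ("ebx", "add", ["ebx", "eax"]),
    ("eax", "mov", [2]),
    ("edx", "add", ["eax", 3]),
]

if __name__ == "__main__":
    for line in rename(PROGRAM):
        print(line)
```

Applied to the five-instruction example, the sketch yields the same subscripts as the renamed sequence: the fourth instruction writes a new instance of eax and therefore no longer conflicts with the second and third instructions.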
The use of register renaming in the reorder buffer not only avoids register resource dependencies to permit out-of-order execution, but also plays a key role in speculative execution. If the instruction sequence given above is considered to be part of a predicted branch, then one can see that execution of those instructions using the renamed registers in the reorder buffer has no effect on the actual registers denoted by the instructions. Thus, if it is determined that the branch was mispredicted, the results calculated and stored in the reorder buffer may be erased and the pipeline flushed without affecting the actual registers found in the processor's register file. If the predicted branch affected the values in the register file, then it would be difficult to recover from branch misprediction because it would be difficult to determine the values stored in the registers before the predicted branch was taken without the use of redundant registers in the reorder buffer.
When a result is output from an execution unit, it is written back to the reorder buffer. The result may also provide an input operand to one or more waiting instructions buffered in the dispatch buffer, indicating that the source operand is ready for dispatch to one or more execution units along with the instructions using the operand. After the value is written into the reorder buffer, subsequent instructions continue to fetch the value from the reorder buffer, unless the entry is superseded by a new register assignment and until the value is retired by writing it to the register file.
After the processor determines that the predicted instruction flow is correct, the processor commits the speculative results of those instructions that were stored in the reorder buffer to an architectural state by writing those results to the register file. This process is known as retirement wherein the instructions are architecturally committed or retired according to their original program order (i.e. the original instruction sequence).
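The interplay of out-of-order writeback, in-order retirement and flushing on a misprediction may be illustrated by the following Python sketch. It is a simplified behavioral model under assumed names; an actual reorder buffer would also track branch resolution, faults and entry indices, none of which is modeled here.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: results commit to the register file in order."""

    def __init__(self):
        self.entries = deque()   # oldest entry at the left
        self.register_file = {}  # architectural (committed) state

    def allocate(self, dest):
        """Reserve an entry at decode time, in program order."""
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def writeback(self, entry, value):
        """Store a result; this may happen out of program order."""
        entry["value"] = value
        entry["done"] = True

    def retire(self):
        """Commit completed entries from the head only, preserving order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.register_file[entry["dest"]] = entry["value"]
            retired.append(entry["dest"])
        return retired

    def flush(self):
        """Discard speculative results; the register file is untouched."""
        self.entries.clear()


if __name__ == "__main__":
    rob = ReorderBuffer()
    first, second = rob.allocate("eax"), rob.allocate("ebx")
    rob.writeback(second, 7)   # younger uop finishes first
    print(rob.retire())        # nothing retires yet
    rob.writeback(first, 5)
    print(rob.retire(), rob.register_file)
```

Because only the head of the buffer may retire, a completed younger result waits behind an older incomplete one, and a flush never disturbs values already committed to the register file.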
III. State Dependent Operations in Out-of-Order Processors
In out-of-order microprocessors, the processor state needed for execution of state dependent instructions is located either in the register file or in microcode control registers distributed throughout the processor's architecture. However, due to the speculative, out-of-order nature of the processor, the problems involved with processing state dependent operations, such as checking privileged instructions, executing privilege or mode sensitive algorithms and updating processor state, become much worse.
One problem is in the out-of-order nature of execution which gives rise to significantly greater performance penalties. The number of pipestages for an out-of-order processor between the decode stage and the retirement stage (where the register file is updated) is increased by approximately 10 stages over that for an in-order processor. Hence, a pipeline stall at the decode stage caused by a control instruction requesting a change of state would waste many more cycles in an out-of-order processor, thereby increasing the performance penalty to an unacceptable value. However, out-of-order execution does not, in and of itself, increase the length of the pipeline. In one embodiment of the present invention, the microprocessor uses superpipelining, a technique which increases the number of stages in each pipe while shortening each stage. This is done so that pipe stages which require short periods of time to execute are not penalized due to longer periods required by preceding or subsequent pipe stages. This technique is what increases the number of pipe stages in the present invention over past implementations. The primary effect of out-of-order execution is the increase in the number of microinstructions which may be outstanding in the portion of the pipeline which supports out-of-order execution. Also, note that out-of-order execution allows operations which come after a given operation to contend for execution unit resources in some cases. This can further lengthen the time a microinstruction spends in the pipeline.
Similarly, the pipeline length together with the size of the reorder buffer determines the number of speculative uops that are in the pipeline at any one time, this number ranging between approximately 30 and 50 uops. Therefore, the cost of taking a speculative branch (i.e. by predicting the result of a conditional move or jump instruction) later found to be mispredicted (at the execute stage) would give rise to another unacceptable performance penalty due to the large number of speculative uops that would have to be flushed in addition to the lost opportunity costs in terms of the clock cycles wasted by the flushed uops.
Furthermore, with regard to microcode determining processor state at the decode stage, such as with privileged instruction checking, the disjunction between the instruction decoder, the execution units and the retirement logic in an out-of-order processor would also require a substantial investment in hardware and microcode to enable state updates to occur at the various functional units throughout the processor. Since the back-end, out-of-order functional units have little control over the front-end, in-order functional units, a substantial amount of communications or signaling hardware would have to be implemented between the decoder and the updated processor state kept in the real register file and in microcode registers throughout the processor. Even so, the broadcasting of state updates would cause more penalties due to the multiple state updates required for each state change.
Accordingly, it is an object of the present invention to provide a method and apparatus in a microprocessor for conditionally selecting one of two data values based upon control states of a processor via a microinstruction.
It is another object of the present invention to provide a method and apparatus for performing processor state dependent operations in an out-of-order processor through the use of microcode while minimizing performance penalties caused by pipe stalling, conditional moves and conditional jumps.
It is a further object of the present invention to provide a method and apparatus for performing privileged instruction checking, privilege or mode sensitive algorithm execution and privileged updating in an out-of-order processor through the use of a microinstruction that avoids the complexity and expense of dedicated hardware that would otherwise have to be implemented between the front-end and back-end of the processor.