1. Field Of The Invention
This invention relates to programming operations in a microprocessor, and more specifically, to the signaling of events, including inter alia exceptions, assists, mispredicted branches and state updates, in microcode. The invention is particularly pertinent to speculative, out-of-order processors which predict program flow and execute instructions out-of-order, but may also be used in conventional pipelined and non-pipelined processors.
2. Art Background
I. Program Flow In Pipelined Processors
Simple microprocessors generally process instructions one at a time. Each instruction can be considered as being processed in five sequential stages: instruction fetch (IF), instruction decode (ID), operand fetch (OF), execute (EX), and writeback (WB). During instruction fetch, an instruction pointer from a program counter is sent to an instruction memory, such as an instruction cache, to retrieve an instruction. During instruction decode, the instruction is decoded to obtain an opcode in addition to source and destination register addresses. During operand fetch, a register file is addressed with the source register addresses to return the source operand values. In the execution stage, the instruction and the source operand values are sent to an execution unit for execution. During writeback, the result value of the execution is written to the register file at the destination register address obtained during the instruction decode stage.
In more complex systems, macroinstructions are fetched from memory and translated into one or more microinstructions, which are sequentially placed into the machine by the decoder and executed one at a time (i.e., operand fetch, execute and writeback). The group of microinstructions associated with a macroinstruction is called a flow. Whenever the last microinstruction of a flow has written back, the macroinstruction is said to have completed.
Within the microprocessor, there are two flows of control: one which governs the fetching of instructions (the instruction flow) and one which governs the issue of microinstructions to execute an instruction (the microcode control flow). The execution of an instruction, in and of itself, may change the instruction flow; however, this does not necessarily imply a microcode flow change. Conversely, a branch may be executed by microcode to change the microcode control flow without altering the instruction flow. Certain exceptional conditions may be detected during the execution of a microinstruction which cause a microcode control flow change. These conditions are known as events. Events, however, may or may not result in an instruction flow change, depending on the type of event.
One way in which an event may be generated is through the occurrence of a condition that causes an error in execution of the instruction and requires special macrocode routines to service the condition that was detected during execution. These types of events are referred to as exceptions. Exceptions include faults, traps and interrupts (both software and hardware). Another type of event is an assist, which is a condition detected during execution that requires a special microcode flow to be executed in order to "assist" the processor in handling the condition during its execution. Mispredicted branches are also classified as events. They occur through the execution of conditional branching instructions (either macroinstructions or microinstructions) that were predicted by the microprocessor to take one path, but in their execution, took another path. In these cases, the machine must discontinue fetching and execution on the predicted path and resume fetching and execution on the correct path.
Exceptions, assists, and mispredicted branches are only three of a variety of events which can be posted. These events need not directly affect the instruction or microcode control flow. In some cases, they just cause special updates of the macroarchitectural or microarchitectural state of the processor due to a condition detected by the microinstruction.
Without pipelining, the processing of a simple sequence of instructions including a branch instruction may be depicted as shown in Table 1:
TABLE 1
Pipeline Diagram for a Non-Pipelined Processor
________________________________________________________________________________
Instruction          Time (Cycle #)
#    Operation       1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
________________________________________________________________________________
100  Add . . .       IF  ID  OF  EX  WB
101  Jump 200                            IF  ID  OF  EX  WB
200  Add . . .                                               IF  ID  OF  EX  WB
________________________________________________________________________________
To improve microprocessor efficiency, microprocessor designers overlap the pipeline stages so that the microprocessor operates on several instructions simultaneously. The instruction sequence of Table 1 may be pipelined as follows so that the execution of instruction 100 occurs at the same time as the fetching of the operands for instruction 101:
TABLE 2
Pipeline Diagram for a Pipelined Processor
________________________________________________
Instruction          Time (Cycle #)
#    Operation       1   2   3   4   5   6   7
________________________________________________
100  Add . . .       IF  ID  OF  EX  WB
101  Jump 200            IF  ID  OF  EX  WB
200  Add . . .               IF  ID  OF  EX  WB
________________________________________________
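The cycle counts in Tables 1 and 2 can be reproduced with a short calculation. The following is a minimal sketch, assuming an ideal five-stage pipeline with no stalls or flushes; the function names are illustrative and do not come from the original text.

```python
def cycles_non_pipelined(num_instructions: int, num_stages: int = 5) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return num_instructions * num_stages


def cycles_pipelined(num_instructions: int, num_stages: int = 5) -> int:
    """One instruction enters the pipeline per cycle (ideal case, no stalls)."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages cycles; each later one adds a cycle.
    return num_stages + (num_instructions - 1)


print(cycles_non_pipelined(3))  # 15 cycles, as in Table 1
print(cycles_pipelined(3))      # 7 cycles, as in Table 2
```

For the three-instruction sequence of the tables, overlapping the stages reduces the total from 15 cycles to 7.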
Pipelining improves instruction throughput by overlapping the instruction cycle pipe stages. However, in the case of branch instructions, it may be necessary to fetch the next instruction before it is determined whether the branch instruction is a taken branch or before it is determined whether an exception occurs during execution of the branch instruction. Note that this applies to both instructions and microinstructions. Just as the instruction fetch mechanism speculates on the direction of the branch and continues fetching instructions based on that speculation, so too microinstructions which lie after a conditional microcode branch may be sent into the processor for execution before the branch has been resolved. A microinstruction sequencer speculates on the direction of the microcode branch and continues fetching microinstructions based on that speculation. Note also that the fetching of subsequent instructions or microinstructions may also occur before the processor is able to determine that a preceding instruction or microinstruction generated an event.
For example, in Table 2, the result of the jump instruction at address 101 may not be known until the execution stage of the instruction. Given this, the processor may continue fetching instructions sequentially after 101, as depicted below.
TABLE 3
Pipeline Diagram for a Pipelined Processor
____________________________________________________________
Instruction          Time (Cycle #)
#    Operation       1   2   3   4   5   6   7   8   9   10
____________________________________________________________
100  Add . . .       IF  ID  OF  EX  WB
101  Jump 200            IF  ID  OF  EX  WB
102  . . .                   IF  ID  OF  Flush
103  . . .                       IF
200  Add . . .                           IF  ID  OF  EX  WB
____________________________________________________________
If an exception or a mispredicted branch condition is detected during the execution of instruction 101, then the two subsequent instructions (#102 and #103) were erroneously fetched, and a mechanism must be provided for handling this erroneous fetching, such as flushing or canceling the erroneously fetched instructions. However, this causes a performance penalty in the microprocessor due to the time and resources lost in the fetching and subsequent flushing of the erroneously fetched instructions.
To complicate matters even more, many processors go further than simple pipelining and include superpipelining and/or superscalar operations. Superpipelining increases the granularity of the instruction pipeline; e.g., instead of allocating one clock cycle for instruction execution, two cycles may be employed, such that it would take longer for an exception to be detected. A superscalar processor, on the other hand, executes a plurality of instructions per clock cycle. The addition of these features to a microprocessor further adds to its performance penalty when an event occurs, since more erroneously fetched instructions would have to be flushed from the pipeline.
II. Speculative, Out-Of-Order Processors
In order for pipelined microprocessors to operate efficiently, an instruction fetch unit at the head of the pipeline must continually provide the pipeline with a stream of instructions. However, conditional branch instructions within an instruction stream prevent the instruction fetch unit from fetching subsequent instructions that are known to be correct since the conditions for such instructions are not resolved until execution.
To alleviate this problem, some newer pipelined microprocessors use branch prediction mechanisms that predict the outcome of branches, and then fetch subsequent instructions according to the branch prediction. Branch prediction is achieved using a branch target buffer to store the history of a branch instruction based only upon the instruction pointer or address of that instruction. Every time a branch instruction is fetched, the branch target buffer predicts the target address of the branch using the branch history. For a more detailed discussion of branch prediction, see Tse-Yu Yeh and Yale N. Patt, Two-Level Adaptive Branch Prediction, the 24th ACM/IEEE International Symposium and Workshop on MicroArchitecture, November 1991, and Tse-Yu Yeh and Yale N. Patt, Alternative Implementations of Two-Level Adaptive Branch Prediction, Proceedings of the Nineteenth International Symposium on Computer Architecture, May 1992.
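As an illustration of history-based prediction indexed by instruction address, the following is a minimal sketch of a two-bit saturating counter predictor. This is a simple and widely known scheme chosen for brevity; it is an assumption for illustration, not the two-level adaptive scheme of the cited papers, and it predicts only the branch direction rather than the target address. The class and method names are hypothetical.

```python
class TwoBitPredictor:
    """Per-address 2-bit saturating counters: 0-1 predict not taken, 2-3 taken."""

    def __init__(self):
        self.counters = {}  # branch instruction address -> counter in 0..3

    def predict(self, address: int) -> bool:
        # Unseen branches start at 1 (weakly not taken); this is an assumption.
        return self.counters.get(address, 1) >= 2

    def update(self, address: int, taken: bool) -> None:
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        counter = self.counters.get(address, 1)
        if taken:
            self.counters[address] = min(counter + 1, 3)
        else:
            self.counters[address] = max(counter - 1, 0)


predictor = TwoBitPredictor()
predictor.update(101, taken=True)   # counter 1 -> 2
predictor.update(101, taken=True)   # counter 2 -> 3
predictor.update(101, taken=False)  # counter 3 -> 2
print(predictor.predict(101))       # True: one not-taken outcome does not flip it
```

The two-bit counter tolerates a single deviating outcome, which is why a loop-closing branch is still predicted taken after one loop exit.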
In combination with speculative execution, out-of-order dispatch of instructions to the execution units results in a substantial increase in instruction throughput. With out-of-order completion, any number of instructions are allowed to be in execution in the execution units, up to the total number of pipeline stages in all the functional units. Instructions may complete out of order because instruction dispatch is not stalled when a functional unit takes more than one cycle to compute a result. Consequently, a functional unit may complete an instruction after subsequent instructions have already completed. For a detailed explanation of speculative out-of-order execution, see M. Johnson, Superscalar Microprocessor Design, Prentice Hall, 1991, Chapters 2, 3, 4, and 7.
In a processor using out-of-order execution, instruction dispatch is stalled when there is a conflict for a functional unit or when an issued instruction depends on a result that is not yet computed. In order to prevent or mitigate stalls in decoding, the prior art provides for a temporary storage buffer (referred to herein as a dispatch buffer) between the decode and execute stages. The processor decodes instructions and places (or "issues") them into the dispatch buffer as long as there is room in the buffer, and at the same time, examines instructions in the dispatch buffer to find those that can be dispatched to the execution units (i.e., those instructions for which all source operands and the appropriate execution units are available).
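The selection step performed on the dispatch buffer can be sketched as follows. This is a minimal illustration under stated assumptions: the `Uop` record, the unit names "mem" and "alu", and the function `select_dispatchable` are hypothetical, and each free execution unit is assumed to accept at most one micro-op per cycle.

```python
from dataclasses import dataclass


@dataclass
class Uop:
    name: str
    unit: str       # kind of execution unit required (hypothetical labels)
    sources: tuple  # names of the values this micro-op reads


def select_dispatchable(buffer, ready_values, free_units):
    """Return micro-ops whose source operands are ready and whose unit is free."""
    available = list(free_units)
    chosen = []
    for uop in buffer:  # buffer is scanned oldest first
        operands_ready = all(src in ready_values for src in uop.sources)
        if operands_ready and uop.unit in available:
            available.remove(uop.unit)  # one micro-op per free unit per cycle
            chosen.append(uop)
    return chosen


buffer = [
    Uop("load", "mem", ()),
    Uop("add", "alu", ("eax", "t")),  # waits for the load result t
    Uop("mov", "alu", ()),
]
picked = select_dispatchable(buffer, ready_values={"eax"}, free_units=["mem", "alu"])
print([u.name for u in picked])  # ['load', 'mov']
```

The `add` stays in the buffer because its operand `t` has not been produced yet, while the younger `mov` is dispatched ahead of it.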
Instructions are dispatched from the dispatch buffer to the execution units with little regard for their original program order. However, the capability to issue instructions out-of-order introduces a constraint on register usage. To understand this problem, consider the following pseudomicrocode sequence:
1. t ← load (memory)
2. eax ← add (eax, t)
3. ebx ← add (ebx, eax)
4. eax ← mov (2)
5. edx ← add (eax, 3)
The microinstructions and registers shown above are generic and will be recognized by those familiar with the art as those of the well known Intel Architecture™.
In an out-of-order processor executing these instructions, it is likely that the processor would complete execution of the fourth instruction before the second instruction, because the MOV instruction may require only one clock cycle, while the load instruction and the immediately following ADD instruction may require a total of four clock cycles, for example. However, if the fourth instruction is executed before the second instruction, then the fourth instruction would probably incorrectly overwrite the first operand of the second instruction, leading to an incorrect result. Instead of the second instruction producing a value that the third instruction would use, the fourth instruction produces a value that would destroy a value that the second one uses.
This type of dependency is called a storage conflict, because the reuse of storage locations (including registers) causes instructions to interfere with one another, even though the conflicting instructions are otherwise independent. Such storage conflicts constrain instruction dispatch and reduce performance.
It is known in the art that storage conflicts can be avoided by using register renaming where additional registers are used to reestablish the correspondence between registers and values. Using register renaming, the additional "physical" registers are associated with the original "logical" registers and values needed by the program. To implement this technique, the processor typically allocates a new register for every new value produced (i.e., for every instruction that writes a register). An instruction identifying the original logical register for the purpose of reading its value obtains instead the value in the newly allocated register. Thus, the hardware renames the original register identifier in the instruction to identify the new register and the correct value. The same register identifier in several different instructions may access different hardware registers depending on the locations of register references with respect to the register assignments.
With renaming, the example instruction sequence depicted above becomes:
1. t_a ← load (mem)
2. eax_b ← add (eax_a, t_a)
3. ebx_b ← add (ebx_a, eax_b)
4. eax_c ← mov (2)
5. edx_a ← add (eax_c, 3)
In this sequence, each assignment to a register creates a new instance of the register, denoted by an alphabetic subscript. The creation of a renamed register for eax in the fourth instruction avoids the resource dependency on the second and third instructions, and does not interfere with correctly supplying an operand to the fifth instruction. Renaming allows the fourth instruction to be dispatched immediately, whereas, without renaming, the instruction must be delayed until execution of the second and third instructions. When an instruction is decoded, its result value is assigned a location in a functional storage unit (referred to herein as a reorder buffer), and its destination register number is associated with this location. This renames the destination register to the reorder buffer location. When a subsequent instruction refers to the renamed destination register, in order to obtain the value considered to be stored in the register, the instruction may instead obtain the value stored in the reorder buffer if that value has already been computed.
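The renaming of the five-instruction example can be sketched in a few lines. This is a minimal illustration, assuming that the first reference to a register denotes instance "a" and that every write creates the next instance; the function `rename`, the tuple encoding of the microinstructions, and the ASCII arrow `<-` are choices made for this sketch only.

```python
import string

REGISTERS = {"t", "eax", "ebx", "edx"}  # names treated as registers in this sketch


def rename(sequence):
    """Give every register write a fresh instance letter (a, b, c, ...)."""
    instance = {}  # register name -> index of its current instance
    renamed = []

    def tag(reg):
        return f"{reg}_{string.ascii_lowercase[instance[reg]]}"

    for dest, op, sources in sequence:
        new_sources = []
        for src in sources:
            if src in REGISTERS:
                instance.setdefault(src, 0)  # first read sees instance 'a'
                new_sources.append(tag(src))
            else:
                new_sources.append(str(src))  # immediates and memory pass through
        # A write always creates a new instance of the destination register.
        instance[dest] = instance[dest] + 1 if dest in instance else 0
        renamed.append(f"{tag(dest)} <- {op} ({', '.join(new_sources)})")
    return renamed


program = [
    ("t", "load", ["mem"]),
    ("eax", "add", ["eax", "t"]),
    ("ebx", "add", ["ebx", "eax"]),
    ("eax", "mov", [2]),
    ("edx", "add", ["eax", 3]),
]
for line in rename(program):
    print(line)
```

Sources are renamed before the destination, so the second instruction reads `eax_a` and writes `eax_b`, and the fourth instruction receives its own instance `eax_c` that no earlier instruction reads.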
The use of register renaming in the reorder buffer not only avoids register resource dependencies to permit out-of-order execution, but also plays a key role in speculative execution. If the instruction sequence given above is considered to be part of a predicted branch, then one can see that execution of those instructions using the renamed registers in the reorder buffer has no effect on the actual registers denoted by the instructions. Thus, if it is determined that the branch was mispredicted, the results calculated and stored in the reorder buffer may be erased and the pipeline flushed without affecting the actual registers found in the processor's register file. If the predicted branch affected the values in the register file, then it would be difficult to recover from branch misprediction because it would be difficult to determine the values stored in the registers before the predicted branch was taken without the use of redundant registers in the reorder buffer.
When a result is generated by an execution unit, it is written back to the reorder buffer. The result may also provide an input operand to one or more waiting instructions buffered in the dispatch buffer, indicating that the source operand is ready for dispatch to one or more execution units along with the instructions using the operand. After the value is written into the reorder buffer, subsequent instructions continue to fetch the value from the reorder buffer, unless the entry is superseded by a new register assignment and until the value is retired by writing it to the register file.
After the processor determines that the predicted instruction flow is correct, the processor commits the speculative results of those instructions that were stored in the reorder buffer to an architectural state by writing those results to the register file. This process is known as retirement, wherein the instructions are architecturally committed or retired according to their original program order (i.e., the original instruction sequence).
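The interplay of writeback, in-order retirement, and flushing on a misprediction can be sketched as follows. This is a minimal model under stated assumptions: the `ReorderBuffer` class, its method names, and the dictionary-based entries are hypothetical and greatly simplified relative to real hardware.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # oldest entry at the left (program order)
        self.register_file = {}  # architectural (committed) state

    def allocate(self, dest):
        """Reserve an entry for an instruction when it is issued."""
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def writeback(self, entry, value):
        """Record a result; this may happen in any order."""
        entry["value"] = value
        entry["done"] = True

    def retire(self):
        """Commit completed entries from the head, in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.register_file[entry["dest"]] = entry["value"]
            retired.append(entry["dest"])
        return retired

    def flush(self):
        """Discard all speculative results; the register file is untouched."""
        self.entries.clear()


rob = ReorderBuffer()
first = rob.allocate("eax")
second = rob.allocate("ebx")
rob.writeback(second, 7)   # the younger instruction completes first
print(rob.retire())        # []: the older instruction has not completed
rob.writeback(first, 5)
print(rob.retire())        # ['eax', 'ebx']: retired in program order
speculative = rob.allocate("eax")
rob.writeback(speculative, 99)
rob.flush()                # e.g., the branch was mispredicted
print(rob.register_file)   # {'eax': 5, 'ebx': 7}
```

Because speculative values live only in the buffer until retirement, a flush leaves the committed register values intact.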
III. Event Signaling In Conventional Processors
Depending upon the particular type of event that occurs, a conventional processor, such as the i486™ and Pentium™ processors manufactured by Intel Corporation, will either log the event and update certain pieces of state without affecting the flow of control or abort execution of the microinstruction and all subsequent microinstructions and transfer microcode control to a microcode routine which handles the event. In the latter case, the routine may attempt to continue executing the flow associated with the instruction which caused the event once the condition is handled, or it may perform a system call (in Intel Architecture™ terms, a "far call") to a macrocode routine which handles the condition. In accordance with the Intel Architecture™, a far call is made when a defined exception such as a fault, trap, hardware interrupt or software interrupt occurs.
Generally speaking, the characteristics of a system call operation vary widely based on the mode of the processor. Such characteristics include the number and size of items pushed onto the stack, the method used to obtain a pointer to the system call routine which handles the event, the size and format of the system call routine pointer, and various other checks performed (in fact, this far call may itself cause an event). As a result of these differences, the handler call microcode flow must determine the mode of the processor and select the appropriate microcode flow to perform the system call operation and branch to that flow.
Hence, in order to perform a system call to an event handler in signaling an event, at least two conditional branches must be taken. First, the microinstruction sequencer is instructed either directly by the functional unit through hardware, or indirectly via microcode, to perform a conditional jump to the handler call microcode flow. That is, a jump is requested on the condition that an event needs to be signaled. Second, a conditional microcode jump is then used to select the proper system call routine based upon the state of the microprocessor. It is further noted that assists and other events which do not require a system call may still require different actions that are dependent on the mode of the processor. As a result, they would also require a similar set of conditional branches to reach the appropriate microcode to handle the event.
However, a significant drawback in the method described above is that the required conditional jumps give rise to branch penalties that result in a performance loss for the processor. Because conditional branches are predicted before they are finally determined in the execute stage, the occurrence of a mispredicted branch will require that instructions issued after the conditional jump be flushed from the pipeline, thereby causing a severe performance penalty due to the time lost in fetching the improper instructions and in decoding new instructions subsequent to the flushing. In addition, the use of conditional jumps, whether properly predicted or not, may cause the pipeline to be stalled due to the time required by most processors in evaluating the condition and computing the destination.
Accordingly, it would be advantageous to provide in a processor a mechanism for signaling events that require mode dependent handling actions while minimizing branch penalties by avoiding the mode-dependent conditional branches needed to reach the appropriate microcode handler.
IV. Event Signaling In Speculative, Out-Of-Order Processors
The out-of-order nature of more contemporary processors also presents further problems in the signaling of events in such architectures. With respect to the pipe stages for the processing of instructions in an out-of-order microprocessor, the processor architecture can be functionally broken down into an in-order, front-end section and an out-of-order, back-end section as follows:
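IN-ORDER, FRONT-END            OUT-OF-ORDER, BACK-END
1) Instruction Fetching        5) Instruction Execution
2) Instruction Decoding        6) Result Writeback
3) Register Renaming           7) Instruction Retirement
4) Instruction Allocation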
Because the fetch, decode, rename and allocate stages are disjoint from the execute, writeback and retirement stages of the processor, communicating events that occur in the in-order, front-end to the out-of-order, back-end of the processor becomes quite difficult. This is because the decoded instructions are allocated reorder buffer entries only at the boundary between the in-order and out-of-order sections of the processor. Thus, if an event is detected by a front-end functional unit, signaling the event directly to the back-end functional units is prevented since no reorder buffer entry has yet been allocated for the corresponding instruction.
More importantly, events are harder to handle when the execution of instructions is out-of-order. Due to the fact that instructions are issued in-order, executed out-of-order, and retired in-order, an execution unit cannot simply signal an event at writeback and expect it to be handled at that time. This is because events in an out-of-order architecture must be handled precisely; that is, all preceding microinstructions must have executed and retired successfully, while all following microinstructions may have to be canceled. Therefore, it would be desirable to provide a means for communicating to the retirement logic that an event occurred during execution of a microinstruction in addition to a means for storing event related information until the event can be handled by the retirement logic.
Additionally, when an execution unit detects an event during the execution of a microinstruction, it is possible that the same execution unit will later be called upon to execute a microinstruction that was issued prior to the one it is currently executing, and this second microinstruction could also generate an event. This complicates the operation of the execution unit, since it must determine, for each microinstruction which causes an event, whether that microinstruction was issued before or after the previous one which caused an event. If the microinstruction generating the new event was issued before the one which previously caused an event, the new event information would also have to be saved.
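The requirement that events be handled precisely, regardless of the order in which they are detected, can be illustrated with a small sketch. It assumes, purely for illustration, that event information is stored with each microinstruction and examined in program order at retirement; the function `retire_with_events`, the dictionary keys, and the treatment of the event-causing microinstruction as not retired are all assumptions of this sketch.

```python
def retire_with_events(entries):
    """Walk microinstructions in program order and stop at the first event.

    entries: list of dicts in program order with keys 'name', 'done', 'event'.
    Returns (retired names, signaled (name, event) or None, canceled names).
    """
    retired = []
    for index, entry in enumerate(entries):
        if not entry["done"]:
            # An older microinstruction is still executing; nothing younger
            # may retire or have its event handled yet.
            return retired, None, []
        if entry["event"] is not None:
            # The oldest event wins; all younger microinstructions are canceled.
            canceled = [e["name"] for e in entries[index + 1:]]
            return retired, (entry["name"], entry["event"]), canceled
        retired.append(entry["name"])
    return retired, None, []


# u3 detected its event first in time, but u2 is older in program order.
entries = [
    {"name": "u1", "done": True, "event": None},
    {"name": "u2", "done": True, "event": "fault"},
    {"name": "u3", "done": True, "event": "assist"},
    {"name": "u4", "done": True, "event": None},
]
print(retire_with_events(entries))
# (['u1'], ('u2', 'fault'), ['u3', 'u4'])
```

Deferring the decision to retirement removes the need for an execution unit to compare the ages of the events it detects, since only the oldest event in program order is ever signaled.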
A further complication is the fact that certain actions, including state information updates, should occur whenever certain classes of events occur. For instance, events which require microcode handlers must have a pointer to the microinstruction which caused the event saved for them. Also, several event types require that various units be signaled as to their occurrence and that various buffers be drained. Thus, aside from the problems associated with communicating events between the front-end and back-end of the processor and providing an appropriate storage means for buffering event related information, some means would also have to be provided for saving and updating appropriate state information.
Accordingly, it is an object of the present invention to provide a method and apparatus for signaling events in a microprocessor via a microinstruction in order to minimize the use of mode-dependent conditional jumps and their attendant branch penalties.
It is another object of the present invention to provide a method and apparatus in an out-of-order microprocessor for signaling microcode detectable events occurring in the in-order, front-end to the out-of-order, back-end of the microprocessor in order to reduce the complexity and expense of implementing dedicated hardware that would otherwise be required for updating the appropriate functional units.
It is a further object of the present invention to provide a method and apparatus for signaling events in an out-of-order microprocessor via a microinstruction which utilizes the architecture of the reorder buffer to post events in the out-of-order, back-end of the microprocessor.
It is yet another object of the present invention to provide a mechanism by which microcode can specify values to be placed in the data, flags, and event fields of a physical destination register located in the reorder buffer via a microinstruction executed by an execution unit after an event has been detected.