In processors, control flow instructions (such as branch instructions), latency due to memory transactions, and instructions requiring multi-cycle operations often prevent the processor from sustaining peak instruction execution bandwidth because "bubbles" are introduced into the pipeline. Conventionally, when an intermediate result from an instruction is not available for a subsequent instruction, the processor ceases execution, or "stalls," until that intermediate result is available. Performance can be improved by implementing speculative out-of-order instruction execution.
Two techniques, speculative execution and out-of-order execution, help to maintain high execution bandwidth in modern processors. Speculative execution is a known technique in which, when a branch is encountered without the information from an earlier process step to select the appropriate branch, a prediction is made and instruction dispatch and execution proceed based on that prediction. If the prediction later proves to be incorrect, the mispredicted instruction sequence must be undone using branch mispredict recovery. Speculative execution allows the processor to continue issuing and executing instructions. Known prediction schemes minimize the frequency of mispredicted execution in order to improve performance. However, maintaining precise state in a speculative machine is complicated and incurs undesirable overhead. Out-of-order execution is a technique that hides memory and multi-cycle instruction latency. In processors implementing out-of-order execution, instructions are dynamically reordered and executed in a different order than the sequential program order to reveal available instruction-level parallelism.
A "precise exception" model has been shown to be an important feature for simplifying software resolution of exception conditions, but maintenance of precise exceptions is complicated in machines implementing speculative out-of-order execution. Machine state is generally processor specific, while architectural state includes all control/status registers, data registers, address registers and all external memory state. For example, SPARC-V9 control/status registers are identified on pages 29-30 of the SPARC-V9 Architecture Manual. Architectural state is what the software and the software programmer see. Machine state is a super-set of architectural state that is processor specific and includes everything else about the state of the machine. A faulting instruction is an instruction which generates an exception. An exception is any situation or condition that tells the processor to stop and investigate the situation that caused the exception before proceeding. An exception need not be an error condition and includes, for example, interrupts. Execution traps may result from exceptions. In a processor implementing a precise exception model, a fault or exception does not modify architectural state. Architectural state has been modified for all instructions prior to the faulting instruction, but architectural state has not been modified for instructions after the faulting instruction. When a precise exception model is not provided, the software must identify the faulting instruction and then calculate a restart point for either retrying the faulting instruction or bypassing the faulting instruction and executing the next instruction.
Precise state maintenance techniques for short-pipelined, single-issue machines are known. Generally, a short-pipelined machine has fewer than about four or five stages including an instruction fetch, issue, execute, and write-back stage wherein state is modified. Single-issue implementations simplify recovery in the event of an exception or misprediction because the pipeline may be cleared without worrying about which instruction in a pipeline stage should be flushed. In these conventional techniques, any exception that may have occurred is detected prior to modifying architectural state. When an exception is detected, the pipeline is flushed of instructions and any writeback of data, status, or results to architectural state is intentionally blocked to prevent modification of architectural state.
In speculative out-of-order superscalar (multi-issue) implementations, maintaining precise state is much more difficult than for a single-issue machine. In such speculative out-of-order machines, instructions which generate errors may execute speculatively, and a method and structure must be provided to undo any architectural state modifications which occur after the location of the fault. Also, exceptions may generally be detected in a different order than program order. Therefore, an out-of-order processor must be able to unscramble the exceptions and determine which instructions should be allowed to complete (and modify architectural state) and which instructions should be undone.
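The "unscrambling" step described above can be illustrated with a minimal sketch. It assumes each instruction carries a hypothetical serial number (`seq`) giving its position in program order and a `faulted` flag; neither name comes from the text, and real hardware would track this information in dedicated structures rather than by sorting.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    seq: int               # position in program order (hypothetical serial number)
    faulted: bool = False  # True if execution raised an exception


def resolve_exceptions(window: List[Instr]) -> Tuple[List[int], Optional[int], List[int]]:
    """Split a window of executed instructions into (commit, fault, undo).

    Exceptions may have been detected in any order, so the oldest faulting
    instruction in program order is the one that must be reported.
    """
    ordered = sorted(window, key=lambda i: i.seq)
    faulting = [i.seq for i in ordered if i.faulted]
    if not faulting:
        # No exception: every instruction may modify architectural state.
        return [i.seq for i in ordered], None, []
    first = faulting[0]
    commit = [i.seq for i in ordered if i.seq < first]  # allowed to complete
    undo = [i.seq for i in ordered if i.seq > first]    # must be undone
    return commit, first, undo


# Completion order 3, 1, 4, 2, 0; instruction 3 faulted first in time,
# but instruction 1 is the earlier fault in program order.
window = [Instr(3, True), Instr(1, True), Instr(4), Instr(2), Instr(0)]
print(resolve_exceptions(window))  # prints "([0], 1, [2, 3, 4])"
```

In this sketch the later fault at instruction 3 is simply discarded along with the other undone instructions, which is consistent with the precise exception model described earlier.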
FIG. 1 shows a conventional approach using a re-order buffer for maintaining precise state in a speculative processor. The re-order buffer is implemented by a first-in/first-out queue that effectively introduces a temporal delay between instruction execution completion (when the result is available) and the time when machine state is modified as the result of execution completion. Precise state is maintained by preventing writeback to memory in the event of a mispredicted branch instruction, execution exception, or like condition. That is, precise state is maintained by preventing state modification while the instruction is still speculative rather than by restoring state to undo mispredicted speculative execution.
FIG. 2 shows a conventional approach for maintaining precise state in a speculative processor by storing state in a "checkpoint" prior to executing an instruction and later restoring state from the stored checkpoint information. In conventional checkpointing, every speculatively executed instruction that may result in modification of machine state is checkpointed. Each processor has a set of state parameters, including for example, all control/status registers, data register values, and the like, that define the state of the machine. In order to restore a previous machine state, all of the state defining parameters must be stored so that they can be restored if and when the need arises. In conventional checkpointing techniques, every state defining parameter is typically stored for every instruction that may modify any one of the state defining parameters. For every checkpointed instruction, conventional checkpointing stores all state information that may be changed by any one of the checkpointed instructions, and not just the state that may be modified by that particular instruction about to be executed. For example, in a machine requiring 100 state parameters to define machine state, if execution of instruction "X" may modify only one control/status register, execution of that instruction will still require storing 100 state parameters, not just the one that may be modified by that instruction.
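The 100-parameter example above can be made concrete with a toy model. In this sketch machine state is simply a dictionary of 100 named parameters; the parameter names (`p0` to `p99`) and the choice of `p7` as the register touched by instruction "X" are illustrative assumptions only.

```python
from typing import Dict, List

NUM_STATE_PARAMS = 100  # matches the 100-parameter example in the text


def make_state() -> Dict[str, int]:
    """Build a toy machine state of NUM_STATE_PARAMS named parameters."""
    return {f"p{i}": 0 for i in range(NUM_STATE_PARAMS)}


def conventional_checkpoint(state: Dict[str, int]) -> Dict[str, int]:
    """Copy every state parameter, regardless of what the instruction touches."""
    return dict(state)


state = make_state()
checkpoints: List[Dict[str, int]] = []

# Instruction "X" may modify only one control/status register, here "p7"
# (a hypothetical name), yet the whole state is saved before it executes.
checkpoints.append(conventional_checkpoint(state))
state["p7"] = 1

stored = len(checkpoints[0])  # parameters saved for this one instruction
modified = sum(1 for k in state if state[k] != checkpoints[0][k])
print(stored, modified)       # prints "100 1"
```

The ratio of stored to modified parameters is the overhead that the fixed-size conventional checkpoint imposes in this toy case.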
FIG. 3 shows the structure and content of an exemplary sequence of instructions and conventional checkpoints in diagrammatic fashion for purposes of illustration only and does not represent actual checkpoint data for any particular machine or CPU. Since conventional checkpoints are of fixed size, each checkpoint must be relatively large compared to the state actually modified by a particular instruction. Recovery using conventional checkpointing includes restoring state using a stored checkpoint closest to but prior to the faulting or mispredicted speculative instruction, backing up the program counter to just after the checkpointed instruction, and re-executing instructions from the checkpointed instruction forward.
Checkpointing, re-order buffers, history buffers, and the use of a future file have been described as ways to manage out-of-order execution and maintain precise state in some circumstances. Conventional checkpointing, which permits restoration of machine state at a checkpoint boundary but not at an arbitrary instruction boundary, is described by Hwu and Patt (W. Hwu and Y. N. Patt, "Checkpoint Repair for High Performance Out-of-order Execution Machines", Proceedings of the 14th Annual Symposium on Computer Architecture, (June 1987), pp. 18-26.). Methods of implementing precise interruptions in pipelined RISC processors are described by Wang and Emnett ("Implementing Precise Interruptions in Pipelined RISC Processors", IEEE Micro, August 1993, pp. 36-43.). Methods of implementing precise interrupts in pipelined processors are also described by Smith and Pleszkun (J. E. Smith and A. R. Pleszkun, "Implementation of Precise Interrupts in Pipelined Processors.", Proceedings of the 12th Annual International Symposium on Computer Architecture, (June 1985), pp. 36-44.). An overview of conventional techniques for designing high-performance superscalar microprocessors is provided by Mike Johnson (M. Johnson [a.k.a. William Johnson], Superscalar Microprocessor Design, Prentice-Hall, Inc., Englewood Cliffs, N.J. 07632, 1991, ISBN 0-13-875634-1.). Each of these references is hereby incorporated by reference in its entirety.
Although at least some of these techniques improve performance, they are not entirely satisfactory because they either limit the degree of speculation or allow only coarse machine state restoration, not instruction-level machine state restoration. Conventional re-order buffer techniques limit the degree of speculation because the re-order buffer length is linear with respect to the degree of speculation supported. That is, for each instruction that may have to be undone, the instruction execution result must be stored in the re-order buffer. For example, if the machine permits sixty-four outstanding and potentially speculative instructions, the re-order buffer must contain at least sixty-four locations for storing results. In conventional checkpointing, the number of checkpoints is typically smaller than the number of outstanding instructions in the machine, but the amount of data stored in each checkpoint is very large (the entire state of the machine at the time). The checkpoint storage requirements place a burden on processor chip substrate area. Ideally, a scheme for maintaining precise state should be scalable, at a linear or lower order relationship (e.g. logarithmic), to larger numbers of concurrently outstanding speculative instructions. Re-order buffer entry storage areas must have sufficient width to store the data, and these techniques also require associative lookup to all entries. This can be difficult for very large re-order buffers.
Furthermore, conventional methods of restoring precise state are incomplete in some environments. For example, consider an instruction, executed as the result of a speculative branch, that modifies the state of an external "dumb" device. Restoration of state typically involves restoring state at a checkpoint boundary prior to the point of the modification to the external device and then forward re-execution of non-faulting instructions, including the instruction that modifies the external device. In such cases re-executing the faulting instruction alone will not undo the effect of the change in the state of the external device, and the re-execution of the non-faulting instructions between the point where state is restored and the faulting instruction causes further problems.
Conventional checkpointing saves machine state at a limited number of different points in the instruction stream and restores the checkpointed architectural state in order to recover from branch mispredictions or execution errors. Conventional checkpointing does not checkpoint each instruction. Conventional checkpointing can restore machine state only at checkpoint boundaries, that is, at the instruction for which a checkpoint was established. When a fault or execution error occurs, the checkpointed state is restored by known techniques, thereby in effect "undoing" all instructions subsequent to the checkpoint, including the faulting instruction(s). Then the instructions are re-executed serially forward, for example in a single-issue mode, from the checkpointed instruction to reach the faulting instruction.
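The restore-then-re-execute recovery described above can be sketched as follows. This is a simplified model under stated assumptions: state is a single register `r0`, instructions are toy functions built by a hypothetical helper `add`, and a checkpoint is taken before every `interval`-th instruction. None of these names come from the text.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Instr = Callable[[State], None]


def add(n: int) -> Instr:
    """Return a toy instruction that adds n to register r0."""
    def op(state: State) -> None:
        state["r0"] += n
    return op


def execute(program: List[Instr], state: State, interval: int) -> Dict[int, State]:
    """Run the program, saving a full copy of state before every
    `interval`-th instruction. Returns {instruction index: saved state}."""
    checkpoints: Dict[int, State] = {}
    for idx, instr in enumerate(program):
        if idx % interval == 0:
            checkpoints[idx] = dict(state)
        instr(state)
    return checkpoints


def recover(program: List[Instr], checkpoints: Dict[int, State],
            fault_index: int) -> Tuple[State, int]:
    """Restore the closest checkpoint at or before the faulting instruction,
    then re-execute serially up to, but not including, that instruction.
    Returns the recovered state and the number of re-executed instructions."""
    base = max(i for i in checkpoints if i <= fault_index)
    state = dict(checkpoints[base])
    for instr in program[base:fault_index]:
        instr(state)
    return state, fault_index - base


program = [add(1), add(2), add(3), add(4), add(5), add(6)]
state = {"r0": 0}
checkpoints = execute(program, state, interval=4)
restored, reexecuted = recover(program, checkpoints, fault_index=3)
print(restored["r0"], reexecuted)  # prints "6 3"
```

Here a fault at instruction 3 forces three instructions to be re-executed because the nearest checkpoint is at instruction 0, which illustrates why restoration only at checkpoint boundaries implies re-execution cost.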
This conventional checkpointing technique allows speculative execution but is disadvantageous in many respects. For example, it does not provide robust exception recovery. On intermittent errors, conventional backup of the machine to the checkpoint prior to the exception creating instruction and then re-execution of the instructions following the checkpoint may not result in deterministic behavior and may compound machine state disruption by propagating erroneously altered state in the re-executed instructions. Conventional checkpointing and machine backup procedures do not attempt to minimize instruction re-execution. Furthermore, catastrophic machine errors such as hardware failures and machine time-outs may deadlock the processor if all of the instructions between the restored checkpointed instruction and the instruction causing the exception are re-executed.
Conventional "re-order buffers" manage speculative instructions in a re-order buffer which is a first-in-first-out (FIFO) memory structure of a predefined size. When an instruction completes execution, data value results from the execution are written into the re-order buffer and, as they move through the buffer and emerge at the top, the data values are written from the re-order buffer into a register file. The size of the re-order buffer effectively defines the delay between completion of execution of the instruction and permanent modification of architectural state. Once the data values are written to the register file they cannot be undone. "Associative lookup," a procedure by which entries in the re-order buffers are associated with entries in the register file, is discussed in M. Johnson, Superscalar Microprocessor Design at page 49 et seq.
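A minimal sketch of the FIFO behavior described above follows. It makes two assumptions that the text does not specify: entries are allocated in program order at issue time (a common arrangement), and an entry leaves the buffer as soon as it reaches the head and its result is available. The class and method names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Optional


class ReorderBuffer:
    """Toy FIFO re-order buffer holding results until they may be written back."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.entries: Deque[dict] = deque()

    def allocate(self, dest: str) -> dict:
        """Reserve an entry in program order (assumed to happen at issue)."""
        if len(self.entries) >= self.size:
            raise RuntimeError("re-order buffer full: issue must stall")
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry: dict, value: Optional[int]) -> None:
        """Record a result when execution finishes, in any order."""
        entry["value"] = value
        entry["done"] = True

    def retire(self, regfile: Dict[str, int]) -> int:
        """Write finished results at the head into the register file,
        strictly in program order. Returns the number of entries retired."""
        retired = 0
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            regfile[entry["dest"]] = entry["value"]
            retired += 1
        return retired

    def flush(self) -> None:
        """Discard all buffered results, e.g. on a mispredicted branch."""
        self.entries.clear()


regs = {"r1": 0, "r2": 0}
rob = ReorderBuffer(size=4)
e1 = rob.allocate("r1")
e2 = rob.allocate("r2")
rob.complete(e2, 20)       # the younger instruction finishes first
print(rob.retire(regs))    # prints "0": the head (r1) is not yet done
rob.complete(e1, 10)
print(rob.retire(regs))    # prints "2"
print(regs)                # prints "{'r1': 10, 'r2': 20}"
```

Because `flush` only discards buffered results, the register file never needs to be repaired in this model, which is the property the text attributes to re-order buffers. The sketch also shows the size limit: once `size` entries are outstanding, no further instruction can be issued.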
Re-order buffer schemes have at least three limitations. First, in conventional re-order buffer schemes, only the results of instruction execution are saved in the re-order buffer and the Program Counter (PC) values are not saved. Therefore, any branch mispredict recovery using re-order buffers requires the additional steps of PC reconstruction, instruction fetch, and instruction issue. As a result, conventional recovery from a branch mispredict using re-order buffers is delayed.
Second, re-order buffers generally only allow speculative execution of a limited mix of instructions. For example, traps detected during the instruction issue stage (issue traps) may not be speculatively executed using re-order buffers because they generally involve control register updates. Re-order buffer techniques do not support speculative execution of an instruction involving control register updates. The inability to speculatively enter issue traps in certain instruction set architectures (e.g. "spill/fill" traps in the Sun Microsystems SPARC architecture) can impose significant performance limitations.
Third, re-order buffer size is generally a direct linear function of the number of outstanding instructions allowed in the processor. For example, in a processor that allows sixty-four outstanding instructions, a re-order buffer having sixty-four entries would be required. Allocating the large number of buffer registers within the processor can be prohibitive, especially in dataflow processors where large active instruction windows, that is, a relatively large number of concurrently outstanding instructions, allow improved extraction of instruction-level parallelism. Dataflow processors are processors where the order of instruction execution is determined by the operand or data availability, rather than based on incrementing the program counter as in conventional non-dataflow processors.
Future files are a modification of the re-order buffer technique that avoids the associative lookup problem in the re-order buffer. Future files are described in M. Johnson's Superscalar Microprocessor Design at pages 94-95. History buffers are a method and structure proposed for implementing precise interrupts in a pipelined scalar processor with out-of-order completion and are described in M. Johnson's Superscalar Microprocessor Design at pages 91-92.
"The SPARC Architecture Manual", Version 9, by D. Weaver and T. Germond, Englewood Cliffs (1994), which is hersby explidtiy incorporated by reference in its entirety, describes a particular type of out-of-order speculative execution processor. The SPARC V9 architecture requires a Floating-Point State Register (FSR). The FSR contains 3 fields, the FSR.sub.-- accrued.sub.-- exception (FSR.aexc) field, the FSR.sub.-- current.sub.-- exception (FSR.cexc) field, and the FSR.sub.-- fioating.sub.-- point.sub.-- trap.sub.-- type (FSR.ftt) field, which are updated when a floating point exception occurs and are used by a trap handling routine for handling a trap caused by the floating point exception. The updating of these fields is difficult in an out-of-order execution processor because instructions execute and complete in a different order than program order. Since these fields need to be updated or appear to be updated as if instructions are issued and executed in program order, an apparatus and corresponding method is required to track these exceptions and correctly update the FSR register.
For data processors which can execute instructions speculatively, branch direction (taken or not-taken) or branch address (target address or address next to the branch instruction) can be predicted before it is resolved. Later, if these predictions turn out to be wrong, the processor backs up to the previous state and restarts executing instructions in the correct branch stream. However, most superscalar processors available in the market can evaluate only one branch per cycle to check whether or not branch predictions are correct, even though multiple predicted branches are often ready to evaluate in one cycle. Thus, branch evaluations which could otherwise be performed need to be delayed. Delaying branch evaluations significantly degrades processor performance.
Furthermore, in conventional speculative execution processors, when a trap occurs, the processor has to wait until all predicted branches have been resolved to make sure that the trap is real and not speculative. The easiest way for a processor to make sure that a trap is real is to synchronize the processor (i.e., execute and complete all instructions issued prior to the occurrence of the trap condition) before taking the trap. However, doing so for traps that occur frequently degrades the performance of the processor. This is especially true in the SPARC-V9 architecture where spill/fill issue traps and software traps (Tcc instructions) often occur. This problem needs to be resolved in order to increase processor performance.