This invention relates to the architecture and operation of an improved processor in which trapping conditions are identified at a decode stage to expedite operations in a pipelined superscalar processor.
Conventional processor designs commonly involve the control of instructions in three stages--fetch, decode, execute. In the first stage, an instruction is fetched from memory at a location identified by a program counter which points to the latest fetched instruction, thereby allowing the next instruction to be fetched. Following the fetch, the instruction is checked for possible data dependencies and, if it passes the test, the instruction and its operands are then issued for execution. (Data dependencies are circumstances where an instruction cannot be executed because data for the instruction is not yet available.) The instructions issued can be identified by a virtual-issue program counter. Once an instruction is decoded and issued, it is sent to the execution stage, where it produces a result that is written into either a register file or a memory, thereby altering the state of the processor.
More recent advanced processors include an element called a register scoreboard, which checks whether the resources an instruction requires are available for it to execute. If so, the instruction is issued even before the instruction in the execution stage has finished, which can result in out-of-order execution. The register scoreboard records (locks) the resources that would be modified by the instruction at issue time. Any subsequent instruction that needs to access those resources cannot be issued until the instruction that locked them unlocks them by updating the resources and so notifying the processor.
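The locking behavior just described can be illustrated with a minimal sketch. The class name, method names, and register names below are hypothetical and chosen only for illustration; a real scoreboard is a hardware structure, and its exact issue rules vary between designs.

```python
class RegisterScoreboard:
    """Toy model of a register scoreboard (illustrative names, not from any real design)."""

    def __init__(self):
        self.locked = set()  # registers awaiting a result from an issued instruction

    def can_issue(self, sources, dest):
        # Assumed rule: an instruction may issue only if none of the
        # registers it reads or writes is currently locked.
        return not (self.locked & (set(sources) | {dest}))

    def issue(self, sources, dest):
        if not self.can_issue(sources, dest):
            return False
        self.locked.add(dest)  # lock the resource this instruction will modify
        return True

    def complete(self, dest):
        self.locked.discard(dest)  # unlock once the resource has been updated


sb = RegisterScoreboard()
first = sb.issue(sources=["r1", "r2"], dest="r3")  # issues and locks r3
second = sb.issue(sources=["r3"], dest="r4")       # held back: r3 is locked
sb.complete("r3")                                  # first instruction updates r3
third = sb.issue(sources=["r3"], dest="r4")        # can now issue
```

In this sketch the second instruction is held back only until the first one updates its destination register, which is the dependency check the scoreboard performs at issue time.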
These known processor designs operate with the disadvantage that any stop in the issue of instructions, typically due to resource dependency among instructions, will stop the instruction fetch. This stopping results in loss of performance because fewer instructions are issued for execution.
Further, computer systems define hardware trapping conditions for many types of instructions. A hardware trapping condition is one under which execution of the instruction cannot produce a meaningful result. Hardware trapping conditions typically include: illegal instructions, privileged instructions, protection violations, page faults, invalid results, floating-point instructions in the absence of an optional floating-point processor, and register set overflow.
Hardware trapping conditions are automatically tested by the computer system before and after every instruction execution. If a trapping condition is found, a trap is taken. Taking a trap consists of the steps of preventing the trapping instruction and all those that follow it from affecting the state of the machine, saving the address of the trapping instruction in a specified trap address register well-known to computer designers, setting the mode of the machine to Supervisor (privileged) mode, and switching to fetching instructions from a trap handler. The address of the trap handler is typically determined by concatenating a constant that depends on the type of trapping condition to the contents of a special state register, the Trap Base Register (TBR).
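The trap-taking steps and the formation of the handler address can be sketched as follows. The field widths (an 8-bit trap-type constant and 16-byte handler slots), the `MachineState` fields, and the function names are assumptions made for illustration only; the flushing of the trapping instruction and its successors is not modeled here.

```python
from dataclasses import dataclass

TRAP_TYPE_BITS = 8  # assumed width of the trap-type constant
SLOT_SHIFT = 4      # assumed: handler entries are 16 bytes apart


@dataclass
class MachineState:
    pc: int
    supervisor: bool = False
    trap_address: int = 0  # stands in for the trap address register


def trap_handler_address(tbr_base, trap_type):
    """Concatenate the trap-type constant to the base held in the TBR."""
    if not 0 <= trap_type < (1 << TRAP_TYPE_BITS):
        raise ValueError("trap type out of range")
    return (tbr_base << (TRAP_TYPE_BITS + SLOT_SHIFT)) | (trap_type << SLOT_SHIFT)


def take_trap(state, trapping_pc, tbr_base, trap_type):
    state.trap_address = trapping_pc  # save the address of the trapping instruction
    state.supervisor = True           # switch to Supervisor (privileged) mode
    state.pc = trap_handler_address(tbr_base, trap_type)  # fetch from the handler
    return state


state = take_trap(MachineState(pc=0x4000), trapping_pc=0x4000,
                  tbr_base=0x12345, trap_type=0x05)
```

With these assumed widths, a base of 0x12345 and a trap type of 0x05 yield the handler address 0x12345050.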
Another form of trap is the software trap. Many computer systems have special instructions, a side effect of which is to force a trap. These are used by a user mode program to request services of the operating system (a "system call"). The steps taken on detecting a software trap are the same as those just described for hardware traps, except that the address of the trap handler is a variable that is computed by the trap instruction itself, rather than a constant.
Almost all computer system implementations are pipelined. In a pipelined implementation, each instruction takes a minimum of five to seven clock cycles to execute completely. A typical sequence of steps is: fetch the instruction from an instruction cache; decode the instruction to determine its resource requirements; read the instruction's operand values from a register file; execute the instruction to produce a result, which takes from one to three clocks; and write back the result to the register file.
In order to keep the hardware maximally utilized, a new instruction enters the pipeline every clock cycle whenever possible. In this way, the instruction completion rate can be as high as one per clock cycle, even though each individual instruction requires several cycles to completely execute.
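The effect on completion rate can be shown with a short calculation. This is a sketch under idealized assumptions: every stage takes one clock, a new instruction enters on every clock, and there are no stalls. The function names and the example figures (100 instructions, a five-stage pipeline) are illustrative only.

```python
def pipelined_cycles(num_instructions, depth):
    """Clocks to complete a run of instructions in an ideal, stall-free pipeline."""
    if num_instructions == 0:
        return 0
    # The first instruction needs `depth` clocks; each later one
    # completes one clock after its predecessor.
    return depth + (num_instructions - 1)


def unpipelined_cycles(num_instructions, depth):
    """Clocks needed if each instruction must finish before the next starts."""
    return num_instructions * depth


n, depth = 100, 5
completion_rate = n / pipelined_cycles(n, depth)  # instructions per clock
```

Under these assumptions, 100 instructions complete in 104 clocks rather than 500, so the completion rate approaches one instruction per clock as the run grows longer.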
Some trapping conditions depend on the result of execution, and so cannot be tested until after the execution stage. For these instructions, the test is made during the write-back cycle, in time to prevent updating the register file with the invalid result. Whenever a trapping condition is detected, all instructions executed before the one that traps have been through the write-back stage, and so their results have updated the register file. All instructions that follow the trapping one are still in the pipeline. Taking a trap in a pipelined machine is a simple matter of flushing all instructions from the pipeline and forcing the fetching stage to switch to the trap handler address.
Taking a trap at the write-back stage is somewhat inefficient, because hardware effort has already been spent on the instructions following the trapping one, all of which are flushed. In a conventional processor the number of instructions flushed in taking a trap is approximately equal to the length of the pipeline. This penalty is so small that, for simplicity, all traps are handled in the same manner. An instruction that cannot be executed, e.g. because it is unrecognized or privileged, is simply marked as trapping at some earlier stage of the pipeline. This trapping mark accompanies the instruction through the pipeline until it is recognized and acted on at the write-back stage.
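The following toy simulation illustrates how a trapping mark travels with an instruction to the write-back stage and how many younger instructions are flushed when it arrives. It assumes a simple in-order pipeline of single-clock stages; the function name, the tuple layout, and the five-stage depth are hypothetical.

```python
from collections import deque


def run_until_trap(program, depth=5):
    """Simulate an in-order pipeline; `program` is a list of (name, marked_trapping).

    Returns (committed, trapped, flushed) at the point where the first marked
    instruction reaches write-back, or (committed, None, []) if none traps.
    """
    pipeline = deque([None] * depth, maxlen=depth)  # index 0 = fetch, -1 = write-back
    pending = iter(program)
    committed = []
    for _ in range(len(program) + depth):
        pipeline.appendleft(next(pending, None))  # every stage advances one step
        at_writeback = pipeline[-1]
        if at_writeback is None:
            continue
        name, marked = at_writeback
        if marked:
            # The mark is acted on only here: flush every younger instruction.
            younger = list(pipeline)[:-1]
            flushed = [i[0] for i in reversed(younger) if i is not None]
            return committed, name, flushed
        committed.append(name)  # result updates the register file
    return committed, None, []


program = [(f"i{k}", k == 2) for k in range(10)]  # i2 is marked as trapping
committed, trapped, flushed = run_until_trap(program)
```

In this run the two instructions ahead of the trapping one commit, and the four instructions behind it in the five-stage pipeline are flushed, consistent with a penalty roughly equal to the pipeline length.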
In a superscalar out-of-order processor, however, the inefficiency due to taking a trap at the write-back stage is significantly greater. Simply stated, such a processor differs from a conventional pipeline by the insertion of a buffer (the "instruction shelf") after the decode stage and another buffer (the "result shelf") after the execution stage. Each buffer can hold several instructions, each of which is held in the buffer for an indefinite length of time waiting for the right combination of conditions to let it move to the next stage.
In a superscalar processor such as that disclosed in U.S. patent application Ser. No. 07/622,893, filed Dec. 5, 1990 for "Processor Architecture Having Decoupled Fetching, Issuing and Updating Operations and Speculative Branching," up to 64 instructions can be held in these buffers between the first (fetch) and last (write-back) stage of the pipeline. Therefore, detecting and acting on a trap at the write-back stage can result in having wasted dozens of clocks of execution bandwidth on instructions that are flushed. Certain types of traps cannot be detected until after the execution stage. For these, the wasted effort is inevitable. But many types of traps can be detected at the fetch or decode stages.
In prior-art out-of-order processors, traps detected at the fetch or decode stages are handled just as in traditional pipelined machines. The instruction is marked as trapping and then launched into the machine like any other. It sits on the instruction shelf waiting for a free execution unit so that it can be moved to the result shelf, where it must wait until all instructions fetched earlier have gone through the write-back stage. During all this time, the fetch and decode stages are continuing to launch instructions that follow the trapping one, even though the decode stage knows that all those following instructions will be flushed.
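A rough estimate of this wasted effort can be sketched as follows. The model is a deliberate simplification and an assumption, not a description of any particular processor: it supposes that instructions are launched at a fixed rate behind the marked instruction until either the trap is taken or the buffers are full. The function and parameter names are hypothetical.

```python
def wasted_launches(wait_clocks, launch_width, free_slots):
    """Instructions launched behind a trap-marked instruction before the trap is taken.

    wait_clocks:  clocks the marked instruction waits before reaching write-back
    launch_width: instructions fetched and decoded per clock (assumed constant)
    free_slots:   buffer entries still free behind the marked instruction
    """
    return min(wait_clocks * launch_width, free_slots)


# Buffers holding up to 64 instructions, one of which is the marked instruction.
moderate_wait = wasted_launches(wait_clocks=20, launch_width=2, free_slots=63)
long_wait = wasted_launches(wait_clocks=50, launch_width=4, free_slots=63)
no_wait = wasted_launches(wait_clocks=0, launch_width=2, free_slots=63)
```

Under this model, the waste grows with the time the marked instruction spends on the shelves and is bounded only by the buffer capacity; if the trap were acted on with no wait, nothing would be launched behind it.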
Many of the types of hardware traps that are detected at the decode stage are of the "shouldn't happen" variety: illegal instructions, privilege or protection violations. Other trap types, like page faults, that are detected at the fetch stage entail such a large overhead in the trap handler that even a few dozen wasted instructions are insignificant. But there is a class of traps--software traps and register set overflows--that are not programming errors, are expected to happen, and entail a low overhead in the trap handler. The cumulative effort wasted while waiting to take traps of this class at the write-back stage in a superscalar out-of-order processor can be significant.