The present invention pertains to techniques for implementing static speculative instructions and, more particularly, to an architecture for scheduling groups of speculative instructions to be issued and for enabling multiple-issue, static speculation.
Superscalar processors are uniprocessor organizations that are capable of increasing machine performance by executing multiple scalar instructions in parallel. Since the amount of instruction-level parallelism (ILP) within a basic block is small, superscalar processors must look across basic block boundaries to increase performance. Unfortunately, many of the branches in non-numerical code are data-dependent and cannot be resolved early. Thus, speculative execution (the execution of operations before previous branches have been resolved) is an important source of parallelism in this type of code.
Instruction-level parallelism can be extracted statically (at compile-time) or dynamically (at run-time). Statically-scheduled instruction level parallelism processors, such as Very Long Instruction Word (VLIW) machines, exploit instruction-level parallelism with a modest amount of hardware by exposing the machine's parallel architecture in the instruction set. For numerical applications, where branches can be determined early, compilers harness the parallelism across basic blocks by utilizing techniques such as software pipelining or trace scheduling. However, the overhead and complexity of speculative computation in compilers have prevented efficient parallelization of non-numerical code.
Dynamically-scheduled superscalar processors, on the other hand, effectively support speculative execution in hardware. By using simple buffers, these processors can efficiently commit or discard the side effects of speculative computation. Unfortunately, the necessary additional hardware to look far ahead in the dynamic instruction stream, find independent operations and schedule these independent operations out of order is costly and complex.
Recent studies have shown that, without some degree of speculation, multiple-issue processors rarely achieve speed-ups exceeding a factor of two over pipelined architectures. With speculation, speed-ups ranging from a factor of three to six are possible, and further speed-up is possible if speculation is performed simultaneously on multiple paths.
Dynamic speculation, as implemented by superscalar architectures, can achieve speed-ups of approximately a factor of three. However, because of the complexity of the dynamic scheduler (the dispatch mechanism), it is unlikely that performance will improve much beyond this number.
However, a Very Long Instruction Word (VLIW) computer implementing static, rather than dynamic, speculation could avoid this complexity by scheduling ahead of time, at compile-time. Such an architecture is capable of achieving speed-ups greater than a factor of three.
Moreover, although dynamic speculation down multiple paths requires a prohibitive amount of hardware, static speculation can exploit multi-path speculation without a significant amount of additional hardware.
An instruction is considered to be issued speculatively if it is not known, at the time of issue, whether the instruction should have been issued. Translated code running on a statically-scheduled architecture must report traps (e.g., divide by zero, overflow, etc.) exactly as the original code would if it were processed in order (non-speculatively). This definition becomes clearer when a transformation such as the one in Sequence A is considered.
______________________________________
Original non-speculative          Transformed speculative

L0:                               L0:
    branch ...                        I0*: r7 = r1 * r9
L1:                                   branch ...
    branch ...            ==>     L1:
L2:                                   branch ...
    I0: r7 = r1 * r9              L2:

Sequence A: Speculation of instructions
______________________________________
The instruction I0 should have been issued in basic block L2. However, it was speculatively issued in block L0. Issuing an instruction speculatively involves moving it past one or more branches. An instruction that was moved past a branch is said to have been speculated past the branch. The original position of the instruction in the sequential program is called its "origin point". The position at which the instruction is issued speculatively is its "issue point".
During execution, an instruction "fails" at a branch if it was speculated past it, and the branch takes a direction not leading to the origin point of the instruction. Otherwise it is said to have "succeeded" at the branch. An instruction succeeds if it succeeds at all branches (i.e., the program actually executes the basic block in which the origin point lies). The instruction fails if it fails at any branch. The instruction is "resolved" if it either succeeds or fails.
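The succeed/fail/resolved definitions can be captured in a few lines. The sketch below is illustrative only: the `Outcome` enum and the `classify` function are hypothetical names, and it assumes each branch an instruction was speculated past is summarized by the one direction (True = taken) that leads toward the instruction's origin point.

```python
from enum import Enum
from typing import Optional, Sequence


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    UNRESOLVED = "unresolved"


def classify(required: Sequence[bool], actual: Sequence[Optional[bool]]) -> Outcome:
    """Classify a speculated instruction (hypothetical helper).

    required[i]: direction of the i-th branch speculated past that leads
                 toward the origin point (True = taken).
    actual[i]:   direction that branch really took, or None if it has not
                 executed yet.
    """
    if len(required) != len(actual):
        raise ValueError("one actual direction is needed per branch")
    # Failing at any single branch means the origin block is never reached.
    if any(a is not None and a != r for r, a in zip(required, actual)):
        return Outcome.FAILED
    # Succeeding requires every branch to have gone toward the origin point.
    if all(a is not None for a in actual):
        return Outcome.SUCCEEDED
    return Outcome.UNRESOLVED
```

For I0 in Sequence A, assuming for illustration that the not-taken side of both branches leads to L2, `classify([False, False], [False, None])` is still unresolved after the first branch, and becomes failed if the second branch is taken.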
Consider the division of instruction execution into two stages: compute and register update. Instruction issue and the calculation of the results are performed during the compute stage. During the update stage, the user-visible state of the registers is modified to reflect the results, and any trap (i.e., error) that occurred is reported. The term "state" or "machine state" includes the program counter, the registers and the memory. In this context, the intent of speculation is to perform the compute stage early, but to delay the update stage until it is known that the instruction would have been executed (i.e., that the instruction succeeds and that no errors or traps occurred during its execution or that of some prior instruction).
A "speculative instruction", in mechanisms proposed to implement static speculation, performs only the compute stage. Such a speculative instruction computes the result; it does not report any trap that occurred during the calculation, but merely buffers it. Another instruction actually performs the update stage for the speculative instruction. This instruction may perform multiple updates for several speculative instructions simultaneously. It will report any previously buffered trap. If the mechanism buffers results, it will update the register (i.e., the user-visible state) with these results.
If a speculative instruction encounters a trap during its execution, it is said to have "trapped", even though the trap will not be reported until later, if at all. The point at which the update for a previously issued speculative instruction occurs is known as the "commit point"; the instruction which causes the update is known as its "commit instruction". The commit point is usually in the same basic block as the origin point. Thus, a speculative instruction is said to succeed if its commit instruction is executed.
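The split between a speculative instruction (compute stage only) and its commit instruction (update stage, possibly for several speculative instructions at once) can be modelled as below. This is a hypothetical sketch, not the mechanism of any particular architecture: `SpeculativeMachine`, `speculate`, `commit`, `discard` and `DeferredTrap` are invented names, and a trap is modelled as a Python `ArithmeticError` raised while computing the result.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


class DeferredTrap(Exception):
    """A trap buffered by a speculative instruction, reported at its commit point."""


@dataclass
class SpeculativeMachine:
    regs: Dict[str, int] = field(default_factory=dict)
    # Hypothetical buffer: tag -> (destination register, result, trap message)
    pending: Dict[str, Tuple[str, Optional[int], Optional[str]]] = field(default_factory=dict)

    def speculate(self, tag: str, dest: str, compute: Callable[[], int]) -> None:
        """Compute stage only: buffer the result or the trap; visible registers are untouched."""
        try:
            self.pending[tag] = (dest, compute(), None)
        except ArithmeticError as exc:  # e.g. ZeroDivisionError, OverflowError
            self.pending[tag] = (dest, None, str(exc))

    def commit(self, *tags: str) -> None:
        """Update stage for one or more speculative instructions, in the order given.

        Instructions listed before a trapped one are committed; the trap is then reported.
        """
        for tag in tags:
            dest, value, trap = self.pending.pop(tag)
            if trap is not None:
                raise DeferredTrap(f"{tag}: {trap}")
            self.regs[dest] = value

    def discard(self, *tags: str) -> None:
        """The instructions failed: drop their buffered results and traps silently."""
        for tag in tags:
            self.pending.pop(tag, None)
```

In Sequence A, `speculate("I0", "r7", ...)` would be executed at the issue point in L0 and `commit("I0")` at the commit point in L2; if control never reaches L2, `discard("I0")` drops the result and any buffered trap without reporting it.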
Merely reporting interrupts (i.e., traps) is not enough. The interrupt and the machine state at the time that it is reported must provide enough information for the user to determine the cause of the trap and, possibly, to resume execution after correcting the cause. In sequential processor architectures that have no static speculation, the commonly used model for reporting interrupts is the precise interrupt model. Precise interrupts facilitate debugging and restarting a program after an interrupt.
An interrupt is precise if, at the time it is reported to the user, the machine state reflects the following conditions:
a) the program counter points to the instruction which caused the trap; and
b) all instructions that preceded the trapping instruction in the program have executed without a trap and have correctly modified the state; and
c) all instructions succeeding the trapping instruction are unexecuted and have not modified the state.
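As an illustration, the three conditions can be written as a predicate. This is a simplified, hypothetical sketch: `is_precise` is an invented name, instructions are identified by their position in program order, and condition b) is approximated as "has modified the state" by tracking the set of instructions whose effects are visible when the interrupt is reported.

```python
from typing import AbstractSet


def is_precise(pc: int, trap_index: int, modified: AbstractSet[int]) -> bool:
    """Check conditions a) to c) for an interrupt reported with program counter `pc`.

    `modified` holds the program-order positions of the instructions whose
    effects are visible in the machine state when the interrupt is reported.
    """
    points_at_trap = pc == trap_index                                  # a)
    predecessors_done = all(i in modified for i in range(trap_index))  # b)
    successors_untouched = all(i <= trap_index for i in modified)      # c)
    return points_at_trap and predecessors_done and successors_untouched
```

A speculative instruction issued ahead of an earlier trapping instruction would appear in `modified` with a larger index, violating condition c); this is why traps and results of speculative instructions must be held back.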
An instruction that is issued speculatively may trap. This trap cannot be reported until it is known that the instruction would have executed. Thus, a scheme must delay reporting traps caused by speculative instructions, but should permit such traps to be determined and reported later. Typically, such a scheme involves a cooperative effort of hardware and software: hardware is added to enable delayed interrupt reporting, and the software must be written so as to ensure that the interrupts are reported correctly.
Detecting traps is not sufficient. In many cases, it may be important to modify the state after a trap and then resume execution, a technique called "restarting". In speculative architectures, speculative instructions that have been issued but not yet resolved (succeeded or failed) must then be re-executed using this modified state.
Two major schemes for static speculation are known as "boosting" and "poison-bit". The poison-bit scheme has been described in "Some Design Ideas for a VLIW Architecture for Sequential Natured Software", by Kemal Ebcioglu in Parallel Processing, pp. 3-21, April, 1988. A structured way of compiling for this class of architectures, called sentinel scheduling, is described in "Sentinel Scheduling for VLIW and Superscalar Processors", by Scott A. Mahlke et al., Fifth International Symposium on Architectural Support for Programming Languages and Operating Systems, 1992.
Briefly, each register has an extra bit, known as the poison bit. If a speculative instruction traps, then its destination register is poisoned (i.e., the poison bit is set). Should another speculative instruction read a poisoned register, then its destination register is also poisoned. When a non-speculative instruction reads a poisoned register, it signals a trap. Writing to a register clears the poison bit, if it had been set.
This scheme automatically delays reporting traps caused by speculative instructions. Reading the destination register of a speculative instruction can be used to determine whether that instruction trapped.
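The poison-bit rules above can be sketched as a small register-file model. The names `PoisonRegisterFile`, `execute` and `PoisonTrap` are hypothetical, and a trap is modelled as a Python `ArithmeticError` raised by the operation; the sketch only illustrates the four rules (poison on speculative trap, propagate on speculative read, trap on non-speculative read, clear on write).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Sequence


class PoisonTrap(Exception):
    """Signalled when a non-speculative instruction reads a poisoned register."""


@dataclass
class PoisonRegisterFile:
    values: Dict[str, int] = field(default_factory=dict)
    poisoned: Dict[str, bool] = field(default_factory=dict)

    def execute(self, dest: str, srcs: Sequence[str],
                op: Callable[..., int], speculative: bool) -> None:
        if any(self.poisoned.get(s, False) for s in srcs):
            if not speculative:
                # A non-speculative read of a poisoned register signals the trap.
                raise PoisonTrap(f"poisoned source read while computing {dest}")
            # A speculative read of a poisoned register propagates the poison.
            self.poisoned[dest] = True
            return
        try:
            result = op(*(self.values[s] for s in srcs))
        except ArithmeticError:
            if not speculative:
                raise
            # Speculative trap: poison the destination instead of reporting.
            self.poisoned[dest] = True
            return
        self.values[dest] = result
        self.poisoned[dest] = False  # writing a register clears its poison bit
```

A non-speculative instruction that reads the destination of a speculative one thus acts as the check: it traps exactly when the speculative computation, or one it depended on, trapped.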
Boosting was first introduced in "Boosting Beyond Static Scheduling in a Superscalar Processor", by Michael D. Smith et al., in Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 344-354, May, 1990.
An instruction that is to be executed speculatively is labelled with the number of branches it was moved past. For example, in Sequence B, instructions I0 and I1 are each moved past one branch, so each is labelled ".1". Instruction I2, which was moved past two branches, is labelled ".2".
______________________________________
Original non-speculative          Boosted speculative

L0:                               L0:
    branch ...                        I0*: r1.1 = r2 & r3
L1:                                   I2*: r7.2 = r8 * r9
    I0: r1 = r2 & r3      ==>         branch ...
    branch ...                    L1:
L2:                                   I1*: r4.1 = r1 + r6
    I1: r4 = r5 + r6                  branch ...
    I2: r7 = r1 * r9              L2:

Sequence B: Static speculation using boosting
______________________________________
Each speculative instruction is associated with a unique branch. For an instruction with label .N, this is the Nth branch following the speculative instruction. For example, I2 and I1 are both associated with the second branch. More precisely, each branch has a preferred side, either the taken side or the not-taken side, and instructions can be speculated only from the preferred side. A speculative instruction with label .N is associated with the Nth branch on the path traced by following the preferred side of the N-1 preceding branches on the path. For both branches shown in Sequence B, the preferred side is the not-taken side.
A trap caused by a speculative instruction is not reported immediately; instead, it is buffered until the branch associated with the instruction is executed. If the branch is not executed in the preferred direction, all of the buffered traps are discarded. If, however, the branch is resolved in the preferred direction and a speculative instruction associated with that branch had trapped, then a trap is reported. For example, if I2 trapped, the trap would not be reported until the second branch was executed; if that branch was not taken (i.e., it went in the preferred direction), a trap would be signalled.
Boosting postpones reporting a trap caused by a speculative instruction until the associated branch is executed. If that branch is executed in the preferred direction, a trap is automatically reported. A substantial amount of hardware is required to implement this scheme. This includes replicating the register file and possibly adding circuitry to ensure that a speculative instruction reads the correct values from a register.
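The trap-buffering behaviour of boosting can be sketched as follows. This is a hypothetical model of the reporting rule only (it ignores the shadow register files): `BoostingUnit`, `issue`, `branch` and `BoostedTrap` are invented names, each pending instruction carries its remaining ".N" count, and every branch either goes the preferred way (counts decrement; instructions reaching zero are committed or report their trap) or the other way (everything buffered is discarded).

```python
from dataclasses import dataclass, field
from typing import List, Optional


class BoostedTrap(Exception):
    """Reported when a trapped boosted instruction's branch resolves in the preferred direction."""


@dataclass
class BoostedInstr:
    name: str
    branches_left: int          # the ".N" label: branches still to be passed
    trap: Optional[str] = None  # buffered trap, if the instruction trapped


@dataclass
class BoostingUnit:
    pending: List[BoostedInstr] = field(default_factory=list)

    def issue(self, name: str, label: int, trap: Optional[str] = None) -> None:
        self.pending.append(BoostedInstr(name, label, trap))

    def branch(self, preferred: bool) -> List[str]:
        """Execute the next branch; return the instructions committed at it."""
        if not preferred:
            self.pending.clear()  # wrong direction: throw away all buffered state
            return []
        for instr in self.pending:
            instr.branches_left -= 1
        done = [i for i in self.pending if i.branches_left == 0]
        self.pending = [i for i in self.pending if i.branches_left > 0]
        for instr in done:
            if instr.trap is not None:
                raise BoostedTrap(f"{instr.name}: {instr.trap}")
        return [i.name for i in done]
```

Replaying Sequence B with a hypothetical trap in I2: I0 (.1) and I2 (.2) are issued in L0, the first branch commits I0, I1 (.1) is issued in L1, and the second branch, if not taken, reports I2's trap.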
U.S. Pat. No. 5,072,364, issued to Jardine et al. for "Method and Apparatus for Recovering from an Incorrect Branch Prediction in a Processor that Executes a Family of Instructions in Parallel", describes a method and apparatus for recovering from an incorrect branch prediction. Contiguous blocks of instructions, referred to as a family, are fetched. If no hardware resource or data conflicts exist, the instructions in a family are issued in parallel; otherwise, they are issued serially. When any instruction in a family traps, all of the instructions in the family are flushed and reissued sequentially. If the family included a branch instruction and some instructions from the predicted side of the branch, and the branch was mispredicted, the family is likewise flushed and reissued sequentially. The aforementioned patent deals with a minimal amount of dynamic speculation: it entails conditional execution of instructions and the flushing of their effects after a mispredicted branch.
U.S. Pat. No. 5,172,091, issued to Boufarah et al., for "System for Reducing Delay in Instruction Execution by Executing Branch Instructions in Separate Processor while Dispatching Subsequent Instructions to Primary Processor", describes a system for reducing delay in instruction execution. The system affects the instructions that are fetched and issued immediately after a branch is encountered. As long as the branch direction (i.e., taken/not-taken) is not known, the instructions immediately succeeding the branch (i.e., the not-taken side) are issued conditionally. As soon as the direction is known, instructions from that side begin to be issued. If the branch was taken, the conditionally issued instructions are flushed.
Both of these patents perform speculation in hardware, at run-time. Moreover, neither patent deals with out-of-order issue of instructions; the original program order of instructions is maintained.
In dynamic speculation, the greatest distance by which an instruction can be speculated ahead of a branch is less than the size of the pipeline plus the size of the speculation window. In static speculation, no such bound applies, so the results of statically speculated instructions must be buffered until they are resolved. This, of course, increases the complexity of the implementation.
It would be advantageous for a compiler to schedule instructions so as to guarantee that the number of registers that must be buffered does not exceed the amount of buffer storage available.
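One way a compiler might check this guarantee is a sweep over the lifetimes of buffered results. The sketch below is an assumption, not a method described here: it supposes that each speculative result occupies one buffer entry from its issue cycle up to (but not including) its commit cycle, and `peak_buffered` and `fits` are hypothetical helper names. The sweep itself is the standard interval-overlap count.

```python
from typing import Iterable, List, Tuple


def peak_buffered(intervals: Iterable[Tuple[int, int]]) -> int:
    """Peak number of speculative results held at once.

    Each interval is (issue_cycle, commit_cycle) for one speculative
    instruction; its result occupies a buffer entry in [issue, commit).
    """
    events: List[Tuple[int, int]] = []
    for issue, commit in intervals:
        if commit < issue:
            raise ValueError("commit point precedes issue point")
        events.append((issue, 1))    # entry allocated at the issue point
        events.append((commit, -1))  # entry released at the commit point
    # Sorting puts releases (-1) before allocations (+1) in the same cycle.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


def fits(intervals: Iterable[Tuple[int, int]], capacity: int) -> bool:
    """True if a candidate schedule never needs more than `capacity` entries."""
    return peak_buffered(intervals) <= capacity
```

A scheduler could run `fits` on each candidate schedule and, when it fails, delay a speculative instruction's issue point or move its commit point earlier until the peak drops within the available buffer storage.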
It would also be advantageous for hardware not to report traps of speculative instructions until they are checkpointed.
It would also be advantageous not to issue stores to memory until they are checkpointed.
It would further be advantageous for the registers, traps and stores to be flushed or issued en masse.
It would also be advantageous to suppress execution of further instructions from a group when a speculative instruction causes a trap.
It would also be advantageous to facilitate restarting of the program after correcting or modifying the state or after performing any other actions necessary to recover from the trap.