Computer Architecture
Pipeline processors decompose the execution of instructions into multiple successive stages, such as fetch, decode, and execute. Each stage of execution is designed to perform its work within the processor's basic machine cycle. Hardware is dedicated to performing the work defined by each stage. As the number of stages is increased, while keeping the work done by the instruction constant, the processor is said to be more heavily pipelined. Each instruction progresses from stage to stage, ideally with another instruction progressing in lockstep only one stage behind. Thus, there can be as many instructions in execution, as there are pipeline stages.
The major attribute of a pipelined processor is that a throughput of one instruction per cycle can be obtained, though when viewed in isolation, each instruction requires as many cycles to perform as there are pipeline stages. To obtain a through put in excess of one instruction per cycle, multiple instructions may be issued and executed per cycle. The adjective "superscalar" is commonly applied to a non-vector processor having such attributes. Superscalar processors require a high-performance memory interface and multiple execution units.
The ability to increase throughput via pipelining is limited by situations called pipeline hazards. Hazards may be caused due to resource or data dependencies that arise due to the overlapping stages of instruction processing inherent in the pipeline technique. When a resource or data hazard occurs, the inter-stage advance of instructions must be stalled until the hazard is no longer present. Otherwise, improper operation would result. To prevent such incorrect behavior, "interlock" logic is added to detect any hazards and invoke a pipeline stall. While the pipeline is stalled, there are stages in the pipeline that are not doing any useful work. Since this absence of work propagates from stage to stage, the term pipeline bubble is also used to describe this condition. The throughput of the processor suffers whenever such bubbles occur. Hazards may also be caused due to unanticipated deviations from sequential control flow. Such control hazards are discussed infra.
Pipelining and superscalar issue and execution are viewed as architectural techniques for improving performance over what can be achieved via process or circuit design improvements. Pipelining was extensively examined in "The Architecture of Pipelined Computers," by Peter M. Kogge (McGraw-Hill, 1981). J. L. Hennessy and D. A. Patterson provide a contemporary discussion of pipelining, including superscalar approaches, in chapter 6 of "Computer Architecture, A Quantitative Approach" (Morgan Kaufinann, 1990). Recent superscalar pipelined machines include: the Intel 960 series, the Tandem Cyclone, the HP PA-RISC 7100, the IBM RSC, the Motorola 88110, the IBM RS/6000, the Cypress hyperSPARC (Pinnacle), the TI/Sun SuperSPARC (Viking), the DEC Alpha 21064, the Apple/IBM/Motorola PowerPC 601, the Intel Pentium Microprocessor, the SGI/MTI TFP, and the Apple/IBM/Motorola PowerPC 603.
Control hazards, associated with changes in control flow, were mentioned supra as limiting increased pipeline throughput. Programs may experience changes in control flow as frequently as one out of every three executed instructions. Taken branch instructions are a principal cause of changes in control flow. Taken branches include both conditional branches that are ultimately decided as taken and unconditional branches. Taken branches are not recognized as such until the later stages of the pipeline. If the change in control flow were not anticipated, there would be instructions already in the earlier pipeline stages, which due to the change in control flow, would not be the correct instructions to execute. These undesired instructions must be cleared from each stage. In keeping with the pipeline metaphor, the instructions are said to be flushed from the pipeline. Alternatively, all instruction processing following the branch could be stalled subsequent to recognizing the branch until its direction is resolved.
The instructions to be first executed where control flow resumes following a taken branch are termed the (branch) target instructions. The first of the target instructions is at the (branch) target address. If the target instructions are not introduced into the pipeline until after the taken branch is recognized as such and the target address is calculated, a pipeline bubble will result.
A variety of branch prediction techniques exist for predicting the direction of control flow associated with branches. Branch prediction is intended to reduce the occurrence of pipeline bubbles by anticipating taken branches. If a branch is predicted not-taken, the pipeline continues as usual for sequential control flow. If the branch is predicted taken, fetching is performed from the target address instead of the next sequential fetch address. By using branch prediction, many changes in control flow are anticipated, such that the target instructions of taken branches contiguously follow such branches in the pipeline. When anticipated correctly, changes in control flow due to taken branches do not cause pipeline bubbles and the associated reduction in processor throughput. Such bubbles occur, only when branches are mispredicted.
Recent works devoted to branch prediction include 1) "Branch Strategy Taxonomy and Performance Models," by Harvey G. Cragon (IEEE Computer Society Press, 1992), 2) "Branch Target Buffer Design and Optimization," by C. H. Perleberg and A. J. Smith, IEEE Transactions on Computers, Vol. 42, April 1993, pg. 396-412, and 3) "Survey of Branch Prediction Strategies," by C. O. Stjerifeldt, E. W. Czeck, and D. R. Kaeli (Northeastern University technical report CE-TR-93-05, Jul. 28, 1993).
Conventionally, instructions fetched from the predicted direction (either taken or not-taken) of a branch are not allowed to modify the state of the machine unit the branch direction is resolved. Operations normally may only go on until time to write the results in a way that modifies the programmer visible state of the machine. If the branch is actually mispredicted, then the processor can flush the pipeline and begin anew in the correct direction, without any trace of having predicted the branch incorrectly. Further instruction issue must be suspended until the branch direction is resolved. A pipeline interlock may be required to handle this control dependency. Thus, waiting for resolution of the actual branch direction is potentially another source of pipeline bubbles.
It is possible to perform speculative out-of-order execution past predicted branches or past other instructions stalled due to resource or data dependencies. This is done by providing additional state for reverting back to an earlier version of the machine state when required. Reversion to an earlier state is required upon determination that a branch was mispredicted or due to a desire to precisely resolve the occurrence of an interrupt with respect to the instruction stream. Speculative execution beyond an unresolved branch can be done whether the branch is predicted taken or not-taken. An unresolved branch is a branch whose true taken or not-taken status has yet to be decided. Such branches are also known as outstanding branches.
Speculative execution and out-of-order execution are closely related, and the terms are sometimes used interchangeably without distinction. Nevertheless, the two concepts are distinct. Out-of-order execution is the execution (and implied completion) of an instruction stream in other than strict sequential order. Out-of-order order execution is a form of "dynamic instruction scheduling" for circumventing pipeline stalls (bubbles). Speculative execution requires that the execution results be kept tentative until it is completely safe to permanently update the state of the processor. Speculative execution is always associated with either a history RAM, a "future" RAM, "relabeled" registers, or some similar arrangement. It is possible to perform carefully limited out-of-order execution that is not speculative. However, unrestricted out-of-order execution must be done speculatively, if a precise interrupt model is defined for the architecture. Out-of-order execution past unresolved branches must also be done speculatively, as improper operation would otherwise result on mispredicted branches.
Out-of-order execution is distinct from out-of-order issue, which is the issue (but not completion) of instructions in other than strict sequential order. It is possible to do in-order issue and out-of-order execution, and vice versa.
Speculative execution is also distinct from speculative issue. Speculative execution implies instruction completion and requires some means of tentatively storing the execution results. Speculative issue permits stalls related to control transfers and precise interrupts to be postponed until a latter pipeline stage than normally would be possible. As a result of the added delay, the hazard may be removed in time to avoid the stall. When a processor performs speculative issue past a branch, it may actually begin execution, but it doesn't execute to completion until after the associated predicted branch is resolved. This is because there is no means to back up the machine state should the branch be mispredicted. If the branch resolution occurs prior to the cycle in which the execution results for a speculatively issued instruction are scheduled to be written, the "execution" is no longer speculative. If the branch was correctly predicted, the result writing proceeds normally. If the branch was mispredicted, the pipeline is reset, "throwing away" the moot results. If the branch is not resolved in time, the pipeline must be stalled, because there is no means to restore the correct machine state should the branch be mispredicted. In a precise interrupt architecture, out-of-order speculatively issued instructions may be stalled from writing their results until it is determined that they may "safety" do so. That is, the results are written only when there is no possibility for an "intervening" interrupt. While many of the earlier mentioned superscalar pipelined processors perform speculative issue, it is believed that only the Motorola 88110 and the PowerPC 603 perform speculative execution to any extent.
The principles of out-of-order execution are well known in the art. As background, out-of-order execution in the IBM System/360 Model 91 was discussed in section 6.6.2 of Kogge. The January 1967 issue of the IBM Joumnal of Research and Development was devoted to the Model 91. More recently, the IBM Enterprise System/9000 520-based models performed speculative execution. J. L. Hennessy and D. A. Patterson provide an overview of out-of-order execution in chapter 6.
U.S. Pat. No. 5,226,126, ('126) PROCESSOR HAVING PLURALITY OF FUNCTIONAL UNITS FOR ORDERLY RETIRING OUTSTANDING OPERATIONS BASED UPON ITS ASSOCIATED TAGS, to McFarland et al., issued Jul. 6, 1993, which is assigned to the assignee of the present invention, described speculative out-of-order execution in the system in which the instant invention is used, and is hereby incorporated by reference. The detailed description that follows will presume some degree of familiarity with '126.
U.S. Pat. No. 4,858,105 ('105) PIPELINED DATA PROCESSOR CAPABLE OF DECODING AND EXECUTING PLURAL INSTRUCTIONS IN PARALLEL, to Kuriyama et al., issued Aug. 15, 1989, teaches the optional execution of two instructions in parallel, including advancement of the instruction pointer. The pointer is advanced by a first instruction length, if only one instruction is executed, or is advanced by the sum of said first instruction length and a second instruction length, if two instructions are executed. However, '105 does not teach advancement of the instruction pointer in the context of speculative execution. As a result only one value for the next instruction pointer is produced, corresponding to executing either one or both instructions.
U.S. Pat. No. 5,204,953 ('953) ONE CLOCK ADDRESS PIPELINING IN SEGMENTATION UNIT, to Dixit, issued Apr. 20, 1993, discloses pipelined single-clock address generation for segment limit checking in an architecture compatible with that of the instant invention. Updating of the instruction pointer is not disclosed. Details of the segment limit check logic are not disclosed.