COMPUTER ARCHITECTURE
Pipelined processors decompose the execution of instructions into multiple successive stages, such as fetch, decode, and execute. Each stage of execution is designed to perform its work within the processor's basic machine cycle. Hardware is dedicated to performing the work defined by each stage. As the number of stages is increased, while keeping the work done by the instruction constant, the processor is said to be more heavily pipelined. Each instruction progresses from stage to stage, ideally with another instruction progressing in lockstep only one stage behind. Thus, there can be as many instructions in execution as there are pipeline stages.
The major attribute of a pipelined processor is that a throughput of one instruction per cycle can be obtained, though when viewed in isolation, each instruction requires as many cycles to perform as there are pipeline stages. To obtain a throughput in excess of one instruction per cycle, multiple instructions may be issued and executed per cycle. The adjective "superscalar" is commonly applied to a non-vector processor having such attributes. Superscalar processors require a high-performance memory interface and multiple execution units.
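By way of illustration only, the relationship between per-instruction latency and overall throughput described above can be sketched as follows. The function names, and the simplifying assumption of an ideal hazard-free pipeline in which a full group of instructions issues every cycle, are hypothetical and are not drawn from any particular machine.

```python
def pipeline_cycles(num_instructions: int, num_stages: int, issue_width: int = 1) -> int:
    """Cycles needed to complete a hazard-free instruction stream.

    Assumes an idealized pipeline: one group of `issue_width` instructions
    enters the first stage every cycle and no stalls ever occur.
    """
    if num_instructions <= 0:
        return 0
    # Number of issue groups, rounded up.
    groups = -(-num_instructions // issue_width)
    # The first group needs num_stages cycles; each later group
    # completes one cycle after the group ahead of it.
    return num_stages + groups - 1


def throughput(num_instructions: int, num_stages: int, issue_width: int = 1) -> float:
    """Average instructions completed per cycle under the same assumptions."""
    cycles = pipeline_cycles(num_instructions, num_stages, issue_width)
    return num_instructions / cycles if cycles else 0.0
```

Under these assumptions a single instruction in a five-stage pipeline takes five cycles, while a long stream approaches one instruction per cycle with `issue_width=1` and exceeds it only when `issue_width` is greater than one.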
The ability to increase throughput via pipelining is limited by situations called pipeline hazards. Hazards may be caused by resource or data dependencies that arise from the overlapping stages of instruction processing inherent in the pipeline technique. When a resource or data hazard occurs, the inter-stage advance of instructions must be stalled until the hazard is no longer present. Otherwise, improper operation would result. To prevent such incorrect behavior, "interlock" logic is added to detect any hazards and invoke a pipeline stall. While the pipeline is stalled, there are stages in the pipeline that are not doing any useful work. Since this absence of work propagates from stage to stage, the term pipeline bubble is also used to describe this condition. The throughput of the processor suffers whenever such bubbles occur. Hazards may also be caused by unanticipated deviations from sequential control flow. Such control hazards are discussed infra.
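The effect of interlock logic on a data hazard can be sketched with a small model. The model below is an assumption made for illustration: a single-issue, in-order pipeline without result forwarding, in which a register written by an instruction becomes readable `result_latency` cycles after that instruction issues. The names `issue_schedule`, `bubble_count`, and `result_latency` are hypothetical.

```python
def issue_schedule(program, result_latency=3):
    """Return the issue cycle of each instruction under a simple interlock.

    `program` is a list of (dest, sources) tuples, where `dest` is a
    register name or None and `sources` is a tuple of register names.
    An instruction is held (stalled) until every source is ready.
    """
    ready_at = {}          # register name -> first cycle its value is readable
    next_free_cycle = 0    # earliest cycle the next instruction could issue
    schedule = []
    for dest, sources in program:
        issue = max([next_free_cycle] + [ready_at.get(s, 0) for s in sources])
        schedule.append(issue)
        if dest is not None:
            ready_at[dest] = issue + result_latency
        next_free_cycle = issue + 1   # in-order, one instruction per cycle
    return schedule


def bubble_count(schedule):
    """Number of issue slots left empty by stalls."""
    if not schedule:
        return 0
    return schedule[-1] - (len(schedule) - 1)
```

For example, with the default latency a program in which the second instruction reads the register written by the first issues at cycles 0, 3, and 4, leaving two bubbles, whereas three independent instructions issue at cycles 0, 1, and 2 with none.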
Pipelining and superscalar issue and execution are viewed as architectural techniques for improving performance over what can be achieved via process or circuit design improvements. Pipelining was extensively examined in "The Architecture of Pipelined Computers," by Peter M. Kogge (McGraw-Hill, 1981). J. L. Hennessy and D. A. Patterson provide a contemporary discussion of pipelining, including superscalar approaches, in chapter 6 of "Computer Architecture, A Quantitative Approach" (Morgan Kaufmann, 1990). Recent superscalar pipelined machines include: the Intel 960 series, the Tandem Cyclone, the HP PA-RISC 7100, the IBM RSC, the Motorola 88110, the IBM RS/6000, the Cypress hyperSPARC (Pinnacle), the TI/Sun SuperSPARC (Viking), the DEC Alpha 21064, the Apple/IBM/Motorola PowerPC 601, the Intel Pentium Microprocessor, the SGI/MTI TFP, and the Apple/IBM/Motorola PowerPC 603.
Control hazards, associated with changes in control flow, were mentioned supra as limiting increased pipeline throughput. Programs may experience changes in control flow as frequently as one out of every three executed instructions. Taken branch instructions are a principal cause of changes in control flow. Taken branches include both conditional branches that are ultimately decided as taken and unconditional branches. Taken branches are not recognized as such until the later stages of the pipeline. If the change in control flow were not anticipated, there would be instructions already in the earlier pipeline stages, which due to the change in control flow, would not be the correct instructions to execute. These undesired instructions must be cleared from each stage. In keeping with the pipeline metaphor, the instructions are said to be flushed from the pipeline. Alternatively, all instruction processing following the branch could be stalled subsequent to recognizing the branch until its direction is resolved.
The instructions to be first executed where control flow resumes following a taken branch are termed the (branch) target instructions. The first of the target instructions is at the (branch) target address. If the target instructions are not introduced into the pipeline until after the taken branch is recognized as such and the target address is calculated, a pipeline bubble will result.
A variety of branch prediction techniques exist for predicting the direction of control flow associated with branches. Branch prediction is intended to reduce the occurrence of pipeline bubbles by anticipating taken branches. If a branch is predicted not-taken, the pipeline continues as usual for sequential control flow. If the branch is predicted taken, fetching is performed from the target address instead of the next sequential fetch address. By using branch prediction, many changes in control flow are anticipated, such that the target instructions of taken branches contiguously follow such branches in the pipeline. When anticipated correctly, changes in control flow due to taken branches do not cause pipeline bubbles and the associated reduction in processor throughput. Such bubbles occur only when branches are mispredicted.
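As one illustration of how mispredictions translate into bubbles, the sketch below uses a two-bit saturating counter, a widely known prediction scheme. It is chosen here only as an example; the text above does not identify any particular technique. The names `count_mispredictions`, `bubble_cycles`, and `flush_penalty` are hypothetical, and a fixed per-misprediction penalty is assumed.

```python
def count_mispredictions(outcomes, initial_state=2):
    """Count mispredictions of a single two-bit saturating counter.

    `outcomes` is an iterable of booleans (True means the branch was taken).
    Counter states 0 and 1 predict not-taken; states 2 and 3 predict taken.
    """
    state = initial_state
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredictions += 1
        # Move toward 3 on a taken branch and toward 0 otherwise.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return mispredictions


def bubble_cycles(mispredictions, flush_penalty):
    """Cycles lost to flushes, assuming a fixed penalty per misprediction."""
    return mispredictions * flush_penalty
```

A loop-closing branch that is taken nine times and then falls through is mispredicted only once under this scheme, so only the final iteration incurs a flush.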
Recent works devoted to branch prediction include 1) "Branch Strategy Taxonomy and Performance Models," by Harvey G. Cragon (IEEE Computer Society Press, 1992), 2) "Branch Target Buffer Design and Optimization," by C. H. Perleberg and A. J. Smith, IEEE Transactions on Computers, Vol. 42, April 1993, pg. 396-412, and 3) "Survey of Branch Prediction Strategies," by C. O. Stjernfeldt, E. W. Czeck, and D. R. Kaeli (Northeastern University technical report CE-TR-93-05, Jul. 28, 1993).
Conventionally, instructions fetched from the predicted direction (either taken or not-taken) of a branch are not allowed to modify the state of the machine until the branch direction is resolved. Operations may normally proceed only up to the point at which their results would be written in a way that modifies the programmer-visible state of the machine. If the branch is actually mispredicted, then the processor can flush the pipeline and begin anew in the correct direction, without any trace of having predicted the branch incorrectly. Further instruction issue must be suspended until the branch direction is resolved. A pipeline interlock may be required to handle this control dependency. Thus, waiting for resolution of the actual branch direction is potentially another source of pipeline bubbles.
It is possible to perform speculative out-of-order execution past predicted branches or past other instructions stalled due to resource or data dependencies. This is done by providing additional state for reverting back to an earlier version of the machine state when required. Reversion to an earlier state is required upon determination that a branch was mispredicted or due to a desire to precisely resolve the occurrence of an interrupt with respect to the instruction stream. Speculative execution beyond an unresolved branch can be done whether the branch is predicted taken or not-taken. An unresolved branch is a branch whose true taken or not-taken status has yet to be decided. Such branches are also known as outstanding branches.
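The idea of keeping additional state so that the machine can revert to an earlier version can be sketched with a generic checkpointing register file. This is an assumption made purely for illustration and is not the mechanism of any processor or reference discussed here. The class and method names are hypothetical, and branches are assumed to resolve oldest first.

```python
class SpeculativeRegisterFile:
    """Register file that snapshots its contents at each unresolved branch."""

    def __init__(self, registers):
        self.registers = dict(registers)
        self._checkpoints = []   # oldest outstanding branch first

    def checkpoint(self):
        """Save the current state when a branch is predicted."""
        self._checkpoints.append(dict(self.registers))

    def write(self, name, value):
        """Speculatively update a register."""
        self.registers[name] = value

    def resolve_oldest(self, predicted_correctly):
        """Resolve the oldest outstanding branch."""
        saved = self._checkpoints.pop(0)
        if not predicted_correctly:
            # Revert to the state at the branch and discard all younger
            # checkpoints, since they were taken on the wrong path.
            self.registers = saved
            self._checkpoints.clear()

    def outstanding_branches(self):
        return len(self._checkpoints)
```

Writes made after `checkpoint()` survive when the branch resolves as predicted and vanish when it does not, which is the behavior the reversion state is meant to provide.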
Speculative execution and out-of-order execution are closely related, and the terms are sometimes used interchangeably without distinction. Nevertheless, the two concepts are distinct. Out-of-order execution is the execution (and implied completion) of an instruction stream in other than strict sequential order. Out-of-order execution is a form of "dynamic instruction scheduling" for circumventing pipeline stalls (bubbles). Speculative execution requires that the execution results be kept tentative until it is completely safe to permanently update the state of the processor. Speculative execution is always associated with either a history RAM, a "future" RAM, "relabeled" registers, or some similar arrangement. It is possible to perform carefully limited out-of-order execution that is not speculative. However, unrestricted out-of-order execution must be done speculatively, if a precise interrupt model is defined for the architecture. Out-of-order execution past unresolved branches must also be done speculatively, as improper operation would otherwise result on mispredicted branches.
Out-of-order execution is distinct from out-of-order issue, which is the issue (but not completion) of instructions in other than strict sequential order. It is possible to do in-order issue and out-of-order execution, and vice versa.
Speculative execution is also distinct from speculative issue. Speculative execution implies instruction completion and requires some means of tentatively storing the execution results. Speculative issue permits stalls related to control transfers and precise interrupts to be postponed until a later pipeline stage than normally would be possible. As a result of the added delay, the hazard may be removed in time to avoid the stall. When a processor performs speculative issue past a branch, it may actually begin execution, but it does not execute to completion until after the associated predicted branch is resolved. This is because there is no means to back up the machine state should the branch be mispredicted. If the branch resolution occurs prior to the cycle in which the execution results for a speculatively issued instruction are scheduled to be written, the "execution" is no longer speculative. If the branch was correctly predicted, the result writing proceeds normally. If the branch was mispredicted, the pipeline is reset, "throwing away" the moot results. If the branch is not resolved in time, the pipeline must be stalled, because there is no means to restore the correct machine state should the branch be mispredicted. In a precise interrupt architecture, out-of-order speculatively issued instructions may be stalled from writing their results until it is determined that they may "safely" do so. That is, the results are written only when there is no possibility for an "intervening" interrupt. While many of the earlier mentioned superscalar pipelined processors perform speculative issue, it is believed that only the Motorola 88110 and the PowerPC 603 perform speculative execution to any extent.
The principles of out-of-order execution are well known in the art. As background, out-of-order execution in the IBM System/360 Model 91 was discussed in section 6.6.2 of Kogge. The January 1967 issue of the IBM Journal of Research and Development was devoted to the Model 91. More recently, the IBM Enterprise System/9000 520-based models performed speculative execution. J. L. Hennessy and D. A. Patterson provide an overview of out-of-order execution in chapter 6.
U.S. Pat. No. 5,226,126, ('126) PROCESSOR HAVING PLURALITY OF FUNCTIONAL UNITS FOR ORDERLY RETIRING OUTSTANDING OPERATIONS BASED UPON ITS ASSOCIATED TAGS, to McFarland et al., issued Jul. 6, 1993, which is assigned to the assignee of the present invention, described speculative out-of-order execution in the system in which the instant invention is used, and is hereby incorporated by reference. The detailed description that follows will presume some degree of familiarity with '126.
U.S. Pat. No. 4,858,105 ('105) PIPELINED DATA PROCESSOR CAPABLE OF DECODING AND EXECUTING PLURAL INSTRUCTIONS IN PARALLEL, to Kuriyama et al., issued Aug. 15, 1989, teaches the optional execution of two instructions in parallel, including advancement of the instruction pointer. The pointer is advanced by a first instruction length, if only one instruction is executed, or is advanced by the sum of said first instruction length and a second instruction length, if two instructions are executed. However, '105 does not teach advancement of the instruction pointer in the context of speculative execution. As a result only one value for the next instruction pointer is produced, corresponding to executing either one or both instructions.
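The pointer advancement described above reduces to a simple selection, sketched below. The function and parameter names are hypothetical and are not taken from '105; the sketch shows only that a single next-pointer value results from the number of instructions executed.

```python
def next_instruction_pointer(ip, first_length, second_length, both_executed):
    """Advance the instruction pointer past one or two instructions.

    If only the first instruction executes, advance by its length.
    If both execute, advance by the sum of the two lengths.
    """
    advance = first_length + (second_length if both_executed else 0)
    return ip + advance
```

Only one next-pointer value is produced per call, which mirrors the observation that no alternative value is retained for use in a speculative context.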
U.S. Pat. No. 5,204,953 ('953) ONE CLOCK ADDRESS PIPELINING IN SEGMENTATION UNIT, to Dixit, issued Apr. 20, 1993, discloses pipelined single-clock address generation for segment limit checking in an architecture compatible with that of the instant invention. Updating of the instruction pointer is not disclosed. Details of the segment limit check logic are not disclosed.
COMPUTER ARITHMETIC
Gerrit A. Blaauw describes carry-save adders (CSAs) in section 2-12 of "Digital System Implementation" (Prentice-Hall, 1976). Blaauw indicates that the CSA was mentioned by Babbage in 1837, by von Neumann in 1947, and used in 1950 in M.I.T.'s Whirlwind computer. J. L. Hennessy and D. A. Patterson discuss carry-save adders on pages A-42 and A-43.
In "A Suggestion for a Fast Multiplier" (IEEE Transactions on Electronic Computers EC-13:14-17, 1964), C. S. Wallace indicates that "an expedient now quite commonly used" is to add three numbers using a CSA. If a set of more than three numbers is to be added, three of the set are first added using the CSA and the carry and sum are captured. The captured carry and sum are routed back to two of the three inputs, and another number from the set is input to the third input. (Whenever the carry-outs generated by a CSA are subsequently added in another adder, an implicit one-bit left shift of the carry-bits is implemented via the wiring between the adders.) The process is repeated until all of the numbers in the set have been added. Finally, the sum and carry are added in a "conventional" carry-propagate adder (CPA). In "Computer Arithmetic: Principles, Architecture, and Design" (John Wiley & Sons, 1979, pp. 98-100), K. Hwang describes this same technique in greater detail. In particular, see FIG. 4.2. For a dedicated three-input adder, the CSA's carry and sum need not be captured, and can instead be routed directly into the CPA.
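The iterative use of a single CSA followed by one CPA can be modeled in software as follows. This is a behavioral sketch only: the function names and the `width` parameter are hypothetical, integers stand in for bit vectors, and results are taken modulo 2 to the power `width`. The left shift applied to the carry bits models the implicit one-bit shift that the wiring provides in hardware.

```python
def carry_save_add(a, b, c, width=32):
    """Add three operands without propagating carries.

    Returns (sum_bits, carry_bits) such that
    sum_bits + carry_bits == a + b + c (modulo 2**width).
    """
    mask = (1 << width) - 1
    sum_bits = (a ^ b ^ c) & mask
    # Majority function per bit position, shifted left by one place.
    carry_bits = (((a & b) | (a & c) | (b & c)) << 1) & mask
    return sum_bits, carry_bits


def add_many(values, width=32):
    """Add a set of numbers with one CSA reused repeatedly, then one CPA."""
    mask = (1 << width) - 1
    operands = [v & mask for v in values]
    operands += [0] * (3 - len(operands))      # pad short sets with zeros
    # The first three operands go through the CSA.
    sum_bits, carry_bits = carry_save_add(*operands[:3], width=width)
    # The captured sum and carry are fed back with one new operand per pass.
    for value in operands[3:]:
        sum_bits, carry_bits = carry_save_add(sum_bits, carry_bits, value, width)
    # The final carry-propagate addition.
    return (sum_bits + carry_bits) & mask
```

Only the last step performs a full carry propagation; every earlier pass has a delay independent of the operand width, which is the property that makes the arrangement attractive.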
Wallace extended the use of CSAs from adding three inputs to adding an arbitrary number of values simultaneously, while having only a single carry-propagate path. One application of the Wallace-tree (as it came to be known) is high-performance hardware multipliers. Generally, a Wallace-tree consists of successive levels of CSAs, each level reducing the number of values being added by 3:2, since each CSA takes three inputs and produces two outputs. At the bottom of the tree a CPA is used to add the last carry/sum pair. Wallace taught the omission of any latches within the tree. The degenerate case of a Wallace-tree, corresponding to a dedicated three-input adder, requires only a single level of CSA prior to a CPA.
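The level-by-level 3:2 reduction can likewise be modeled behaviorally. In the sketch below, which is an illustration under stated assumptions rather than a description of any particular circuit, operands are grouped in threes at each level, leftover operands pass through unchanged, and reduction stops when two values remain for the CPA. The function name and the `width` parameter are hypothetical.

```python
def wallace_reduce(values, width=32):
    """Sum `values` with levels of 3:2 carry-save adders and a final CPA.

    Returns (total, levels), where `levels` is the number of CSA levels
    used and `total` is the sum modulo 2**width.
    """
    mask = (1 << width) - 1
    operands = [v & mask for v in values]
    levels = 0
    while len(operands) > 2:
        next_level = []
        full_groups = len(operands) // 3
        for i in range(full_groups):
            a, b, c = operands[3 * i:3 * i + 3]
            sum_bits = a ^ b ^ c
            carry_bits = (((a & b) | (a & c) | (b & c)) << 1) & mask
            next_level += [sum_bits, carry_bits]   # three inputs become two
        next_level += operands[3 * full_groups:]   # leftovers pass through
        operands = next_level
        levels += 1
    # The single carry-propagate addition at the bottom of the tree.
    return sum(operands) & mask, levels
```

With three operands the loop runs once, matching the degenerate single-level case noted above; with nine operands the counts per level are 9, 6, 4, 3, and 2, so four CSA levels precede the CPA.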
In "Introduction to Arithmetic for Digital Systems Designers" (Holt, Rinehart and Winston, 1982, pp. 103-104), S. Waser and M. J. Flynn describe a three-input adder consisting of a CSA followed by a CPA that uses a carry-look-ahead. For small bit-widths or low performance applications, a ripple-carry CPA could be substituted for the carry-look-ahead CPA.
U.S. Pat. No. 4,783,757 ('757) THREE INPUT BINARY ADDER, to Krauskopf, issued Nov. 8, 1988, teaches a carry-save adder followed by carry-propagate adder for adding three operands of 32 bits. U.S. Pat. No. '757 teaches the use of a full adder at a 33rd bit position (bit<32>) of the carry-propagate adder for generating an overall carry. (There are 33 full adders in the CPA, overall.) U.S. Pat. No. '757 also discloses an alternate embodiment that describes a segment limit checking "adder." This limit check adder comprises a mostly 2-input carry-save adder with least significant bit (lsb) provisions for a third input having the values 0, 1, 2, or 3. The carry-save adder is followed by a carry-chain. For 32-bit operands, the carry-save adder uses 30, two-input, circuits for bits<31..2> (bits 31 through 2), one three-input circuit for bit<1>, and no circuit for bit<0>. The carry-chain has 32 (for bits<31..0>) carry-circuits corresponding to a full-adder, but the sum logic is not present. The 3-inputs (one being the carry-in) of the bit<0> carry-circuit are used for the lsb of the three operands being added. An OR gate, combining the bit<31> carries of the carry-chain and the carry-save circuits, generates the overall carry for the segment limit checking adder.
Blaauw describes a variety of fast adder techniques in chapter 2, using APL notation. Hennessy and Patterson discuss fast adder techniques in section A.8. Fast adder techniques, including conditional-sum methods, are covered in chapter 3 of Hwang and chapter 3 of Waser and Flynn. All of these texts cover carry-look-ahead.
Conventional arithmetic circuits are designed to deal with all possible input operands. The extent to which a priori restrictions on input operands have been exploited is limited. New techniques for implementing arithmetic circuits for special classes of inputs are needed to decrease circuit size and increase efficiency.