Pipelined processors decompose the execution of instructions into multiple successive stages, such as fetch, decode, and execute. Each stage of execution is designed to perform its work within the processor's basic machine cycle. Hardware is dedicated to performing the work defined by each stage. As the number of stages is increased, while keeping the work done by the instruction constant, the processor is said to be more heavily pipelined. Each instruction progresses from stage to stage, ideally with another instruction progressing in lockstep only one stage behind. Thus, there can be as many instructions in execution as there are pipeline stages.
The major attribute of a pipelined processor is that a throughput of one instruction per cycle can be obtained, though when viewed in isolation, each instruction requires as many cycles to perform as there are pipeline stages. Pipelining is viewed as an architectural technique for improving performance over what can be achieved via process or circuit design improvements.
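The distinction between per-instruction latency and overall throughput can be illustrated with a minimal sketch of an ideal pipeline, assuming no bubbles of any kind. The function names and the stage counts used in the usage note are assumptions chosen for illustration only.

```python
def pipeline_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed to complete a bubble-free instruction stream (idealized model)."""
    if num_instructions == 0:
        return 0
    # The first instruction needs one cycle per stage; every later instruction
    # completes one cycle after its predecessor.
    return num_stages + (num_instructions - 1)


def throughput(num_instructions: int, num_stages: int) -> float:
    """Average instructions completed per cycle under the same idealized model."""
    return num_instructions / pipeline_cycles(num_instructions, num_stages)
```

Under this model, a single instruction on a five-stage pipeline still takes five cycles, while a long sequential stream approaches, but never quite reaches, one instruction per cycle.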
The increased throughput promised by the pipeline technique is easily achieved for sequential control flow. Unfortunately, programs experience changes in control flow as frequently as one out of every three executed instructions. Taken branch instructions are a principal cause of changes in control flow. Taken branches include both conditional branches that are ultimately decided as taken and unconditional branches. Taken branches are not recognized as such until the later stages of the pipeline. If the change in control flow were not anticipated, there would be instructions already in the earlier pipeline stages which, due to the change in control flow, would not be the correct instructions to execute. These undesired instructions must be cleared from each stage. In keeping with the pipeline metaphor, the instructions are said to be flushed from the pipeline.
The instructions to be first executed where control flow resumes following a taken branch are termed the branch target instructions (target instructions). The first of the target instructions is at the branch target address (target address). If the target instructions are not introduced into the pipeline until after the taken branch is recognized as such and the target address is calculated, there will be stages in the pipeline that are not doing any useful work. Since this absence of work propagates from stage to stage, the term pipeline bubble is used to describe this condition. The throughput of the processor suffers whenever such bubbles occur.
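The effect of such bubbles on throughput can be estimated with a simple model, offered here only as a sketch. It assumes that every unanticipated taken branch costs a fixed number of bubble cycles and that no other stalls occur; the function name and the three-cycle penalty in the usage note are hypothetical.

```python
def cycles_per_instruction(taken_fraction: float, bubble_cycles: int) -> float:
    """Average cycles per instruction when each taken branch inserts a fixed bubble.

    taken_fraction: fraction of executed instructions that are taken branches.
    bubble_cycles:  idle cycles inserted per unanticipated taken branch (assumed fixed).
    """
    # One cycle per instruction in the ideal case, plus the average bubble cost.
    return 1.0 + taken_fraction * bubble_cycles
```

For example, if one out of every three executed instructions changed control flow and each change cost a hypothetical three cycles, `cycles_per_instruction(1 / 3, 3)` would be about 2.0, halving the ideal throughput of one instruction per cycle.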
Branch Prediction Caches (BPCs), also known as Branch Target Buffers (BTBs), are designed to reduce the occurrence of pipeline bubbles by anticipating taken branches. BPCs store information about branches that have been previously encountered. An associative memory is provided in which an associatively addressed tag array holds the address (or a closely related address) of recent branch instructions. The data fields associated with each tag entry may include information on the target address, the history of the branch (taken/not taken), and branch target instruction bytes. The history information may take the form of N bits of state (N is typically 2), which allows an N-bit counter to be set up for each branch tracked by the BPC.
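One common way to use N bits of history state, assumed here for illustration, is a saturating up/down counter whose upper half of values predicts taken. The class name, the initial value, and the threshold rule below are assumptions of this sketch and are not taken from any cited reference.

```python
from dataclasses import dataclass


@dataclass
class SaturatingCounter:
    bits: int = 2   # N bits of history state per tracked branch
    value: int = 1  # arbitrary initial state: weakly not-taken

    def predict_taken(self) -> bool:
        # Predict taken when the counter is in the upper half of its range.
        return self.value >= (1 << (self.bits - 1))

    def update(self, taken: bool) -> None:
        # Count up on taken, down on not-taken, saturating at both ends.
        top = (1 << self.bits) - 1
        if taken:
            self.value = min(self.value + 1, top)
        else:
            self.value = max(self.value - 1, 0)
```

With two bits, a branch that has saturated toward taken must be observed not-taken twice in a row before the prediction flips, which tolerates a single anomalous outcome such as a loop exit.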
The fetch addresses used by the processor are coupled to the branch address tags. If a hit occurs, the instruction at the fetch address causing the hit is presumed to be a previously encountered branch. The history information is accessed and a prediction on the direction of the branch is made based on a predetermined algorithm. If the branch is predicted not taken, then the pipeline continues as usual for sequential control flow. If the branch is predicted taken, fetching is performed from the target address instead of the next sequential fetch address. If target instruction bytes were cached, then these bytes are retrieved directly from the BPC. By using a BPC, many changes in control flow are anticipated, such that the target instructions of taken branches contiguously follow such branches in the pipeline. When anticipated correctly, changes in control flow due to taken branches do not cause pipeline bubbles and the associated reduction in processor throughput. Such bubbles occur only when branches are mispredicted.
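The lookup sequence just described can be sketched as follows. This is a simplified model under stated assumptions: a Python dictionary stands in for the associative tag array, the history is a 2-bit counter, entries are allocated only when a branch is first resolved as taken, and the fixed `fetch_width` of four bytes is hypothetical. Caching of target instruction bytes is omitted.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BpcEntry:
    target_address: int
    counter: int = 2  # 2-bit history state, 0..3; values 2 and 3 predict taken


class BranchPredictionCache:
    def __init__(self, fetch_width: int = 4):
        self.entries: Dict[int, BpcEntry] = {}  # tag (branch address) -> data fields
        self.fetch_width = fetch_width          # hypothetical sequential step

    def next_fetch_address(self, fetch_address: int) -> int:
        entry = self.entries.get(fetch_address)
        if entry is not None and entry.counter >= 2:
            # Hit and predicted taken: redirect fetch to the cached target.
            return entry.target_address
        # Miss, or hit predicted not taken: continue sequentially.
        return fetch_address + self.fetch_width

    def resolve(self, branch_address: int, taken: bool, target_address: int = 0) -> None:
        entry = self.entries.get(branch_address)
        if entry is None:
            if taken:  # allocation policy is an assumption of this sketch
                self.entries[branch_address] = BpcEntry(target_address, counter=2)
            return
        if taken:
            entry.target_address = target_address
            entry.counter = min(entry.counter + 1, 3)
        else:
            entry.counter = max(entry.counter - 1, 0)
```

In this model, a fetch address that misses the cache simply advances sequentially; once a branch at that address has been resolved as taken, the next fetch of the same address is redirected to the recorded target.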
Conventionally, instructions fetched from the predicted direction (either taken or not-taken) of a branch are not allowed to modify the state of the machine until the branch direction is resolved. Operations normally may proceed only up to the point at which their results would be written in a way that modifies the programmer-visible state of the machine. If the branch is actually mispredicted, then the processor can flush the pipeline and begin anew in the correct direction, without any trace of having predicted the branch incorrectly. Further instruction issue must be suspended until the branch direction is resolved. A pipeline interlock is thus provided to handle this instruction dependency. Waiting for resolution of the actual branch direction is thus another source of pipeline bubbles.
It is possible to perform speculative execution (also known as conditional, or out-of-order execution) past predicted branches, if additional state is provided for backing up the machine state upon mispredicted branches. In machines performing speculative execution, branch prediction hardware must be designed to account for the possibility that a branch will be resolved as mispredicted. Branch prediction hardware is more complex as a result. Speculative execution beyond an unresolved branch can be done whether the branch is predicted taken or not-taken. An unresolved branch is a branch whose true taken or not-taken status has yet to be decided. Such branches are also known as outstanding branches.
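The role of the additional backup state can be illustrated with a deliberately simple sketch. It assumes that a full copy of the register state is saved at each predicted branch and that branches resolve oldest first; these choices, and all names below, are assumptions made for clarity and do not describe the mechanism of any cited patent.

```python
class SpeculativeMachine:
    def __init__(self):
        self.registers = {}    # programmer-visible state (simplified)
        self.checkpoints = []  # one saved copy per outstanding branch, oldest first

    def predict_branch(self) -> None:
        # Save the state as it stood when the branch was predicted.
        self.checkpoints.append(dict(self.registers))

    def write(self, reg: str, value: int) -> None:
        # Speculative writes are applied immediately; checkpoints allow undoing them.
        self.registers[reg] = value

    def resolve_oldest(self, mispredicted: bool) -> None:
        saved = self.checkpoints.pop(0)
        if mispredicted:
            # Back up to the state at the mispredicted branch; any younger
            # outstanding branches were on the wrong path and are discarded.
            self.registers = saved
            self.checkpoints.clear()
```

If the oldest outstanding branch resolves as correctly predicted, its checkpoint is simply dropped and the speculative writes stand; if it resolves as mispredicted, all writes made after it are undone in one step.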
Pipelining is extensively examined in "The Architecture of Pipelined Computers," by Peter M. Kogge (McGraw-Hill, 1981). A more recent treatment is provided by chapter 6 of "Computer Architecture, A Quantitative Approach," by J. L. Hennessy and D. A. Patterson (Morgan Kaufmann, 1990). Branch prediction and the use of a BTB are taught in section 6.7 of the Hennessy text. The Hennessy text chapter references provide pointers to several notable pipelined machines and to several contemporary papers on reducing branch delays. D. R. Ditzel and H. R. McLellan, "Branch folding in the CRISP microprocessor: Reducing the branch delay to zero," Proceedings of the 14th Symposium on Computer Architecture, June 1987, Pittsburgh, pg. 2-7, provides a short historical overview of hardware branch prediction. J. K. F. Lee and A. J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," IEEE Computer, Vol. 17, January 1984, pg. 6-22, provides a thorough introduction to branch prediction. Two excellent recent reports are "Branch Strategy Taxonomy and Performance Models," by Harvey G. Cragon (IEEE Computer Society Press, 1992) and "Survey of Branch Prediction Strategies," by C. O. Stjernfeldt, E. W. Czeck, and D. R. Kaeli (Northeastern University technical report CE-TR-93-05, Jul. 28, 1993).
The principles of out-of-order execution are also well known in the art. As background, out-of-order execution in the IBM System/360 Model 91 is discussed in section 6.6.2 of Kogge. The January 1967 issue of the IBM Journal of Research and Development was devoted to the Model 91. U.S. Pat. No. 5,226,126, ('126) PROCESSOR HAVING PLURALITY OF FUNCTIONAL UNITS FOR ORDERLY RETIRING OUTSTANDING OPERATIONS BASED UPON ITS ASSOCIATED TAGS, to McFarland et al., issued Jul. 6, 1993, which is assigned to the assignee of the present invention, describes speculative execution in the system in which the instant invention is used, and is hereby incorporated by reference.
U.S. Pat. No. 5,093,778, ('778) INTEGRATED SINGLE STRUCTURE BRANCH PREDICTION CACHE, to Favor et al., issued Mar. 3, 1992, which is assigned to the assignee of the present invention, teaches the implementation of the various components comprising a branch prediction cache as one integrated structure, and is hereby incorporated by reference. An integrated structure provides for reduced interconnect delays and lower die costs, due to smaller size. The '778 BPC was designed for use in a processor that uses out-of-order (speculative) execution.
"Improving the Accuracy of Dynamic Branch Prediction using Branch Correlation," by Shien-Tai Pan et al., ACM ASPLOS V Conference Proceedings, June 1992, pg. 76-84, teaches the use of correlation-based branch prediction tables. (This article appears to be an abridged version of "Correlation-Based Branch Prediction," Technical Report, UT-CERC-TR-JTR91-01, University of Texas at Austin, August, 1991.) Correlation-based branch prediction tables offer the promise of improved branch prediction accuracy for integer workloads. In correlation-based branch prediction tables, the address used to access the branch prediction table has two parts. One part is obtained from a portion (e.g., the least significant portion) of the branch address. A second part is obtained from a shift register that maintains the taken/not-taken history of the most recent branches.
The Pan et al. article reported simulation results for traces obtained from 3 floating-point and 4 integer SPEC benchmarks running on an IBM RISC System/6000. A non-correlation counter-based branch prediction table (BPT) scheme was compared against an 8-bit shift register correlation scheme for these benchmarks. A non-correlation counter scheme, a 5-bit shift register correlation scheme, and a 10-bit shift register correlation scheme were compared over a large range of table entries for one of the integer benchmarks. Finally, a non-correlation counter scheme was compared to a 15-bit shift register "degenerate" scheme, in which no branch address bits were used. It was concluded that increasing the table size above 2K entries was not particularly beneficial and that a shift register of 5 to 8 bits would offer the "best improvement in accuracy" over a non-correlation counter scheme.
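The two-part table address used by correlation-based prediction tables can be sketched as follows. This is an illustrative model only: the order in which the two parts are concatenated, the 2-bit counters stored in the table, and the default widths are assumptions of the sketch, not details confirmed by the Pan et al. article.

```python
class CorrelatedPredictor:
    def __init__(self, address_bits: int = 6, history_bits: int = 4):
        self.address_bits = address_bits
        self.history_bits = history_bits
        self.history = 0  # shift register of recent taken/not-taken outcomes
        # One 2-bit counter per table entry, initialized weakly not-taken.
        self.table = [1] * (1 << (address_bits + history_bits))

    def index(self, branch_address: int) -> int:
        # Part one: least significant bits of the branch address.
        low = branch_address & ((1 << self.address_bits) - 1)
        # Part two: the global history shift register (placed in the low bits here).
        return (low << self.history_bits) | self.history

    def predict(self, branch_address: int) -> bool:
        return self.table[self.index(branch_address)] >= 2

    def update(self, branch_address: int, taken: bool) -> None:
        i = self.index(branch_address)
        self.table[i] = min(self.table[i] + 1, 3) if taken else max(self.table[i] - 1, 0)
        # Shift the newest outcome into the history, keeping history_bits bits.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

In this sketch, `history_bits` plays the role of the fixed shift-register size. A branch that strictly alternates between taken and not-taken, which defeats a single per-address counter, becomes predictable once the history selects a different counter for each phase of the pattern.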
Beyond the trace-driven simulation evaluation approach described in the article, Pan et al. do not teach how to select the fixed shift-register size for other processor architectures or other instruction mixes. The selection of the fixed shift-register size is thus a problem for designers wanting to use the Pan correlation-based BPT scheme in other processor architectures. The SPEC benchmarks may not typify the instruction mix on the design architecture. A representative mix may not be practical to obtain, or its evaluation may not be practical due to the design schedule. Also, substantially different instruction mixes may be run by different users of a processor, or at different times by the same user. The designers face the risk that the fixed value chosen may not work out well in production.
Pan et al. do not mention the use of branch correlation based branch prediction with a conventional branch prediction cache. Thus there is no teaching of whether there is any advantage to using both techniques in some combination.
Pan et al. do not mention the use of branch correlation based branch prediction with instruction decode information. Thus there is no teaching of whether there is any advantage to using information about the kind of branch combined with the branch history information.
Pan et al. do not mention the use of branch correlation based branch prediction with speculative execution. Thus there is no teaching of how a correlation based scheme should be adapted for use in a processor that performs speculative execution.
Stjernfeldt et al. mention an article by T. Yeh and Y. N. Patt, "Alternative Implementations of Two-level Adaptive Branch Prediction," Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 124-134, May, 1992, and describe the correlation and the two-level adaptive techniques as being closely related. These two techniques are classified and compared within a broader collection of related branch prediction techniques in a second article by T. Yeh and Y. N. Patt, "A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History," Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 257-266, May, 1993. The term "adaptive" in the Yeh et al. articles is used synonymously with "dynamic," and merely connotes that the taken or not-taken prediction for each branch is adapted according to various aspects of the past behavior of the executing program. The prediction is an output of the prediction algorithm as embodied in the prediction hardware. While the prediction adapts to the program behavior according to the prediction algorithm, the prediction hardware and algorithm themselves are invariant with program behavior. There is no teaching in the Yeh et al. articles or the Pan et al. article of reconfiguring the branch prediction hardware in dynamic response to program behavior or under software control.
The first Yeh et al. article also describes the use of opcode information to define sets of branch history information for purposes of addressing. Again, the prediction is an output of the prediction algorithm as embodied in the prediction hardware. While opcode information is used to address different sets of history information, the prediction hardware and algorithm themselves are invariant with instruction execution. There is no teaching in the Yeh et al. article of reconfiguring the branch prediction hardware in dynamic response to instruction decode information.