The invention relates to reducing pipeline delays in high performance processors by anticipating taken branches through branch prediction. More particularly, the invention relates to optimizing branch prediction accuracy through configurable branch prediction hardware. The invention further relates to the use of branch prediction in a processor that performs speculative execution. The invention also relates to combining correlation-based branch prediction with information obtained from a conventional branch prediction cache or from knowledge of the type of branch gained from the instruction decoder.
Pipeline processors decompose the execution of instructions into multiple successive stages, such as fetch, decode, and execute. Each stage of execution is designed to perform its work within the processor's basic machine cycle. Hardware is dedicated to performing the work defined by each stage. As the number of stages is increased, while keeping the work done by the instruction constant, the processor is said to be more heavily pipelined. Each instruction progresses from stage to stage, ideally with another instruction progressing in lockstep only one stage behind. Thus, there can be as many instructions in execution as there are pipeline stages.
The major attribute of a pipelined processor is that a throughput of one instruction per cycle can be obtained, though when viewed in isolation, each instruction requires as many cycles to perform as there are pipeline stages. Pipelining is viewed as an architectural technique for improving performance over what can be achieved via process or circuit design improvements.
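By way of illustration only, and assuming an ideal pipeline of k single-cycle stages with no stalls (the symbols k, n, and T are introduced here and do not appear in the original text), the number of cycles needed to complete n instructions may be written as:

```latex
T(n) = k + (n - 1), \qquad \lim_{n \to \infty} \frac{n}{T(n)} = 1 \ \text{instruction per cycle}
```

The first instruction takes k cycles and each following instruction completes one cycle later, which is why throughput approaches one instruction per cycle even though the latency of each instruction remains k cycles.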
The increased throughput promised by the pipeline technique is easily achieved for sequential control flow. Unfortunately, programs experience changes in control flow as frequently as one out of every three executed instructions. Taken branch instructions are a principal cause of changes in control flow. Taken branches include both conditional branches that are ultimately decided as taken and unconditional branches. Taken branches are not recognized as such until the later stages of the pipeline. If the change in control flow were not anticipated, there would be instructions already in the earlier pipeline stages which, due to the change in control flow, would not be the correct instructions to execute. These undesired instructions must be cleared from each stage. In keeping with the pipeline metaphor, the instructions are said to be flushed from the pipeline.
The instructions to be first executed where control flow resumes following a taken branch are termed the branch target instructions (target instructions). The first of the target instructions is at the branch target address (target address). If the target instructions are not introduced into the pipeline until after the taken branch is recognized as such and the target address is calculated, there will be stages in the pipeline that are not doing any useful work. Since this absence of work propagates from stage to stage, the term pipeline bubble is used to describe this condition. The throughput of the processor suffers whenever such bubbles occur.
Branch Prediction Caches (BPCs), also known as Branch Target Buffers (BTBs), are designed to reduce the occurrence of pipeline bubbles by anticipating taken branches. BPCs store information about branches that have been previously encountered. An associative memory is provided in which an associatively addressed tag array holds the address (or a closely related address) of recent branch instructions. The data fields associated with each tag entry may include information on the target address, the history of the branch (taken/not taken), and branch target instruction bytes. The history information may take the form of N bits of state (N is typically 2), which allows an N-bit counter to be set up for each branch tracked by the BPC.
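By way of illustration only, the N-bit history state described above may be modeled as a saturating up/down counter. The following Python sketch is not taken from any cited reference; the class name, the initial state, and the rule that the upper half of the counter range predicts taken are assumptions made for the example.

```python
class SaturatingCounter:
    """N-bit up/down counter that saturates at 0 and 2**bits - 1."""

    def __init__(self, bits=2, value=None):
        self.max_value = (1 << bits) - 1
        # Assumed initial state: just below the midpoint (weakly not-taken).
        self.value = self.max_value // 2 if value is None else value

    def predict_taken(self):
        # Assumption: the upper half of the range predicts taken.
        return self.value > self.max_value // 2

    def update(self, taken):
        # Count up on a taken branch, down on a not-taken branch.
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)
```

With N equal to 2, a branch must be resolved in the opposite direction twice in succession before a saturated counter changes its prediction, so a single atypical outcome does not disturb a well-established pattern.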
The fetch addresses used by the processor are coupled to the branch address tags. If a hit occurs, the instruction at the fetch address causing the hit is presumed to be a previously encountered branch. The history information is accessed and a prediction on the direction of the branch is made based on a predetermined algorithm. If the branch is predicted not taken, then the pipeline continues as usual for sequential control flow. If the branch is predicted taken, fetching is performed from the target address instead of the next sequential fetch address. If target instruction bytes were cached, then these bytes are retrieved directly from the BPC. With a BPC, many changes in control flow are anticipated, such that the target instructions of taken branches contiguously follow such branches in the pipeline. When anticipated correctly, changes in control flow due to taken branches do not cause pipeline bubbles and the associated reduction in processor throughput. Such bubbles occur only when branches are mispredicted.
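By way of illustration only, the lookup flow described above may be sketched in Python as follows. A dictionary keyed by branch address stands in for the associative tag array. The names BpcEntry, BranchPredictionCache, next_fetch, and resolve, the fixed fetch increment, the allocation policy, and the initial counter value are all assumptions made for the example; replacement of entries and cached target instruction bytes are not modeled.

```python
from dataclasses import dataclass


@dataclass
class BpcEntry:
    target: int       # cached branch target address
    counter: int = 2  # 2-bit history state 0..3; starting at 2 is an assumption


class BranchPredictionCache:
    def __init__(self, fetch_width=4):
        # A dict keyed by branch address stands in for the associative tag array.
        self.entries = {}
        self.fetch_width = fetch_width  # hypothetical sequential fetch increment

    def next_fetch(self, fetch_addr):
        """Return the address to fetch after fetch_addr."""
        entry = self.entries.get(fetch_addr)
        if entry is not None and entry.counter >= 2:
            return entry.target  # hit and predicted taken
        return fetch_addr + self.fetch_width  # miss, or predicted not taken

    def resolve(self, branch_addr, taken, target):
        """Update the cache once the branch direction is known."""
        entry = self.entries.get(branch_addr)
        if entry is None:
            if taken:  # assumed policy: allocate only for taken branches
                self.entries[branch_addr] = BpcEntry(target)
            return
        if taken:
            entry.target = target
            entry.counter = min(entry.counter + 1, 3)
        else:
            entry.counter = max(entry.counter - 1, 0)
```

In this sketch, a fetch address that misses falls through to the next sequential address, and a hit redirects fetching to the cached target only when the history state predicts taken.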
Conventionally, instructions fetched from the predicted direction (either taken or not-taken) of a branch are not allowed to modify the state of the machine until the branch direction is resolved. Operations may normally proceed only up to the point at which their results would be written in a way that modifies the programmer-visible state of the machine. If the branch is actually mispredicted, then the processor can flush the pipeline and begin anew in the correct direction, without any trace of having predicted the branch incorrectly. Further instruction issue must be suspended until the branch direction is resolved. A pipeline interlock is thus provided to handle this instruction dependency. Waiting for resolution of the actual branch direction is thus another source of pipeline bubbles.
It is possible to perform speculative execution (also known as conditional, or out-of-order execution) past predicted branches, if additional state is provided for backing up the machine state upon mispredicted branches. In machines performing speculative execution, branch prediction hardware must be designed to account for the possibility that a branch will be resolved as mispredicted. Branch prediction hardware is more complex as a result. Speculative execution beyond an unresolved branch can be done whether the branch is predicted taken or not-taken. An unresolved branch is a branch whose true taken or not-taken status has yet to be decided. Such branches are also known as outstanding branches.
Pipelining is extensively examined in "The Architecture of Pipelined Computers," by Peter M. Kogge (McGraw-Hill, 1981). A more recent treatment is provided by chapter 6 of "Computer Architecture, A Quantitative Approach," by J. L. Hennessy and D. A. Patterson (Morgan Kaufmann, 1990). Branch prediction and the use of a BTB are taught in section 6.7 of the Hennessy text. The Hennessy text chapter references provide pointers to several notable pipelined machines and to several contemporary papers on reducing branch delays. D. R. Ditzel and H. R. McLellan, "Branch folding in the CRISP microprocessor: Reducing the branch delay to zero," Proceedings of the 14th Symposium on Computer Architecture, June 1987, Pittsburgh, pg. 2-7, provides a short historical overview of hardware branch prediction. J. K. F. Lee and A. J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," IEEE Computer, Vol. 17, January 1984, pg. 6-22, provides a thorough introduction to branch prediction. Two recent excellent reports include "Branch Strategy Taxonomy and Performance Models," by Harvey G. Cragon (IEEE Computer Society Press, 1992) and "Survey of Branch Prediction Strategies," by C. O. Stjernfeldt, E. W. Czeck, and D. R. Kaeli (Northeastern University technical report CE-TR-93-05, Jul. 28, 1993).
The principles of out-of-order execution are also well known in the art. As background, out-of-order execution in the IBM System/360 Model 91 is discussed in section 6.6.2 of Kogge. The January 1967 issue of the IBM Journal of Research and Development was devoted to the Model 91. U.S. Pat. No. 5,226,126, ('126) PROCESSOR HAVING PLURALITY OF FUNCTIONAL UNITS FOR ORDERLY RETIRING OUTSTANDING OPERATIONS BASED UPON ITS ASSOCIATED TAGS, to McFarland et al., issued Jul. 6, 1993, which is assigned to the assignee of the present invention, describes speculative execution in the system in which the instant invention is used, and is hereby incorporated by reference.
U.S. Pat. No. 5,093,778, ('778) INTEGRATED SINGLE STRUCTURE BRANCH PREDICTION CACHE, to Favor et al., issued Mar. 3, 1992, which is assigned to the assignee of the present invention, teaches the implementation of the various components comprising a branch prediction cache as one integrated structure, and is hereby incorporated by reference. An integrated structure provides for reduced interconnect delays and lower die costs, due to smaller size. The '778 BPC was designed for use in a processor that uses out-of-order (speculative) execution.
"Improving the Accuracy of Dynamic Branch Prediction using Branch Correlation," by Shien-Tai Pan et al., ACM ASPLOS V Conference Proceedings, June 1992, pg. 76-84, teaches the use of correlation-based branch prediction tables. (This article appears to be an abridged version of "Correlation-Based Branch Prediction," Technical Report, UT-CERC-TR-JTR91-01, University of Texas at Austin, August, 1991.) Correlation-based branch prediction tables offer the promise of improved branch prediction accuracy for integer workloads. In correlation-based branch prediction tables, the address used to access the branch prediction table has two parts. One part is obtained from a portion (e.g., the least significant portion) of the branch address. A second part is obtained from a shift register that maintains the taken/not-taken history of the most recent branches.
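By way of illustration only, the two-part table address described above may be sketched in Python as follows. This is not the implementation of Pan et al.; the class name, the bit widths, the placement of the history bits in the low-order part of the index, and the use of two-bit counters initialized to 1 are assumptions made for the example.

```python
class CorrelatedPredictor:
    def __init__(self, addr_bits=6, history_bits=4):
        self.addr_bits = addr_bits
        self.history_bits = history_bits
        self.history = 0  # shift register of recent taken/not-taken outcomes
        # One 2-bit counter per (address part, history) pair; 1 = weakly not-taken.
        self.table = [1] * (1 << (addr_bits + history_bits))

    def _index(self, branch_addr):
        # Part one: least significant bits of the branch address.
        addr_part = branch_addr & ((1 << self.addr_bits) - 1)
        # Part two: the current contents of the history shift register.
        return (addr_part << self.history_bits) | self.history

    def predict(self, branch_addr):
        return self.table[self._index(branch_addr)] >= 2

    def update(self, branch_addr, taken):
        i = self._index(branch_addr)
        c = self.table[i]
        self.table[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the resolved outcome into the history register.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

Because the history bits select among several counters for the same branch address, a branch that strictly alternates between taken and not-taken can be predicted correctly once trained, which a single counter per branch cannot do.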
The Pan et al. article reported simulation results for traces obtained from 3 floating-point and 4 integer SPEC benchmarks running on an IBM RISC System/6000. Comparison of a non-correlation counter-based BPT scheme was made against an 8-bit shift register for these benchmarks. Comparison of a non-correlation counter, a 5-bit shift register correlation scheme, and a 10-bit shift register correlation scheme, over a large range of table entries, was made for one of the integer benchmarks. Finally, a non-correlation counter scheme was compared to a 15-bit shift register "degenerate" scheme, in which no branch address bits were used. It was concluded that increasing the table size above 2K entries was not particularly beneficial and that a shift register of 5 to 8 bits would offer the "best improvement in accuracy" over a non-correlation counter scheme.
Beyond the trace-driven simulation evaluation approach described in the article, Pan et al. do not teach how to select the fixed shift-register size for other processor architectures or other instruction mixes. The selection of the fixed shift-register size is thus a problem for designers wanting to use the Pan correlation-based BPT scheme in other processor architectures. The SPEC benchmarks may not be representative of the instruction mix on the design architecture. A representative mix may not be practical to obtain, or its evaluation may not be practical due to the design schedule. Also, substantially different instruction mixes may be run by different users of a processor, or at different times by the same user. The designers face the risk that the fixed value chosen may not work out well in production.
Pan et al. do not mention the use of correlation-based branch prediction with a conventional branch prediction cache. Thus there is no teaching of whether there is any advantage to using both techniques in some combination.
Pan et al. do not mention the use of correlation-based branch prediction with instruction decode information. Thus there is no teaching of whether there is any advantage to using information about the kind of branch combined with the branch history information.
Pan et al. do not mention the use of correlation-based branch prediction with speculative execution. Thus there is no teaching of how a correlation-based scheme should be adapted for use in a processor that performs speculative execution.
Stjernfeldt et al. mention an article by T. Yeh and Y. N. Patt, "Alternative Implementations of Two-level Adaptive Branch Prediction," Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 124-134, May, 1992, and describe the correlation and the two-level adaptive techniques as being closely related. These two techniques are classified and compared within a broader collection of related branch prediction techniques in a second article by T. Yeh and Y. N. Patt, "A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History," Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 257-266, May, 1993. The term "adaptive" in the Yeh et al. articles is used synonymously with "dynamic," and merely connotes that the taken or not-taken prediction for each branch is adapted according to various aspects of the past behavior of the executing program. The prediction is an output of the prediction algorithm as embodied in the prediction hardware. While the prediction adapts to the program behavior according to the prediction algorithm, the prediction hardware and algorithm themselves are invariant with program behavior. There is no teaching in the Yeh et al. articles or the Pan et al. article of reconfiguring the branch prediction hardware in dynamic response to program behavior or under software control.
The first Yeh et al. article also describes the use of opcode information to define sets of branch history information for purposes of addressing. Again, the prediction is an output of the prediction algorithm as embodied in the prediction hardware. While opcode information is used to address different sets of history information, the prediction hardware and algorithm themselves are invariant with instruction execution. There is no teaching in the Yeh et al. article of reconfiguring the branch prediction hardware in dynamic response to instruction decode information.
In a first aspect of the invention, branch prediction hardware, comprising logic and interconnect, is configurable via a control line to alter the manner in which the branch prediction is generated. The configuration can be done programmatically in software, or by hardware in response to processor events. Such processor events include the loading of the CS register and changes in the instruction workload.
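By way of illustration only, one way in which a configuration value might alter how a prediction is generated is sketched in Python below. The text above does not state which property of the hardware is configured; the choice of the number of history bits as the configured quantity, and the names ConfigurablePredictor, configure, MAX_HISTORY_BITS, and INDEX_BITS, are assumptions made solely for the example.

```python
class ConfigurablePredictor:
    MAX_HISTORY_BITS = 8  # hypothetical width of the history shift register
    INDEX_BITS = 10       # hypothetical table index width (1024 entries)

    def __init__(self):
        self.history = 0
        # Configuration value; 0 means no correlation (address bits only).
        self.history_bits = 0

    def configure(self, history_bits):
        """Model of a control input written by software or by hardware."""
        if not 0 <= history_bits <= self.MAX_HISTORY_BITS:
            raise ValueError("history_bits out of range")
        self.history_bits = history_bits

    def index(self, branch_addr):
        # Use only the configured number of history bits ...
        hist = self.history & ((1 << self.history_bits) - 1)
        # ... and fill the rest of the index with branch address bits.
        addr_bits = self.INDEX_BITS - self.history_bits
        addr = branch_addr & ((1 << addr_bits) - 1)
        return (addr << self.history_bits) | hist
```

In this sketch the table size stays fixed, and the configuration value only changes how the index is divided between branch address bits and history bits.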
In a second aspect of the invention, the directions of a plurality of branches are predicted based partly on resolved branch history information. Tentative branch history information is then stored for each of the predicted branches. When a predicted branch is resolved, the resolved branch history information is updated based on the stored tentative branch history information for the branch most recently resolved. Additionally, the predictions may be partly based on preceding unresolved branch predictions if any are outstanding.
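By way of illustration only, the bookkeeping described in this aspect may be sketched in Python as follows. The class and method names, the in-order resolution of branches, and the handling of a misprediction (discarding younger tentative entries and shifting the actual outcome into the resolved history) are assumptions made for the example and are not stated in the text above.

```python
from collections import deque


class SpeculativeHistory:
    def __init__(self, bits=4):
        self.mask = (1 << bits) - 1
        self.resolved = 0       # history built from resolved branches only
        self.pending = deque()  # (tentative_history, predicted_taken), oldest first

    def current(self):
        """History seen by the next prediction."""
        # If branches are outstanding, build on the newest tentative history.
        return self.pending[-1][0] if self.pending else self.resolved

    def record_prediction(self, predicted_taken):
        """Store tentative history for a newly predicted branch."""
        tentative = ((self.current() << 1) | int(predicted_taken)) & self.mask
        self.pending.append((tentative, predicted_taken))
        return tentative

    def resolve_oldest(self, actual_taken):
        """Resolve the oldest outstanding branch (in-order resolution assumed)."""
        tentative, predicted_taken = self.pending.popleft()
        if actual_taken == predicted_taken:
            # Correct prediction: the tentative history becomes resolved history.
            self.resolved = tentative
        else:
            # Assumed recovery: drop younger entries, shift in the real outcome.
            self.pending.clear()
            self.resolved = ((self.resolved << 1) | int(actual_taken)) & self.mask
```

Keeping the tentative values separate from the resolved history allows predictions made past an unresolved branch to see the predicted outcomes, while a misprediction can be repaired without leaving speculative outcomes in the resolved history.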
In a third aspect of the invention, Hit/Miss information from a Branch Prediction Cache (BPC) can optionally be used in formulating the next state value of an addressed two-bit counter stored in a correlation-based branch history table. Since a Miss in the BPC may indicate that this branch has not been encountered recently, whatever state currently exists can be optionally forced to a state that is based solely on whether the branch is resolved taken or not. This feature may be enabled and disabled under software control.
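By way of illustration only, the optional use of BPC Hit/Miss information in the next state computation may be sketched in Python as follows. The function name, the parameter names, and the particular forced states (2 for resolved taken, 1 for resolved not taken) are assumptions made for the example; the text above states only that the forced state depends solely on the resolved direction.

```python
def next_counter_state(counter, taken, bpc_hit, force_on_miss=True):
    """Next value of a two-bit counter (0..3) in the branch history table."""
    if force_on_miss and not bpc_hit:
        # BPC miss: ignore the current state, since it may belong to other
        # branches. Forced values 2 and 1 are assumptions for this sketch.
        return 2 if taken else 1
    # BPC hit, or feature disabled: ordinary saturating up/down update.
    return min(counter + 1, 3) if taken else max(counter - 1, 0)
```

The force_on_miss argument plays the role of the software-controlled enable, so the same function covers both the conventional update and the forced update.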
In a fourth aspect of the invention, information from the instruction decoder is optionally used to override the correlation-based branch history table based prediction for select branch instructions. This feature may be enabled and disabled under software or hardware control.
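By way of illustration only, a decoder-based override may be sketched in Python as follows. The text above does not identify which branch instructions are selected; the example assumes unconditional branches, which are always taken, and the function name and the branch kind strings are hypothetical.

```python
# Hypothetical decoder classifications for which the override applies.
ALWAYS_TAKEN_KINDS = frozenset({"unconditional_jump", "call", "return"})


def final_prediction(table_predicts_taken, branch_kind, override_enabled=True):
    """Combine the history table prediction with instruction decode information."""
    if override_enabled and branch_kind in ALWAYS_TAKEN_KINDS:
        # The decoder knows this branch is always taken; ignore the table.
        return True
    # Otherwise fall back to the correlation-based table prediction.
    return table_predicts_taken
```

The override_enabled argument models the software or hardware enable, so that the table prediction passes through unchanged when the feature is turned off.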