1. Field of the Invention
The present invention relates generally to computer system central processors and more particularly to predicting outcomes of conditional branch instructions.
2. Description of the Background Art
Computer designers have developed a number of techniques to improve the performance of various computer architectures. These techniques include forms of memory caching and hardware parallelism, including pipelining.
Pipelined processors decompose the interpretation and execution of instructions into separate operations that can be performed in parallel, or simultaneously. Each processor stage of a pipelined processor can, ideally, complete one operation from an instruction during each machine cycle and pass the instruction on to the next stage. In theory, the effective speed of a P-stage pipelined processor is thus P times the speed of a non-pipelined equivalent, since pipelined processors need not wait until one instruction is completely finished before execution of the next instruction can begin.
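The P-fold figure is a limiting value rather than one reached for short instruction sequences. The following minimal sketch illustrates this under assumptions that are not stated above: every stage takes exactly one machine cycle and there are no stalls. The function name `ideal_speedup` is hypothetical and chosen only for illustration.

```python
def ideal_speedup(stages: int, instructions: int) -> float:
    """Speedup of an ideal pipeline over a non-pipelined equivalent.

    Assumption: one cycle per stage, no stalls of any kind.
    """
    # Without pipelining, each instruction occupies all stages in turn.
    unpipelined_cycles = stages * instructions
    # With pipelining, the first instruction needs `stages` cycles to
    # pass through; each later one completes one cycle after the last.
    pipelined_cycles = stages + instructions - 1
    return unpipelined_cycles / pipelined_cycles


# The speedup of a 5-stage pipeline approaches 5 as the run gets longer.
for n in (1, 10, 1000, 1_000_000):
    print(n, ideal_speedup(5, n))
```

Under these assumptions a single instruction sees no speedup at all, and the ratio approaches P only as the instruction count becomes large relative to P.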
Various practical limitations on pipeline performance can prevent a pipelined processor from achieving this theoretical improvement in performance. One of the most important of these limitations occurs when the sequence of instructions to be executed is not known in advance. In particular, the instruction to be executed next after a conditional branch instruction may not be known for certain until after the conditional branch is executed. In this case, the pipeline will have to wait, and performance suffers.
This application makes use of the following convention: conditional branch instructions test a condition specified by the instruction. If the condition is true, then the branch is “taken” (T); that is, instruction execution begins at the new address specified by the instruction. If the condition is false, the branch is “not-taken” (N) and instruction execution continues with the instruction sequentially following the branch instruction. Since most program code contains a large number of such branches, their impact is very significant. Avoiding branch condition delay penalties is critical to improving pipelined processor performance.
Branch prediction is the anticipatory designation of the branch condition. By predicting the direction of the conditional branch, the processor can, while waiting for branch condition resolution, begin or prepare to begin execution of the next instructions in that path. In other words, branch prediction mechanisms guide the pre-fetching or the conditional issuance of instructions in a particular path in an attempt to keep the pipeline full and free from stalls. Branch target prediction, i.e. prediction of specific address offsets or specific instructions to be executed, is not addressed by this invention.
Accurate prediction of branch instructions is vital to the efficient use of pipelines. Mispredicting a branch results in discarding much speculative work and delays execution of a program. If instructions in a wrong path have been fetched and decoded, those instructions must be flushed from the pipeline. The pipeline must then be loaded with new instructions corresponding to the correct path before the execution unit can resume processing. Conversely, since the correct instructions were not predicted and started early, an opportunity to advance is missed. Thus, a poor branch prediction scheme can have severe penalties that neutralize the potential parallelism advantages of a long processor pipeline.
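The cost of misprediction can be estimated with a simple first-order model. The sketch below is an illustrative assumption, not a model taken from this description: it charges a fixed number of flush-and-refill cycles for every mispredicted branch and ignores all other stall sources. The function name `effective_cpi` and the numeric values are hypothetical.

```python
def effective_cpi(branch_fraction: float,
                  mispredict_rate: float,
                  flush_penalty_cycles: float,
                  base_cpi: float = 1.0) -> float:
    """Average cycles per instruction with a fixed misprediction penalty.

    Assumption: every misprediction costs the same number of cycles and
    no other hazards contribute to stalls.
    """
    # Fraction of instructions that are mispredicted branches, times the
    # cycles lost for each one, added to the stall-free baseline.
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


# Hypothetical workload: 20% branches, 10-cycle flush penalty.
poor = effective_cpi(0.20, 0.10, 10)   # 10% of branches mispredicted
good = effective_cpi(0.20, 0.02, 10)   # 2% of branches mispredicted
print(poor, good)
```

In this model the penalty term grows linearly with the flush penalty, which is consistent with the observation above that longer pipelines suffer more from a poor prediction scheme.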
Branch prediction techniques are typically categorized as static or dynamic. Static techniques make the same guess regarding branch direction each time a particular branch is encountered. One static branch prediction method simply assumes that all encountered branches follow a fixed assignment, i.e. they are either always “taken” or always “not-taken.” The validity of this assumption can vary greatly with the type of program being executed. For example, many branches are programmed merely for management of potential but rare error conditions, so for such branches it would usually be correct to predict “not-taken.” Another static method is to use only the direction of the branches to make a prediction. The branch is predicted to be “taken” if the branch is backward, i.e. the target address is earlier in the program listing than the branch instruction; otherwise the branch is predicted to be “not-taken.” This strategy detects loops in a program and works particularly well when loops are iterated many times, as in scientific programs with equation evaluation loops. Fairly high prediction accuracy is possible with static predictors for loop control branches, but the exit from the loop is incorrectly predicted by this strategy. Yet another static method is to use information from compilation and pre-execution of the program as a profile to guide branch prediction. Ideally, the compiler can assign a branch prediction to every branch in the program, but this approach has drawbacks, e.g. pre-execution takes time, and it is not widely used. In the applications where static predictors work well, the outcome of any one branch tends to be independent of the outcomes of other branches.
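The first two static methods described above can be sketched in a few lines. The function names, the addresses, and the ten-iteration loop are hypothetical values chosen for illustration; True stands for taken and False for not-taken.

```python
def predict_always_not_taken(branch_addr: int, target_addr: int) -> bool:
    """Fixed-assignment static prediction: every branch is not-taken."""
    return False


def predict_backward_taken(branch_addr: int, target_addr: int) -> bool:
    """Direction-based static prediction.

    A backward branch (target earlier than the branch) is predicted
    taken; a forward branch is predicted not-taken.
    """
    return target_addr < branch_addr


# Hypothetical loop-closing branch at 0x120 that jumps back to 0x100.
# It is taken nine times and falls through once at loop exit.
outcomes = [True] * 9 + [False]
loop_misses = sum(
    predict_backward_taken(0x120, 0x100) != taken for taken in outcomes
)
fixed_misses = sum(
    predict_always_not_taken(0x120, 0x100) != taken for taken in outcomes
)
print(loop_misses, fixed_misses)
```

The direction-based rule misses only the loop exit, as noted above, while the fixed not-taken rule misses every iteration except the exit.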
There are many workloads where control transfers are intensive and thus the relation between branches is not as simple as the situations described above. The outcomes of branch decisions for such applications are usually neither constant nor looping, but are strongly affected by their own past histories and by the outcomes of preceding branches. Static branch prediction methods are therefore generally not adequate for accurately predicting actual program behaviors, and in some cases can actually reduce the branch prediction accuracy below that achievable by mere chance.
Dynamic branch prediction schemes differ from static schemes in that they base their predictions on the actual run-time behaviors of program branches. The execution sequence that a program follows can vary in ways that cannot be predicted by static algorithms. Different input data during different program runs can cause differences in execution sequences that neither optimizing compilers nor static mechanisms can successfully predict. A branch might also execute consistently one way in one part of a given program run, but the other way in another part of the run, so only a branch predictor that adapts to these changes during execution can make accurate predictions. Although branch outcomes are variable, they are usually not the result of random activities; most of the time they are correlated with past branch behavior. By keeping track of the history of branch outcomes, it is possible to anticipate with a high degree of certainty which direction future branches will take, and therefore to optimize program execution. Dynamic branch predictors are popular because they can be implemented entirely in hardware and can therefore accurately predict branches without changes to the processor instruction set or to compiled programs. All of the various dynamic branch prediction methods that have been proposed use the history of previous branches to predict how a current branch will behave.
Bimodal Predictor
Most conditional branches behave in a bimodally biased manner; they are either “taken” most of the time, or “not-taken” most of the time. The assumption that the most recent branch directions represent the probable next branch direction is usually valid, so the past behavior of the branch can provide some predictability about the future behavior of the branch. In U.S. Pat. No. 4,370,711 by Smith, an array of 2-bit saturating up/down counters is proposed to store information about the recent history of each branch in a program. FIG. 1 is a block diagram of a typical bimodal predictor. Each counter 10 in the branch counter array 12 is addressed by the low order bits of the address of the branch instruction on line 14 in the program counter; building a full array addressed directly by all the program counter bits would be uneconomical. FIG. 2 shows a state diagram for a 2-bit saturating up/down counter. When a particular branch is “taken,” the respective counter is incremented. Each time the branch is “not-taken,” its counter is decremented. The counters saturate, i.e. they do not count above three or below zero. Counter saturation guarantees that the predictor can adapt relatively quickly to new programs, phases of execution, or input data, in contrast to simple static predictors. A count in the upper half of the range (10 or 11) predicts that the branch will be taken; a count in the lower half (00 or 01) predicts that it will not. Branches are thus binned into four categories, strongly-taken, weakly-taken, weakly not-taken, and strongly not-taken.
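A bimodal predictor of this kind can be modeled in software as follows. This is a minimal sketch, not the circuit of FIG. 1: the class name `BimodalPredictor`, the table size, the initial counter value of 01, and the example address are all assumptions made for illustration.

```python
class BimodalPredictor:
    """Array of 2-bit saturating up/down counters indexed by the
    low-order bits of the branch instruction address."""

    def __init__(self, index_bits: int = 10, initial_count: int = 1):
        self.mask = (1 << index_bits) - 1
        # Assumption: every counter starts at 01 (weakly not-taken).
        self.counters = [initial_count] * (1 << index_bits)

    def predict(self, branch_addr: int) -> bool:
        # Counts 10 and 11 predict taken; 00 and 01 predict not-taken.
        return self.counters[branch_addr & self.mask] >= 2

    def update(self, branch_addr: int, taken: bool) -> None:
        i = branch_addr & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0


predictor = BimodalPredictor(index_bits=4)
branch = 0x2A4                            # hypothetical branch address
for _ in range(5):
    predictor.update(branch, taken=True)  # counter climbs to 3 and stays
predictor.update(branch, taken=False)     # one anomalous outcome: 3 -> 2
still_taken = predictor.predict(branch)   # weakly-taken still predicts taken
```

Because the counter saturates at three, a long run of taken outcomes followed by a single not-taken outcome leaves the branch in the weakly-taken state.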
Smith observed empirically that a 2-bit counter provides an appropriate amount of damping to changes in branch direction. A 1-bit counter simply records the single most recently executed branch direction and does not average recent executions. The 2-bit counter captures more of the recent branch history, so the predictor is more tolerant of a branch going in an anomalous direction. For example, with the stream of branch executions . . . NNNTNNN . . . , the 1-bit predictor gives two mispredictions, the first when the branch is anomalously “taken,” and the second when it is subsequently “not-taken.” Use of a 2-bit counter results in only one incorrect prediction in this situation. Generally, the 2-bit counter's branch prediction should not reverse during an extended bimodally biased sequence unless the branch goes the unlikely direction twice in a row. There are exceptional situations that can cause Smith's 2-bit counter-based predictor to predict wrongly all the time, e.g. the alternating sequence TNTNTN . . . , when starting from initial state 01. Such situations are rare, though, so 3-bit or higher counters do not appear to offer any significant advantage over 2-bit counters, considering their additional hardware cost.
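The misprediction counts quoted above can be checked with a short simulation of a single saturating counter. The helper name `count_mispredictions` is hypothetical, and starting the counter at zero for the NNNTNNN stream is an assumption that reflects a branch already biased toward not-taken.

```python
def count_mispredictions(outcomes: str, bits: int, initial_count: int) -> int:
    """Count wrong predictions of one saturating counter over a stream.

    `outcomes` is a string of 'T' (taken) and 'N' (not-taken).
    """
    top = (1 << bits) - 1          # highest count the counter can hold
    threshold = 1 << (bits - 1)    # counts at or above this predict taken
    count = initial_count
    wrong = 0
    for symbol in outcomes:
        taken = symbol == "T"
        if (count >= threshold) != taken:
            wrong += 1
        # Saturating update: up on taken, down on not-taken.
        count = min(top, count + 1) if taken else max(0, count - 1)
    return wrong


one_bit = count_mispredictions("NNNTNNN", bits=1, initial_count=0)   # 2
two_bit = count_mispredictions("NNNTNNN", bits=2, initial_count=0)   # 1
worst_case = count_mispredictions("TNTNTN", bits=2, initial_count=1) # 6
```

The third call reproduces the pathological case: starting from state 01, the 2-bit counter oscillates between 01 and 10 and mispredicts all six outcomes of the alternating stream.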
Another advantage of using a count becomes apparent when a collision occurs, i.e. more than one branch instruction happens to address the same location in the branch counter array. When this happens, a count tends to result in a “vote” among the branch instructions that map to the same index, and predictions are made according to the way the “majority” of the more recent decisions were made. This helps maintain high prediction accuracy, although not as high as if there were no collisions.
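The voting effect can be illustrated with two branches that share a single 2-bit counter. The scenario is hypothetical: branch A is always taken and executes three times for each execution of branch B, which is never taken, and the shared counter is assumed to start at 01.

```python
# One 2-bit saturating counter shared by two colliding branches.
count = 1                       # assumed initial state 01
wrong = {"A": 0, "B": 0}

# Branch A (always taken) runs three times for each run of
# branch B (never taken); both map to the same counter.
schedule = [("A", True)] * 3 + [("B", False)]

for _ in range(4):
    for name, taken in schedule:
        predicted_taken = count >= 2
        if predicted_taken != taken:
            wrong[name] += 1
        # Saturating update shared by both branches.
        count = min(3, count + 1) if taken else max(0, count - 1)

print(wrong)
```

In this schedule the majority branch A is mispredicted only once, during warm-up, while the minority branch B is mispredicted every time, which is the accuracy cost of the collision.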
Local Predictor
A particular branch instruction will often execute in repetitive patterns during program execution. A loop control branch is a common example of this behavior. A loop control branch with three evaluations followed by an exit will have a branch history of the form TTTNTTTN . . . as the loop is evaluated (“taken”) three times before the loop is exited (“not-taken”). A dynamic branch predictor that tracks the pattern recently executed by a particular branch instruction (e.g. TTTN) can detect recurrences of such patterns and use them to alter its prediction accordingly. Such mechanisms are referred to as “local” predictors, as only information local to each branch is used for prediction. This concept, devised independently by Yeh (Tse-Yu Yeh and Yale N. Patt, A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History, The 20th Annual International Symposium on Computer Architecture, May 16, 1993, at 257) and by Steely (U.S. Pat. No. 5,564,118), has many different implementations.
FIG. 3 is a block diagram of a typical local predictor. As in the bimodal predictor, the local predictor uses an array 30 of saturating 2-bit up/down counters 32, and the prediction on line 34 is simply the most significant bit of a given counter. Unlike the bimodal predictor, however, each counter in the local predictor array is indexed not only by the low order branch instruction address bits on line 36, but also by the pattern of directions on line 38 recently taken by that particular branch. Each branch instruction address can be thought of as an index into a first-level table, and a combination of history and address information acts as an index into a second-level table. The historical path pattern 39 is stored in an array of shifted values that is updated after the branch instruction is executed; the oldest history bit is shifted out and discarded. With the complete index (address and pattern) available, the local predictor can access a particular counter to pick off its most significant bit as the path prediction. Local predictors are significantly more accurate than bimodal predictors, but require more hardware to implement.
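The two-level indexing described above can be modeled as follows. This is a sketch under assumptions, not the circuit of FIG. 3: the class name `LocalPredictor`, the table sizes, the concatenation of address bits and history bits to form the counter index, and the initial counter value are all chosen for illustration.

```python
class LocalPredictor:
    """Two-level predictor: per-branch history registers select among
    2-bit saturating counters."""

    def __init__(self, addr_bits: int = 6, history_bits: int = 4):
        self.addr_mask = (1 << addr_bits) - 1
        self.history_bits = history_bits
        self.history_mask = (1 << history_bits) - 1
        # First level: one shift register of recent outcomes per entry.
        self.histories = [0] * (1 << addr_bits)
        # Second level: counters indexed by address bits plus history.
        self.counters = [1] * (1 << (addr_bits + history_bits))

    def _index(self, branch_addr: int) -> int:
        a = branch_addr & self.addr_mask
        return (a << self.history_bits) | self.histories[a]

    def predict(self, branch_addr: int) -> bool:
        # The prediction is the most significant bit of the counter.
        return self.counters[self._index(branch_addr)] >= 2

    def update(self, branch_addr: int, taken: bool) -> None:
        i = self._index(branch_addr)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the newest outcome in; the oldest bit is discarded.
        a = branch_addr & self.addr_mask
        shifted = (self.histories[a] << 1) | int(taken)
        self.histories[a] = shifted & self.history_mask


predictor = LocalPredictor()
loop_branch = 0x48                       # hypothetical branch address
results = []
for taken in [c == "T" for c in "TTTN" * 30]:
    results.append(predictor.predict(loop_branch) == taken)
    predictor.update(loop_branch, taken)
late_misses = results[-80:].count(False)
```

After a short warm-up, each of the four history patterns that occur in the repeating TTTN stream selects its own counter, so the loop exit is predicted correctly along with the loop iterations.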
Global Predictor
The behavior of different branch instructions in a program is not always independent, but rather can be correlated, as taught by Pan in U.S. Pat. No. 5,553,253. The trail of program execution that has previously led to a particular branch instruction may be “well-worn,” i.e. frequently followed, and therefore likely to be taken again. Predictors that track the recent history of all branch instruction outcomes to detect recurring paths of program execution are known as “global” predictors. FIG. 4 is a block diagram of a typical global predictor, which is similar in construction to a local predictor, except that instead of individual registers for each branch there is only one history register 40 global to all branches. Each counter in the array is indexed by the branch instruction address and by the global history register on line 42.
Global predictors can predict branch behaviors that other predictors cannot. In cases where the same variable is compared to different values at different steps in program execution, global predictors can use the history of initial comparisons to help predict subsequent comparisons. For example, in the following code,
if (x > 1) then y=12;
if (x > 2) then z=3;
if the first branch is “not-taken,” the second branch will always also be “not-taken,” so there is perfect branch correlation. If the first branch is “taken,” there is no conclusive knowledge of the path the second branch will follow, but after some initialization period there is a historical bias that can be used for making a prediction.
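The correlation in this example can be demonstrated with a small model of a global predictor. The sketch rests on several assumptions: the class name `GlobalPredictor`, the table sizes, the concatenation of address bits with the global history register to form the index, the two branch addresses, and the sequence of x values are all hypothetical. Following the convention of this application, a branch is treated as taken when its condition is true.

```python
class GlobalPredictor:
    """2-bit counters indexed by branch address bits combined with a
    single history register shared by all branches."""

    def __init__(self, addr_bits: int = 4, history_bits: int = 2):
        self.addr_mask = (1 << addr_bits) - 1
        self.history_bits = history_bits
        self.history_mask = (1 << history_bits) - 1
        self.history = 0                       # one register for all branches
        self.counters = [1] * (1 << (addr_bits + history_bits))

    def _index(self, branch_addr: int) -> int:
        a = branch_addr & self.addr_mask
        return (a << self.history_bits) | self.history

    def predict(self, branch_addr: int) -> bool:
        return self.counters[self._index(branch_addr)] >= 2

    def update(self, branch_addr: int, taken: bool) -> None:
        i = self._index(branch_addr)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)
        shifted = (self.history << 1) | int(taken)
        self.history = shifted & self.history_mask


BRANCH_1, BRANCH_2 = 0x1, 0x2   # hypothetical addresses of the two tests
predictor = GlobalPredictor()
wrong_after_not_taken = 0
for x in [0, 3, 1, 4, 2, 3] * 20:
    first = x > 1               # outcome of the first comparison
    predictor.update(BRANCH_1, first)
    second = x > 2              # outcome of the second comparison
    guess = predictor.predict(BRANCH_2)
    if not first and guess != second:
        wrong_after_not_taken += 1
    predictor.update(BRANCH_2, second)
```

Whenever the first branch is not-taken, the global history selects counters for the second branch that have only ever observed not-taken outcomes, so the second branch is never mispredicted in that case.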
Gshare Predictor
Often there are only a few paths taken to reach a particular branch. In this case, the branch instruction address and the global history register will be highly correlated and to that extent redundant. That is, if a system knows which branch it is executing, it usually has good evidence of how it got there. McFarling proposed a predictor, called a global shared index or “gshare” predictor, that can take advantage of this situation (Scott McFarling, Combining Branch Predictors, Technical Note TN-36, Digital Equipment Corporation Western Research Laboratory, June 1993). As shown in FIG. 5, a typical gshare predictor uses the branch instruction address on line 50 XORed with the global history register 52 to index the array of counters 54. This hashing allows using more history bits and more address bits with the same number of counters, improving global prediction accuracy.
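The XOR indexing can be sketched as follows. The class name `GsharePredictor`, the table size, the choice to make the history register as wide as the index, the initial counter value, and the example address are assumptions made for illustration.

```python
class GsharePredictor:
    """2-bit counters indexed by the branch address XORed with a
    global history register."""

    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                    # global history, index_bits wide
        self.counters = [1] * (1 << index_bits)

    def _index(self, branch_addr: int) -> int:
        # Address bits and history bits share the same index bits.
        return (branch_addr ^ self.history) & self.mask

    def predict(self, branch_addr: int) -> bool:
        return self.counters[self._index(branch_addr)] >= 2

    def update(self, branch_addr: int, taken: bool) -> None:
        i = self._index(branch_addr)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = GsharePredictor(index_bits=8)
branch = 0x3C                               # hypothetical branch address
results = []
for taken in [c == "T" for c in "TN" * 100]:
    results.append(predictor.predict(branch) == taken)
    predictor.update(branch, taken)
late_misses = results[-100:].count(False)
```

In this run the alternating stream, which defeats a lone 2-bit counter started in state 01, is predicted without error once the history register has filled, because the two alternating history values select different counters.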
Parallel Predictor
The different branch predictors described have different advantages. Global predictors work well when branches are correlated with their neighbors. Otherwise, a bimodal or a local predictor may be better. The bimodal predictor adapts quickly and is small because it retains only a very limited amount of information about each branch. Bimodal predictions are good for branches that are strongly biased one way or another, which is a very commonly occurring pattern in typical programs. Local predictors can be much larger than bimodal predictors because they retain and use much more history of branch behavior. A large local predictor is generally much more accurate than a bimodal predictor since it can detect more complex behavior patterns. However, a small local predictor can actually be worse than a bimodal predictor if there are too many collisions (same address bits used) between branches for entries in the branch history table. Bimodal predictors can suffer similar conflicts, but since counters are small it is relatively easy to minimize collisions by simply increasing the number of counters. The global predictor can detect correlated or dependent branches better than other predictors can, but it may need a very large counter array to handle all possible cases. Program size can also influence predictor performance; bimodal predictors work better with large programs, while local predictors work better with small programs.
Selection of the particular type of predictor with the best accuracy under the circumstances can increase overall prediction accuracy and therefore processor throughput. Each type of branch predictor described in the prior art has distinct advantages corresponding to distinct patterns of branch instruction behavior found in typical programs. Multiple predictors can be combined to match a given predictor to the particular pattern of program behavior to which it is best suited. FIG. 6 is a block diagram of one type of multiple predictor proposed by McFarling (Id. at 11). This predictor has two independent predictors 60, 62 operating in parallel, with an additional array of 2-bit saturating up/down counters 64 to keep track of which predictor is more accurate for the branches that share that counter. This second array of counters switches between the two predictors to select one for a final prediction. Unfortunately, the parallel predictor is plagued by redundant computation and relatively high required memory capacity. These problem areas are not limited to the parallel predictor, but rather illustrate the general arena in which multiple-predictor design can be improved.
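The selection mechanism of such a parallel predictor can be sketched as an array of 2-bit counters that moves toward whichever component predictor was correct when the two disagree. The class name `Chooser`, the table size, and the initial counter value are assumptions, and the two component predictors are replaced here by trivial stand-ins (always taken and always not-taken) purely to keep the example short.

```python
class Chooser:
    """2-bit saturating counters that select between two component
    predictions for the branches sharing each counter."""

    def __init__(self, index_bits: int = 10, initial_count: int = 1):
        self.mask = (1 << index_bits) - 1
        self.counters = [initial_count] * (1 << index_bits)

    def select(self, branch_addr: int, pred_0: bool, pred_1: bool) -> bool:
        # Counts 10 and 11 favor predictor 1; 00 and 01 favor predictor 0.
        use_second = self.counters[branch_addr & self.mask] >= 2
        return pred_1 if use_second else pred_0

    def update(self, branch_addr: int, pred_0: bool, pred_1: bool,
               taken: bool) -> None:
        if pred_0 == pred_1:
            return      # both right or both wrong: no preference learned
        i = branch_addr & self.mask
        if pred_1 == taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


chooser = Chooser(index_bits=4)
branch = 0x7                    # hypothetical branch address
misses = 0
for taken in [False] * 10:      # a branch that is never taken
    p0, p1 = True, False        # stand-in component predictions
    final = chooser.select(branch, p0, p1)
    misses += final != taken
    chooser.update(branch, p0, p1, taken)
```

Both component predictors are consulted for every branch even though only one prediction is used, which is the redundant computation noted above.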
The invention provides an improved accuracy branch prediction mechanism to minimize the time lost to erroneous predictions that necessitate both a purge and a reload of all affected pipelines in a processor. The present invention provides a serial branch predictor that includes a first component predictor operating according to a first algorithm to predict an action, and any number of subsequent component predictors operating according to alternate algorithms to predict the action. The first component predictor and the subsequent component predictors are coupled to each other serially, a predicted action of a preceding component being input to the subsequent component predictor. Such an arrangement provides a better prediction mechanism, since it serially combines multiple component predictors with varying characteristics to overrule the prediction from any prior component predictor if and only if an improvement in prediction accuracy is likely. Each subsequent stage therefore focuses on correction of predictions made by a prior stage. In the preferred embodiment, known as the SerialBLG predictor, the first predictor algorithm is bimodal, a second predictor algorithm is local, and a third predictor algorithm is global. Further, each stage is improved according to various methods.