Performance of pipelined processors is severely limited by the time required to execute conditional branches. A processor normally fetches and executes instructions sequentially; i.e., if an instruction E(i) is fetched from address n, the address of its successor E(i+1), the instruction executed immediately after E(i), is found by adding the length of E(i) to n. An unconditional branch is an instruction whose execution causes a transfer of control to an instruction at a non-sequential address. Thus the successor of a branch B is fetched from an arbitrary target address. In some computers, the target address of a branch instruction B is contained within the instruction, while in others the target is formed by adding an offset contained within B to the address from which B itself was fetched.
A conditional branch instruction conditionally causes a transfer of control, based on testing some piece of data. Along with a specification of a target address, such an instruction contains a condition to be tested. This condition is typically one of a small set of algebraic properties of a number: the number is or is not zero, the number is or is not positive, the number is or is not negative, etc. If the condition is met, the branch is taken; i.e., the successor instruction is fetched from the target address of the branch. If the condition is not met, the successor instruction is the next instruction in sequence, just as for non-branch instructions.
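The successor rules described above can be illustrated with a minimal sketch. The function name and all parameter names below are hypothetical and chosen only for illustration; an unconditional branch is modeled as a branch whose condition is always met.

```python
def successor_address(addr, length, is_branch=False, condition_met=True,
                      target=None, offset=None):
    """Return the address of the instruction executed after the one at `addr`.

    All names here are illustrative assumptions, not taken from any
    particular instruction set.
    """
    sequential = addr + length          # next instruction in sequence
    if not is_branch or not condition_met:
        return sequential               # non-branch, or branch not taken
    if target is not None:
        return target                   # target held in the instruction itself
    return addr + offset                # target relative to the branch's own address


# Non-branch instruction of length 4 fetched from address 100.
print(successor_address(100, 4))
# Taken branch with an absolute target.
print(successor_address(100, 4, is_branch=True, target=200))
# Taken branch with an offset relative to the branch address.
print(successor_address(100, 4, is_branch=True, offset=-20))
# Conditional branch whose condition is not met.
print(successor_address(100, 4, is_branch=True, condition_met=False, target=200))
```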
Pipelined computers pass each instruction through a pipeline consisting of several processing stages, usually at least five. A new instruction can be entered into the pipeline during each clock cycle. As a consequence, a pipelined computer can have several instructions in different stages of execution simultaneously, thus maximizing the utilization of the hardware resources at each stage.
The performance degradation caused by conditional branches in pipelined computers arises when the branch is fetched before the algebraic conditions of the data to be tested have been determined. This phenomenon is worst in those computers in which the branch instruction itself specifies the location of the data to be tested. In such computers the algebraic conditions are evaluated only after several stages of the pipeline have been traversed. Since that evaluation cannot start until the branch instruction is fetched, the conditions to be tested are not known until several clock cycles after the branch is fetched. Since the location of the next instruction to be fetched cannot be determined for certain until the data have been tested, no instructions can be fetched for several clock cycles.
Branch prediction is an attempt to predict, immediately upon fetching a conditional branch, whether or not the branch will be taken, without waiting to determine the outcome of the test. In this way, instructions can continue to be fetched at full rate. If branches are predicted, it becomes necessary to validate the prediction and to recover from an incorrect prediction. If the prediction was incorrect, then all the instructions fetched after the incorrectly-predicted ("bad") branch were fetched in error, and so the effects of their execution must be reversed. Techniques for recording, validating, and repairing predicted branches are not the subject of the present invention.
Since all instructions fetched after a bad branch must be discarded, they represent wasted effort. Therefore the performance of the machine is directly related to the accuracy of branch predictions.
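The relationship between prediction accuracy and performance can be sketched with an idealized cycle count. The model below is an assumption made for illustration only: one cycle per instruction, plus a fixed number of lost cycles for each branch that either stalls or is mispredicted. The instruction count, branch count, and penalty are hypothetical figures.

```python
def total_cycles(instructions, branches, penalized_branches, penalty):
    """Idealized cycle count under the stated assumptions.

    One cycle per instruction, plus `penalty` lost cycles for every branch
    that stalls the fetch stage or is mispredicted. Pipeline fill time and
    all other hazards are ignored.
    """
    assert 0 <= penalized_branches <= branches <= instructions
    return instructions + penalized_branches * penalty


# Hypothetical workload: 1000 instructions, one branch in five, 3 lost cycles each.
no_prediction = total_cycles(1000, 200, 200, 3)   # every branch stalls
predicted_90 = total_cycles(1000, 200, 20, 3)     # 90% of branches predicted correctly
print(no_prediction, predicted_90)                # prints "1600 1060"
```

Under these assumptions the cycle count falls linearly as the number of mispredicted branches falls.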
Branch prediction schemes can be either static or dynamic. In a static scheme, the branch instruction itself contains the prediction; this is typically supplied by the compiler that produced the program, based on an execution of the program on a typical data set. Static prediction is possible only if the instruction set of the computer has been designed with that in mind. Most commercially successful instruction sets do not provide facilities that allow static branch prediction.
Dynamic branch prediction uses information about the branch that is gathered by the hardware during program execution. The hardware can only "know" about past execution patterns of a given branch instruction and so must base its dynamic prediction on such information. Since conditional branches are quite frequent (as dense as one in every five instructions), the amount of history that can be stored for each cannot be very large without requiring a very large memory capacity. Typically branch prediction information is kept on only a small, but varying, subset of the branches in a program.
The correct execution history of a given branch instruction at any point in time during execution of a program can be represented as a sequence of binary symbols 1 and 0. This sequence tells whether the branch instruction was taken (1) or not taken (0). Each time a branch instruction is executed, the history of that branch is extended by adding a 1 or 0 to its end, depending on whether the correct (not necessarily the predicted) execution of the branch was taken or not.
A branch instruction's execution history can be partitioned into runs. A branch run is a maximal sequence of consecutive 0's bounded on each side by a 1 or by an end of the history, or vice versa. That is, each symbol in the history is in exactly one run, and each run consists of all 0's or all 1's. The length of a run is the number of symbols in it.
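The partition of a history into runs can be sketched as follows. The function name `runs` and the representation of a history as a string of "1" and "0" characters are assumptions for illustration; the beginning and end of the history are treated as run boundaries.

```python
from itertools import groupby


def runs(history):
    """Split a taken/not-taken history into (symbol, run_length) pairs."""
    # groupby collects consecutive equal symbols, which is exactly a run.
    return [(symbol, len(list(group))) for symbol, group in groupby(history)]


example = "1110110"
print(runs(example))  # prints [('1', 3), ('0', 1), ('1', 2), ('0', 1)]
```

Every symbol falls in exactly one run, so the run lengths sum to the length of the history.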
Prior-art dynamic branch prediction mechanisms exploit the observation that for many branches in a program, all, or almost all, of the runs of 0's are of length one. These are usually branches that end loops. A loop is typically implemented by placing a conditional branch at the end of the sequence of instructions that constitute the body of the loop. The conditional branch tests the loop-ending condition and, if that condition is false, branches to the first instruction of the loop body. The loop is terminated if that branch is not taken. The next execution of that branch will be the first execution in the next activation of the loop, and it will be taken unless that activation terminates after one traversal. Thus there is a run consisting of a single 0 representing the loop termination. (Some compilers construct loops with a conditional branch at the beginning of the body rather than at the end. Such a loop is terminated by taking the branch. This loop construct gives rise to execution histories with runs consisting of a single 1.)
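The history produced by a loop-closing branch can be sketched as follows. The function name and the `trip_counts` parameter (the number of traversals of the loop body in each activation, each at least one) are hypothetical.

```python
def loop_branch_history(trip_counts):
    """History of a branch placed at the end of a loop body.

    An activation with k traversals executes the branch k times:
    taken (1) for the first k - 1, then not taken (0) to leave the loop.
    """
    history = []
    for k in trip_counts:
        if k < 1:
            raise ValueError("each activation traverses the body at least once")
        history.extend([1] * (k - 1) + [0])
    return "".join(str(bit) for bit in history)


# Three activations with 4, 3, and 1 traversals.
print(loop_branch_history([4, 3, 1]))  # prints 11101100
```

Only the activation that terminates after a single traversal extends a run of 0's beyond length one.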
Prior art branch predictors base each prediction on two bits of stored history for each branch. These bits are the state of a four-state state machine (FIG. 1). The effect of this state machine is to predict that the branch will have the same outcome as the last run of length greater than one. Therefore, in the case of a loop that is always traversed more than once, so that its execution history has no run of two or more 0's, the prediction will be constant.
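A minimal sketch of a four-state predictor with the stated effect follows. The state encoding used here (one bit for the current prediction and one bit for the most recent outcome) and the initial state are assumptions for illustration; the encoding of FIG. 1 may differ while producing the same behavior.

```python
class TwoBitPredictor:
    """Predict the outcome of the last run of length greater than one.

    State is two bits: the current prediction and the most recent outcome.
    This encoding is an illustrative assumption.
    """

    def __init__(self, prediction=1, last=1):
        self.prediction = prediction
        self.last = last

    def predict(self):
        return self.prediction

    def update(self, outcome):
        # Two equal outcomes in a row form a run of length >= 2,
        # so the prediction follows that run.
        if outcome == self.last:
            self.prediction = outcome
        self.last = outcome


def predictions_for(history):
    """Return the prediction made before each branch execution in `history`."""
    predictor = TwoBitPredictor()
    made = []
    for symbol in history:
        made.append(predictor.predict())
        predictor.update(int(symbol))
    return made


print(predictions_for("1110110"))  # prints [1, 1, 1, 1, 1, 1, 1]
print(predictions_for("1100011"))  # prints [1, 1, 1, 1, 0, 0, 0]
```

In the first history every run of 0's has length one, so the prediction never changes. In the second, the prediction switches only after the second consecutive 0 and switches back only after the second consecutive 1.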
The prediction accuracy of this prior-art state machine is directly related to the lengths of the runs of 1's. If the average run length is n, then there is one incorrect prediction for every n correct predictions, i.e., a fraction n/(n+1) of the predictions are correct. Thus the prediction accuracy is worse for shorter runs. The purpose of the invention is to improve the prediction accuracy for short-run-length branches.
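The dependence of accuracy on run length can be checked with a short sketch. It assumes, for simplicity, that every run of 1's has exactly length n, that every run of 0's has length one, and that the predictor constantly predicts taken; the function names are hypothetical.

```python
from fractions import Fraction


def expected_accuracy(n):
    """n correct predictions for every 1 incorrect prediction."""
    return Fraction(n, n + 1)


def simulated_accuracy(n, activations=100):
    """Accuracy of a constant 'taken' prediction on runs of n 1's and a single 0."""
    history = ("1" * n + "0") * activations
    correct = history.count("1")        # every taken branch is predicted correctly
    return Fraction(correct, len(history))


for n in (1, 2, 4, 9):
    print(n, expected_accuracy(n), simulated_accuracy(n))
```

Under these assumptions a run length of 9 gives 9/10 accuracy, while a run length of 1 gives only 1/2.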