1. Field of the Invention
This invention relates in general to the field of microprocessors, and more particularly to a method and apparatus for performing dynamic branch prediction by speculatively updating global branch history information.
2. Description of the Related Art
Computer instructions are typically stored in successive addressable locations within a memory. When processed by a Central Processing Unit (CPU), the instructions are fetched from consecutive memory locations and executed. Each time an instruction is fetched from memory, a program counter within the CPU is incremented so that it contains the address of the next instruction in the sequence. This is the next sequential instruction pointer, or NSIP. Fetching of an instruction, incrementing of the program counter, and execution of the instruction continues linearly through memory until a program control instruction is encountered.
A program control instruction, when executed, changes the address in the program counter and causes the flow of control to be altered. In other words, program control instructions specify conditions for altering the contents of the program counter. The change in the value of the program counter as a result of the execution of a program control instruction causes a break in the sequence of instruction execution. This is an important feature in digital computers, as it provides control over the flow of program execution and a capability for branching to different portions of a program. Examples of program control instructions include Jump, Test and Jump conditionally, Call, and Return.
A Jump instruction causes the CPU to unconditionally change the contents of the program counter to a specific value, i.e., to the target address for the instruction where the program is to continue execution. A Test and Jump conditionally causes the CPU to test the contents of a status register, or possibly compare two values, and either continue sequential execution or jump to a new address, called the target address, based on the outcome of the test or comparison. A Call instruction causes the CPU to unconditionally jump to a new target address, but also saves the value of the program counter to allow the CPU to return to the program location it is leaving. A Return instruction causes the CPU to retrieve the value of the program counter that was saved by the last Call instruction, and return program flow back to the retrieved instruction address.
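The four program control instructions just described can be illustrated with a toy interpreter. This is only a sketch under stated assumptions: the opcode names (`JUMP`, `JUMP_IF`, `CALL`, `RETURN`, `SET_FLAG`, `NOP`, `HALT`), the single status flag, and the `run` helper are hypothetical and do not correspond to any particular instruction set.

```python
def run(program, max_steps=100):
    """Execute a toy program and return the addresses visited, in order."""
    pc = 0            # program counter: address of the instruction to execute
    flag = False      # status bit tested by the conditional jump
    call_stack = []   # return addresses saved by CALL
    trace = []
    while pc < len(program) and len(trace) < max_steps:
        op, arg = program[pc]
        trace.append(pc)
        nsip = pc + 1  # next sequential instruction pointer
        if op == "JUMP":          # unconditional: always load the target
            pc = arg
        elif op == "JUMP_IF":     # conditional: target if flag set, else NSIP
            pc = arg if flag else nsip
        elif op == "CALL":        # save the return address, then jump
            call_stack.append(nsip)
            pc = arg
        elif op == "RETURN":      # resume at the address saved by the last CALL
            pc = call_stack.pop()
        elif op == "SET_FLAG":
            flag = arg
            pc = nsip
        elif op == "HALT":
            break
        else:                     # NOP: plain sequential execution
            pc = nsip
    return trace


program = [
    ("SET_FLAG", True),  # 0
    ("JUMP_IF", 3),      # 1: taken, because the flag is set
    ("NOP", None),       # 2: skipped
    ("CALL", 6),         # 3: saves return address 4
    ("HALT", None),      # 4
    ("NOP", None),       # 5: never reached
    ("NOP", None),       # 6: subroutine body
    ("RETURN", None),    # 7: returns to address 4
]
trace = run(program)
print(trace)  # [0, 1, 3, 6, 7, 4]
```

Every break in the otherwise increasing address sequence corresponds to a program control instruction changing the program counter.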
In early microprocessors, execution of program control instructions did not impose significant processing delays because such microprocessors were designed to execute only one instruction at a time. If the instruction being executed was a program control instruction, by the end of execution the microprocessor would know whether it should branch, and if it was supposed to branch, it would know the target address of the branch. Thus, whether the next instruction was sequential, or the result of a branch, it would be fetched and executed.
Modern microprocessors are not so simple. Rather, it is common for modern microprocessors to operate on several instructions at the same time, within different blocks or pipeline stages of the microprocessor. Hennessy and Patterson define pipelining as "an implementation technique whereby multiple instructions are overlapped in execution." Computer Architecture: A Quantitative Approach, 2nd edition, by John L. Hennessy and David A. Patterson, Morgan Kaufmann Publishers, San Francisco, Calif., 1996. The authors go on to provide the following excellent illustration of pipelining:
A pipeline is like an assembly line. In an automobile assembly line, there are many steps, each contributing something to the construction of the car. Each step operates in parallel with the other steps, though on a different car. In a computer pipeline, each step in the pipeline completes a part of an instruction. Like the assembly line, different steps are completing different parts of the different instructions in parallel. Each of these steps is called a pipe stage or a pipe segment. The stages are connected one to the next to form a pipe--instructions enter at one end, progress through the stages, and exit at the other end, just as cars would in an assembly line.
Thus, as instructions are fetched, they are introduced into one end of the pipeline. They proceed through pipeline stages within a microprocessor until they complete execution. In such pipelined microprocessors it is often not known whether a branch instruction will alter program flow until it reaches a late stage in the pipeline. But, by this time, the microprocessor has already fetched other instructions and is executing them in earlier stages of the pipeline. If a branch causes a change in program flow, all of the instructions in the pipeline that followed the branch must be thrown out. In addition, the instruction specified by the target address of the branch instruction must be fetched. Throwing out the intermediate instructions, and fetching the instruction at the target address creates processing delays in such microprocessors.
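The cost of discarding instructions can be estimated with a simple model. The sketch below assumes an idealized pipeline that otherwise completes one instruction per cycle, and assumes that a mispredicted branch resolved in stage `resolve_stage` (counting the fetch stage as 1) discards the `resolve_stage - 1` younger instructions behind it. The function name and the numbers are illustrative only.

```python
def estimated_cycles(num_instructions, num_mispredicted, resolve_stage):
    """Rough cycle count for an idealized one-instruction-per-cycle pipeline.

    Assumption: each mispredicted branch wastes one cycle for every stage
    that precedes the stage in which the branch is resolved.
    """
    flush_penalty = resolve_stage - 1
    return num_instructions + num_mispredicted * flush_penalty


# Same program, same number of wrong-path branches, two pipeline depths.
shallow = estimated_cycles(1000, 20, resolve_stage=5)
deep = estimated_cycles(1000, 20, resolve_stage=12)
print(shallow, deep)  # 1080 1220
```

Under these assumptions the delay grows linearly with the distance between the fetch stage and the stage that resolves the branch.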
To alleviate this delay problem, many pipelined microprocessors use branch prediction mechanisms in an early stage of the pipeline that predict the outcome of branch instructions, and then fetch subsequent instructions according to the branch prediction. Branch prediction schemes are commonly classified as either static or dynamic branch prediction schemes.
With a static branch predictor, the prediction remains the same for a given branch instruction throughout the entire execution of the program in which the branch instruction is contained. That is, if the static branch predictor predicts a given branch will be taken the first time the branch instruction is executed, the static branch predictor will predict the branch will be taken every time the branch instruction is executed throughout the execution of the program. Thus, the prediction made by a static branch predictor does not depend upon the dynamic behavior of the branch instruction.
In contrast, dynamic branch predictors keep a history of the outcome of branch instructions as a program executes and make predictions based upon the history. Dynamic branch predictors are effective because of the repetitive outcome patterns that branch instructions exhibit. Dynamic branch prediction implies that the prediction will change if the branch changes its behavior while the program is running. One time the branch instruction executes, the dynamic branch predictor may predict the branch will be taken. But the next time the branch instruction executes, the dynamic branch predictor may predict the branch will not be taken, particularly if the branch was not taken the previous time.
Various dynamic branch prediction schemes have been proposed. See "A System Level Perspective on Branch Architecture Performance", by Brad Calder, Dirk Grunwald and Joel Emer, from Proceedings of MICRO-28, Nov. 29-Dec. 1, 1995 at Ann Arbor, Mich. Also see "Alternative Implementations of Two-Level Adaptive Branch Prediction", by Tse-Yu Yeh and Yale N. Patt, from Proceedings of the 19th Annual Symposium on Computer Architecture, ACM, New York, N.Y., 1992, incorporated by reference herein.
Perhaps the simplest dynamic branch prediction scheme is a simple array of one-bit storage elements, commonly referred to as a branch history table (BHT). The address of the branch instruction (or some portion thereof) whose outcome is being predicted is used to index into the BHT. The bit output by the BHT indicates the outcome of the last execution of the branch instruction (i.e., taken or not taken) and is used to predict the outcome of the current execution of the branch instruction. Each time the branch is executed, the BHT is updated with the outcome.
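A one-bit BHT of this kind can be sketched in a few lines. The table size, the choice of the low-order address bits as the index, the initial "not taken" state, and the names `OneBitBHT` and `count_mispredictions` are assumptions made for illustration.

```python
class OneBitBHT:
    """Branch history table holding one bit per entry: the last outcome."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.table = [False] * (1 << index_bits)  # False = not taken

    def predict(self, branch_addr):
        # Index with a portion of the branch address; predict the last outcome.
        return self.table[branch_addr & self.mask]

    def update(self, branch_addr, taken):
        # Record the resolved outcome for the next prediction.
        self.table[branch_addr & self.mask] = taken


def count_mispredictions(predictor, branch_addr, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(branch_addr) != taken:
            misses += 1
        predictor.update(branch_addr, taken)
    return misses


# A loop-closing branch: taken three times, then not taken; loop run twice.
outcomes = [True, True, True, False] * 2
misses = count_mispredictions(OneBitBHT(), 0x40, outcomes)
print(misses)  # 4
```

With a single bit of history, each pass through the loop costs two mispredictions in this example: one on loop entry and one on loop exit.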
Improvements upon the simple scheme described have also been made. For example, a BHT may have more than one bit of history. Two-bit up-down saturating counters have been used as the contents of a BHT. Another improvement is commonly referred to as a Branch Target Buffer (BTB), used by the Pentium® processor for example. A BTB stores the last target address of the previously executed branch instruction that aliases into the same BTB location. Some BTBs store more than one target address if more than one branch instruction aliased to the same BTB location. Storing the target address eliminates the time required to calculate the target address of the branch instruction.
To the extent that the schemes described above use the branch instruction address to index into the BHT or BTB, they are gathering a history of that particular branch instruction, as opposed to other branch instruction outcomes. This history is commonly referred to as a local branch history since it only records the history of a particular branch instruction after it is executed.
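The two-bit up-down saturating counter can be sketched as follows. The counter encoding (0 and 1 predict not taken, 2 and 3 predict taken), the initial value of 1, the table size, and the class name `TwoBitBHT` are assumptions for illustration.

```python
class TwoBitBHT:
    """Branch history table of 2-bit up-down saturating counters."""

    def __init__(self, index_bits=4, initial=1):
        self.mask = (1 << index_bits) - 1
        # Assumed encoding: 0, 1 predict not taken; 2, 3 predict taken.
        self.counters = [initial] * (1 << index_bits)

    def predict(self, branch_addr):
        return self.counters[branch_addr & self.mask] >= 2

    def update(self, branch_addr, taken):
        i = branch_addr & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0


# A loop-closing branch: taken three times, then not taken; loop run twice.
bht = TwoBitBHT()
misses = 0
for taken in [True, True, True, False] * 2:
    if bht.predict(0x40) != taken:
        misses += 1
    bht.update(0x40, taken)
print(misses)  # 3
```

Because a single not-taken outcome only moves the counter from 3 to 2, the branch is still predicted taken when the loop is re-entered, which removes one misprediction per subsequent pass in this example.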
It has been observed that the behavior of a given branch instruction in a program is often correlated with the behavior of other branch instructions in the program. Therefore, when making a prediction about a given branch instruction, some branch predictors use not only the history of that particular branch instruction, but additionally use the behavior of other branch instructions to make a prediction about the current branch instruction. The history of the behavior of other branch instructions is commonly referred to as a global branch history, or global history.
A branch predictor that uses global branch history is commonly referred to as a correlating predictor, or two-level predictor. See Computer Architecture: A Quantitative Approach, p. 267. Various means of employing global branch history have also been described in the aforementioned references. For example, the PowerPC® 604 microprocessor employs a 2-bit up-down saturating counter BTB that is indexed by the exclusive OR of the lower n address bits of the branch instruction and a branch history pattern (BHP). The BHP is an n-bit shift register that stores the outcome of the last n branches.
However, in a pipelined microprocessor a situation may arise where a conditional branch instruction reaches an early stage in the pipeline where a prediction of its outcome needs to be made, but the outcome of a previous conditional branch instruction further down the pipeline has not yet been resolved. This situation may occur if the two branch instructions are relatively close together in the instruction stream. In this situation, the microprocessor designer is faced with a dilemma: stall the pipeline until the first branch is resolved, or make the second prediction with a stale global branch history that does not yet reflect the first branch.
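The indexing scheme described above, the exclusive OR of n address bits with an n-bit history shift register, can be sketched as follows. This illustrates the indexing idea only and is not a model of any particular processor; the use of the lowest n address bits without regard to instruction alignment, the value n = 4, the initial counter value, and the class name are assumptions.

```python
class GlobalHistoryPredictor:
    """2-bit counters indexed by (low address bits) XOR (global history)."""

    def __init__(self, n=4):
        self.mask = (1 << n) - 1
        self.bhp = 0                      # n-bit shift register of outcomes
        self.counters = [1] * (1 << n)    # assumed start: weakly not taken

    def _index(self, branch_addr):
        return (branch_addr ^ self.bhp) & self.mask

    def predict(self, branch_addr):
        return self.counters[self._index(branch_addr)] >= 2

    def resolve(self, branch_addr, taken):
        # Train the counter that was used, then shift the outcome into the BHP.
        i = self._index(branch_addr)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.bhp = ((self.bhp << 1) | int(taken)) & self.mask


# A branch that strictly alternates between taken and not taken.
predictor = GlobalHistoryPredictor()
missed = []
for taken in [True, False] * 10:
    missed.append(predictor.predict(0x8) != taken)
    predictor.resolve(0x8, taken)
print(sum(missed), sum(missed[5:]))  # 3 0
```

In this example the alternating pattern is learned after a short warm-up, because the history value differs before a taken and before a not-taken outcome and therefore selects different counters.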
If the designer chooses the former solution, the processor throughput is obviously diminished due to the wasted clock cycles while the second branch instruction waits for the resolution of the first. If the designer chooses the latter solution, the behavior of the first branch instruction will not be included in the history used to predict the behavior of the second branch instruction.
In the latter case, the accuracy of the second branch prediction will likely be worse than it would have been with the updated information of the resolution of the first branch. The accuracy may be adversely affected particularly if there is a dependency between the first and second branch instructions. In fact, if there is a dependency between the next series of branch instructions, the absence of the first branch's outcome from the history may adversely affect the prediction accuracy of the next several branches. As previously discussed, a decrease in branch prediction accuracy may adversely impact processor throughput.
These problems are exacerbated as microprocessor technology grows toward deeper pipelines. In the first instance, the probability that a first branch will not have been resolved before a second branch must be predicted increases as the number of stages between branch prediction and branch outcome resolution increases. In the second instance, the number of wasted clock cycles increases as the number of stages between branch prediction and branch outcome resolution increases.
Therefore, what is needed is an apparatus and method that predicts the outcome of a current branch instruction with the benefit of the prediction of the previous branch instruction if the actual outcome of the previous branch instruction has not yet been determined.
To address the above-detailed deficiencies, it is an object of the present invention to provide a method, apparatus, and microprocessor for more accurately predicting the outcomes of conditional branch instructions that are close in proximity.
Accordingly, in the attainment of the aforementioned object, it is a feature of the present invention to provide an apparatus for predicting the outcome of branch instructions in a microprocessor. The apparatus includes a storage element configured to store a global history of previous branch instruction outcomes of a plurality of branch instructions and branch control coupled to the storage element. The branch control includes a prediction output for making a prediction of an outcome of a branch instruction based on the global history stored in the storage element. Prior to resolution of the outcome of the branch instruction, the branch control updates the global history in the storage element in response to the prediction.
An advantage of the present invention is that a more accurate prediction is made because the outcome of an earlier, close branch instruction is included in the prediction of the later branch instruction. Another advantage of the present invention is that the more accurate prediction is made without stalling the pipeline to wait for the earlier branch instruction to be resolved.
In another aspect, it is a feature of the present invention to provide a microprocessor capable of performing branch prediction. The microprocessor includes instruction fetch logic that fetches a branch instruction and execution logic coupled to the instruction fetch logic. The execution logic resolves an outcome of the branch instruction. The microprocessor also includes a branch predictor coupled to the instruction fetch logic and the execution logic. The branch predictor includes a storage element that stores a global history of previous branch instruction outcomes of a plurality of branch instructions. The branch predictor also includes branch control coupled to the storage element. The branch control includes a prediction output for making a prediction of an outcome of the branch instruction based on the global history stored in the storage element. The instruction fetch logic fetches a next instruction in response to the prediction. Prior to resolution of the outcome of the branch instruction by the execution logic, the branch control updates the global history in the storage element in response to the prediction.
In yet another aspect, it is a feature of the present invention to provide a method for performing branch prediction in a microprocessor. The method includes predicting an outcome of a branch instruction based on a global history of previous branch instruction outcomes of a plurality of branch instructions stored in a storage element and updating the global history in the storage element in response to predicting the outcome of the branch instruction prior to resolution of the outcome of the branch instruction.
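The method can be sketched as follows: the predicted outcome is shifted into the global history as soon as the prediction is made, so that a closely following branch is predicted with a history that already includes it. This is a minimal sketch and not the claimed implementation. The exclusive-OR indexing, the 2-bit counters, the initial counter value, and all names are assumptions. The recovery step on a misprediction (restoring the history saved at prediction time and discarding younger in-flight branches) is also an assumption, not something this section describes.

```python
class SpeculativeGlobalPredictor:
    """Global-history predictor that updates its history at prediction time."""

    def __init__(self, n=4):
        self.mask = (1 << n) - 1
        self.history = 0                  # global history shift register
        self.counters = [2] * (1 << n)    # assumed start: weakly taken
        self.in_flight = []               # (index, predicted, history_before)

    def predict(self, branch_addr):
        index = (branch_addr ^ self.history) & self.mask
        predicted = self.counters[index] >= 2
        self.in_flight.append((index, predicted, self.history))
        # Speculative update: shift in the prediction before resolution.
        self.history = ((self.history << 1) | int(predicted)) & self.mask
        return predicted

    def resolve(self, taken):
        """Resolve the oldest in-flight branch; return True if predicted right."""
        index, predicted, history_before = self.in_flight.pop(0)
        if taken:
            self.counters[index] = min(3, self.counters[index] + 1)
        else:
            self.counters[index] = max(0, self.counters[index] - 1)
        if predicted != taken:
            # Assumed recovery policy: rebuild the history from the actual
            # outcome and drop younger branches, which would be flushed.
            self.history = ((history_before << 1) | int(taken)) & self.mask
            self.in_flight.clear()
        return predicted == taken


p = SpeculativeGlobalPredictor()
first = p.predict(0x10)    # first branch predicted; not yet resolved
second = p.predict(0x14)   # second branch predicted right behind it
history_seen_by_second = p.in_flight[1][2]
first_ok = p.resolve(True)     # first branch resolves as predicted
second_ok = p.resolve(False)   # second branch was mispredicted
final_history = p.history
print(bin(history_seen_by_second), bin(final_history))  # 0b1 0b10
```

In this example the second prediction is made with a history of `0b1`, which already contains the predicted outcome of the first branch, and neither prediction waits for a resolution.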