Processors, including central processing units (CPUs) and graphics processing units (GPUs), are utilized in various applications. A standard configuration is to couple a processor with a storage unit, such as a cache, a system memory, or the like. Processors may execute a fetch operation to fetch instructions from the storage unit as needed. A processor pipeline includes several stages for processing instructions. In one implementation, a four-stage pipeline may be used that includes a fetch stage, a decode stage, an execution stage, and a write-back stage. Instructions progress through the pipeline stages in order.
To speed up the operation of the processor, it is desirable to have a full pipeline. One way of filling the pipeline is to fetch subsequent instructions while previous instructions are being processed. To be able to fetch ahead several instructions, a branch predictor may be used. A branch predictor predicts the direction of a branch instruction (i.e., taken or not-taken) and the branch target address before the branch instruction reaches the execution stage in the pipeline.
Fetching instructions along the predicted path ahead of time is known as “pre-fetching” the instructions, and executing them before the branch is resolved is known as “speculatively executing” them. An instruction is speculatively executed because it is not known whether the prediction is correct until the branch instruction reaches the execution stage. Although pre-fetching and speculatively executing instructions without knowing the actual direction of the branch instruction may speed up instruction processing, it may have the opposite effect and stall the pipeline if the branch direction is mispredicted. If a branch misprediction occurs, the pipeline must be flushed, and the instructions from the correct branch direction must be fetched and executed instead. This may severely impact the performance of the system.
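By way of illustration only, the cost of mispredictions can be estimated with a common first-order model in which every mispredicted branch adds a fixed flush penalty. The function name, the parameter names, and all of the numbers below are assumptions chosen for the example and do not describe any particular processor.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction once misprediction flushes are included.

    First-order model: each mispredicted branch costs a fixed number of
    cycles to flush the pipeline and refetch from the correct path.
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


# Hypothetical workload: 20% of instructions are branches, 15-cycle flush.
accurate = effective_cpi(1.0, 0.20, 0.02, 15)    # 2% of branches mispredicted
inaccurate = effective_cpi(1.0, 0.20, 0.10, 15)  # 10% of branches mispredicted
print(f"CPI at 2% mispredicts: {accurate:.2f}, at 10%: {inaccurate:.2f}")
```

Under these assumed numbers, raising the misprediction rate from 2% to 10% moves the estimate from roughly 1.06 to roughly 1.30 cycles per instruction.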
Several different types of branch predictors have been used. A bimodal predictor makes a prediction based on recent history of a particular branch's execution, and provides a prediction of taken or not-taken. A global predictor makes a prediction based upon recent history of all the branches' execution, not just the particular branch of interest. A two-level adaptive predictor with a globally shared history buffer, a pattern history table, and an additional local saturating counter may also be used, such that the outputs of the local predictor and the global predictor are XORed with each other to provide a final prediction. More than one prediction mechanism may be used simultaneously, and a final prediction is made based either on a meta-predictor that remembers which of the predictors has made the best predictions in the past, or a majority vote function based on an odd number of different predictors.
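As a minimal sketch of the majority vote function mentioned above, the following combines the outputs of an odd number of predictors. The function name and the representation of a prediction as a Boolean (True for taken) are assumptions made for the example.

```python
def majority_vote(predictions):
    """Return True (taken) if more than half of the predictors say taken.

    `predictions` is a list of booleans, one per predictor. An odd count
    is required so that a tie can never occur.
    """
    if len(predictions) % 2 == 0:
        raise ValueError("an odd number of predictors is needed to avoid ties")
    return sum(predictions) > len(predictions) // 2


# Three hypothetical predictors: two say taken, one says not-taken.
final_prediction = majority_vote([True, False, True])
```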
FIG. 1 is a block diagram of an existing Level 1 branch predictor 100. The branch predictor 100 includes a first predictor (P1) 102, a second predictor (P2) 104, a multiplexer (mux) 106, and a chooser 108. The program counter 110 (which is the address of the branch being predicted) and other inputs 112 are evaluated by both the first predictor 102 and the second predictor 104, and each makes its own prediction.
The program counter 110 is also supplied as an input to the chooser 108, which uses the program counter 110 to determine which predictor (either the first predictor 102 or the second predictor 104) is more accurate. The chooser 108 makes a prediction choice 114, which is supplied as the selector to the multiplexer 106. The output of the selected predictor is used as the prediction 116 of the branch predictor 100.
FIG. 2 is a block diagram of another existing Level 1 branch predictor 200. In one implementation, the Level 1 predictor 200 may be a McFarling hybrid predictor. The branch predictor 200 is similar in construction to the branch predictor 100, but with a different implementation for some of the components. The branch predictor 200 includes a first predictor 202 (implemented as an array of bimodal counters), a second predictor 204 (implemented as an array of bimodal counters), a multiplexer (mux) 206, and a bimodal chooser 208. Each predictor 202, 204 makes its own prediction. The second predictor 204 includes an XOR unit 210 and an array of bimodal counters 212.
The program counter 220 (e.g., the branch address) is supplied as an input to the first predictor 202, the second predictor 204, and the chooser 208. The first predictor 202 bases its prediction on a saturating bimodal two-bit counter, indexed by the low-order address bits of the program counter 220.
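The first predictor described above can be sketched as a table of two-bit saturating counters indexed by the low-order bits of the program counter. The class name, the table size, the initial counter value, and the encoding (0-1 predict not-taken, 2-3 predict taken) are assumptions for illustration, not details taken from FIG. 2.

```python
class BimodalPredictor:
    """Table of two-bit saturating counters indexed by low-order PC bits."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        # Assumed starting state: 1 = weakly not-taken.
        self.counters = [1] * (1 << index_bits)

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not-taken.
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        # Move toward 3 on taken and toward 0 on not-taken, saturating at both ends.
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = BimodalPredictor(index_bits=4)
predictor.update(0x40, taken=True)
print(predictor.predict(0x40))
```

Because the counter saturates, a single contrary outcome after a long run in one direction does not immediately flip the prediction.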
The global history 222 keeps a history of the direction taken by the most recent N branches (indexed by the branch address), and is supplied as an input to the second predictor 204. The XOR unit 210 performs an exclusive OR operation on the program counter 220 and the global history 222, which produces a hash used as an index into the array 212.
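A minimal sketch of the index computation performed by the XOR unit is shown below. It assumes, for illustration, that the global history is a single N-bit shift register with the most recent outcome in the least significant bit and that the result is truncated to the table's index width. The names GlobalHistory and xor_index are hypothetical.

```python
class GlobalHistory:
    """N-bit shift register of recent branch outcomes (1 = taken)."""

    def __init__(self, length):
        self.length = length
        self.bits = 0

    def push(self, taken):
        # Shift the newest outcome into the least significant bit and
        # drop the oldest outcome off the top.
        self.bits = ((self.bits << 1) | int(taken)) & ((1 << self.length) - 1)


def xor_index(pc, history_bits, index_bits):
    """Hash the program counter with the global history into a table index."""
    return (pc ^ history_bits) & ((1 << index_bits) - 1)


history = GlobalHistory(length=4)
for outcome in (True, False, True, True):
    history.push(outcome)
index = xor_index(0b1100, history.bits, index_bits=4)
```

With this hash, the same branch address maps to different counters depending on the path taken to reach it, which is what lets the second predictor capture correlation between branches.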
The chooser 208 uses the program counter 220 to look up in a table which predictor (either the first predictor 202 or the second predictor 204) is more accurate. The chooser 208 makes a prediction choice 224, which is supplied as the selector to the multiplexer 206. The output of the selected predictor is used as the Level 1 prediction 226 of the branch predictor 200.
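The chooser and multiplexer can be sketched as follows. The text above does not state how the chooser table is trained, so the update rule shown here (adjust the counter only when the two predictors disagree, moving toward whichever was correct) is an assumption based on common hybrid-predictor practice, as are the class name and the counter encoding.

```python
class Chooser:
    """Table of two-bit counters: 0-1 select the first predictor, 2-3 the second."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)  # assumed start: weakly favor P1

    def select(self, pc, p1_prediction, p2_prediction):
        # Acts as the chooser plus multiplexer: pick one predictor's output.
        use_p2 = self.counters[pc & self.mask] >= 2
        return p2_prediction if use_p2 else p1_prediction

    def update(self, pc, p1_prediction, p2_prediction, taken):
        # Assumed training rule: learn only when the predictors disagree.
        if p1_prediction == p2_prediction:
            return
        i = pc & self.mask
        if p2_prediction == taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


chooser = Chooser(index_bits=4)
first_choice = chooser.select(5, p1_prediction=True, p2_prediction=False)
chooser.update(5, p1_prediction=True, p2_prediction=False, taken=False)
second_choice = chooser.select(5, p1_prediction=True, p2_prediction=False)
```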
FIG. 3 is a block diagram of an existing Level 2 branch predictor known as a hashed perceptron 300. The hashed perceptron 300 includes a bias weight array 302, a plurality of weight arrays 304₁, 304₂, . . . , 304ₙ, and an adder 306. The program counter 310 is supplied as an input to the bias weight array 302 and the weight arrays 304₁-304ₙ.
The bias weight array 302 is an array of weights, where each weight is a number of bits (e.g., four or eight). The bias weight array 302 is indexed into using the program counter 310 or a hash of the program counter 310 to obtain a weight value that is supplied to the adder 306.
Each weight array 304₁-304ₙ is indexed by a hash of the program counter 310 and different bits of the global history 312 to obtain a weight value. Each weight array 304₁-304ₙ includes an XOR unit 314 to produce the hash by performing an exclusive OR operation on the program counter 310 and the corresponding portion of the global history 312. The global history is a list of the past outcomes (taken or not taken) of all branches, not including the current branch. The least significant bits of the global history contain information about the most recent branches encountered, while the most significant bits of the global history contain information about older branches encountered.
The adder 306 adds the weights obtained from the bias weight array 302 and each of the weight arrays 304₁-304ₙ to obtain a sum value, and the most significant bit (MSB) of the sum value is the prediction 316. For example, if the MSB of the sum value is “1,” then the prediction is “not taken,” and if the MSB of the sum value is “0,” then the prediction is “taken.”
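The lookup and summation described for FIG. 3 can be sketched as below. The sketch assumes, for illustration only, that each weight array hashes the program counter with its own contiguous slice of the global history and that weights are signed integers, so that a non-negative sum corresponds to an MSB of “0” in two's complement. The function name, the slice width, and the table sizes are hypothetical.

```python
def hashed_perceptron_predict(pc, history, bias_weights, weight_arrays, slice_bits):
    """Return (taken, total) for one branch.

    bias_weights  : list of signed weights indexed by the program counter
    weight_arrays : list of weight tables, one per global-history slice
    slice_bits    : assumed width of the history slice used by each table
    """
    total = bias_weights[pc % len(bias_weights)]
    for k, weights in enumerate(weight_arrays):
        # Table k uses history bits [k*slice_bits, (k+1)*slice_bits).
        segment = (history >> (k * slice_bits)) & ((1 << slice_bits) - 1)
        total += weights[(pc ^ segment) % len(weights)]
    # A non-negative two's complement sum has MSB 0, which means "taken".
    return total >= 0, total


bias = [0] * 8
arrays = [[0] * 8, [0] * 8]
bias[3] = 2
arrays[0][1] = -5   # reached via pc 3 XOR history slice 0b10
arrays[1][2] = 1    # reached via pc 3 XOR history slice 0b01
taken, total = hashed_perceptron_predict(3, 0b0110, bias, arrays, slice_bits=2)
```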
It is noted that in one implementation of the hashed perceptron 300, all of the weight values are sign-extended prior to the addition, to prevent an overflow of the adder 306, which could result in an incorrect prediction. A hash function is used to generate the index into the bias weight array 302 and each of the weight arrays 304₁-304ₙ because the program counter 310 and the global history 312 can each contain a large number of bits; hashing reduces them to a small index (in terms of the number of bits that make up the index).
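The effect of sign extension on the adder can be illustrated with a small worked example. The four-bit weight width and eight-bit adder width are assumptions chosen to keep the arithmetic easy to follow, and the helper names are hypothetical.

```python
def sign_extend(value, from_bits, to_bits):
    """Widen a two's complement value, copying its sign bit into the new upper bits."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        # Negative value: fill bits from_bits..to_bits-1 with ones.
        value |= ((1 << to_bits) - 1) ^ ((1 << from_bits) - 1)
    return value


def msb_of_sum(weights, width):
    """Add raw bit patterns in a fixed-width adder and return the MSB of the result."""
    total = sum(weights) & ((1 << width) - 1)
    return total >> (width - 1)


# Two positive 4-bit weights (+7 and +7) overflow a 4-bit adder, so the MSB
# wrongly reads 1 ("not taken") even though the true sum, +14, is positive.
narrow_msb = msb_of_sum([0b0111, 0b0111], width=4)

# Sign-extending both weights to 8 bits first gives the correct MSB of 0.
wide_msb = msb_of_sum([sign_extend(0b0111, 4, 8)] * 2, width=8)
```

The same widening step keeps negative weights negative: a four-bit -3 (0b1101) becomes 0xFD in eight bits rather than the positive value 0x0D.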
Branch predictors are typically large and complex structures. As a result, they consume a large amount of power and incur a latency penalty for predicting branches. It is desirable to have better branch prediction, because prediction accuracy directly affects both the performance and the power efficiency of the processor. One challenge is how to improve branch prediction accuracy while not significantly changing the existing fast Level 1 predictor.