Processing units are used in a multitude of applications. A common configuration couples a processor with a storage unit, such as a cache, a system memory, or the like. A processor may execute fetch operations to retrieve instructions from the storage unit as needed.
To speed up a processor performing a fetch operation, a branch predictor may be used. A branch predictor predicts the direction of a branch instruction (i.e., taken or not-taken) and the branch target address before the branch instruction reaches the execution stage of the pipeline.
This is known as "pre-fetching." Although pre-fetching and speculatively executing instructions without knowing the actual direction of a branch may speed up instruction processing, it may have the opposite effect and stall the pipeline if the branch direction is mis-predicted. When a branch mis-prediction occurs, the pipeline must be flushed and the instructions re-executed, which may severely impact the performance of the system.
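The interaction described above between the fetch stage, the predicted target, and the mis-prediction redirect can be sketched as follows. This is a minimal, hypothetical illustration, not an implementation from the text: the branch target buffer (`btb`), the fixed 4-byte instruction size, and the function names are all assumptions made for the example.

```python
def next_fetch_address(pc, btb, predict_taken):
    """Fetch stage: choose the next fetch address from a direction
    prediction and a branch target buffer (BTB) lookup."""
    if predict_taken(pc) and pc in btb:
        return btb[pc]   # speculatively follow the predicted target
    return pc + 4        # fall through (4-byte instructions assumed)

def resolve(pc, actual_taken, actual_target, btb, predicted_next):
    """Execute stage: detect a mis-prediction and compute the redirect
    address; a mis-prediction would force a pipeline flush."""
    correct_next = actual_target if actual_taken else pc + 4
    mispredicted = predicted_next != correct_next
    if actual_taken:
        btb[pc] = actual_target  # learn or refresh the branch target
    return mispredicted, correct_next
```

On a BTB miss the fetch stage simply falls through; once the branch resolves taken, the target is installed so the next encounter of the same branch can be redirected speculatively.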
Several different types of branch predictors have been used. A bimodal predictor makes a prediction of taken or not-taken based on the recent execution history of a particular branch. A global predictor makes a prediction based on the recent execution history of all branches, not just the particular branch of interest. A two-level adaptive predictor with a globally shared history buffer, a pattern history table, and an additional local saturating counter may also be used, in which the outputs of the local and global predictors are XORed with each other to provide a final prediction. More than one prediction mechanism may be used simultaneously, with a final prediction made either by a meta-predictor that remembers which of the predictors has made the best predictions in the past, or by a majority vote function over an odd number of different predictors. However, branch predictors are typically large and complex. As a result, they consume considerable power and incur a latency penalty for predicting branches.
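Two of the predictor styles above can be sketched briefly. This is a hypothetical illustration under assumed table sizes and names: the bimodal predictor uses per-branch 2-bit saturating counters, and the global predictor shown is a common gshare-style variant that indexes a shared pattern history table by XORing the branch address with a global history register (one of several ways the XOR-combining scheme described above can be realized).

```python
class BimodalPredictor:
    """Per-branch 2-bit saturating counters: 0-1 predict not-taken,
    2-3 predict taken; counters saturate at 0 and 3."""
    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [2] * entries  # initialized weakly taken

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

class GsharePredictor:
    """Global predictor: PC XOR global-history indexes a shared table
    of 2-bit saturating counters; history shifts in each outcome."""
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.history = 0
        self.table = [2] * (1 << bits)

    def predict(self, pc):
        return self.table[(pc ^ self.history) & self.mask] >= 2

    def update(self, pc, taken):
        i = (pc ^ self.history) & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

The 2-bit counters give the hysteresis mentioned above: a single stray outcome moves a strongly biased branch only to the weak state, so the prediction does not flip on one mis-step. The sizes of these tables also hint at why such predictors consume area and power.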