To improve performance, some processors use branch prediction. For example, when a computer processor encounters an instruction with a conditional branch, branch prediction may be used to predict whether the conditional branch will be taken and to cause retrieval of the predicted instruction rather than waiting for the current instruction to be resolved. To improve branch prediction accuracy, branch predictors often use large prediction tables and complex algorithms, which can make the branch prediction latency longer than one clock cycle. Adding pipeline stages to accommodate the longer latency, however, reduces the performance gain provided by the branch predictor.
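As an illustration of table-based prediction in general, the following minimal sketch models a two-bit saturating counter predictor, a common textbook scheme that is not necessarily the predictor discussed here. The class name `TwoBitPredictor`, the helper `accuracy`, the table size, and the indexing by address bits are all assumptions made for the example.

```python
class TwoBitPredictor:
    """Hypothetical table of 2-bit saturating counters indexed by branch address."""

    def __init__(self, table_bits=4):
        self.mask = (1 << table_bits) - 1
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        # Every entry starts at 1 (weakly not taken); this is an assumption.
        self.table = [1] * (1 << table_bits)

    def _index(self, pc):
        # Drop the two low address bits, assuming 4-byte aligned instructions.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        """Return True if the branch at address pc is predicted taken."""
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Move the counter toward the resolved outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def accuracy(predictor, pc, outcomes):
    """Fraction of outcomes predicted correctly for a single branch address."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(outcomes)
```

A larger table or a more elaborate update rule would typically raise accuracy for real workloads, at the cost of the longer lookup latency described above.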
For example, a study using a cycle accurate simulator (CAS) of a processor shows that one additional clock cycle of branch prediction latency during the decoding process resulted in a 0.5% performance loss. Therefore, reducing branch prediction latency while maintaining high branch prediction accuracy is valuable.
FIG. 1 shows a branch predictor known in the prior art. Referring to FIG. 1, D1 indicates the first decoding stage, D2 indicates the second decoding stage, and D3 indicates the third decoding stage. The address of the branch instruction is available at the end of the first decoding stage. The branch predictor as shown has a prediction latency of two clock cycles. Thus, one additional decoding stage (D3) is required to complete branch prediction process 101 as shown in FIG. 1, and the predicted address is available at the end of the D3 stage. By comparison, in a simple branch predictor with single-cycle prediction latency, the predicted address is available at the end of the D2 stage. The one-cycle loss resulting from the longer prediction latency may cause an undesirable performance loss.
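The stage timing described for FIG. 1 can be sketched as simple arithmetic. The function names below are hypothetical, and the model assumes that prediction starts as soon as the branch address is available at the end of D1 and that each cycle of prediction latency occupies one decoding stage.

```python
def prediction_ready_stage(latency_cycles, address_ready_stage=1):
    """Return n such that the predicted address is available at the end of stage Dn.

    Assumes prediction begins once the branch address is available
    (end of D1 by default) and takes latency_cycles further stages.
    """
    if latency_cycles < 1:
        raise ValueError("prediction latency must be at least one cycle")
    return address_ready_stage + latency_cycles


def extra_stages(latency_cycles, baseline_latency=1):
    """Decoding stages added relative to a single-cycle predictor."""
    return latency_cycles - baseline_latency


# Two-cycle predictor of FIG. 1 versus a simple single-cycle predictor.
two_cycle = prediction_ready_stage(2)     # end of D3
single_cycle = prediction_ready_stage(1)  # end of D2
added = extra_stages(2)                   # one additional stage (D3)
```

Under these assumptions, the two-cycle predictor delivers its predicted address one stage later than the single-cycle predictor, which corresponds to the one-cycle loss noted above.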