With the goal of making computers faster and more powerful, many techniques are being developed to execute a microprocessor's resources in parallel and as efficiently as possible. For instance, while microprocessors have, in the past, typically executed a single instruction at a time, modem super-scalar processors execute multiple instructions in a single cycle. To efficiently supply enough instructions to execute simultaneously, multiple instructions are fetched in a single fetch cycle. Due to the pipelined nature of such a microprocessor, it is often necessary to fetch instructions before some conditional branch instruction has executed, where that branch instruction may change the control flow such that the fetched instructions are the wrong instructions to execute. This presents a waste in memory access bandwidth, latency, processing and re-access time.
Research has shown that the outcome of conditional branches can be predicted with a high degree of accuracy based on past behavior. With such a prediction, a much more efficient use of memory-access bandwidth and faster execution can be achieved.
For example, Yeh and Patt, "Two-Level Adaptive Branch Prediction", The 24th ACM/IEEE International Symposium and Workshop on Microarchitecture, (November 1991), pp. 51-61, developed a global history (ghist) branch predictor in 1991 as a generalization of their two-level branch prediction scheme. Based on a pattern of past branch taken/not taken decisions, independent of instruction address, a prediction of a next decision is made. The ghist scheme maintains a shift register whose bits represent taken/not-taken results of previous conditional branches in an instruction sequence. The next decision for a particular pattern is indicated by a value indexed by the shift register. The values are obtained by incrementing and decrementing individual counters associated with the pattern based on decisions already made.
Predictions are stored in n-bit saturating counters, where n is typically two, and an array of n-bit saturating counters is indexed by the ghist shift register. Thus, based on the history in the ghist shift register, an indexed counter is used to predict whether a branch is taken or not taken. Typically, a threshold is used to determine the prediction. For example, if a counter's value is above the threshold, a branch may be predicted as taken. If the counter value is below the threshold, the branch is predicted as not taken. If the threshold is one half of the full range of the counter, this may be simplified by using the most significant bit of the counter as the prediction indicator. Alternatively, various forms of hysteresis have been used or proposed. Regardless of the exact method used, the bit-pattern in the shift register has been found to accurately predict which branch is taken.
Yeh and Patt, "Increasing the Instruction Fetch Rate via Multiple Branch Prediction and a Branch Address Cache", Proceedings of the 7th ACM International Conference on Supercomputing, July 1993, extended this work to allow multiple branch predictions to occur almost in parallel. The described branch predictor counter array, or pattern history table, is multi-ported, having seven read ports for three predictions. Two multiplexors (one 2:1, and one 4:1) are required after the arrays to choose the correct predictions. The multiplexor controls are serial, that is, the output of the first is a control input for the second. The capacity of the predictor is limited due to the need to have seven read ports.
McFarling, "Combining Branch Predictors", WRL Technical Note TN-36, Digital Equipment Corporation, June 1993, extends Yeh and Patt's ghist work by hashing the contents of the global history register with the address of the branch instruction to compute the index into the counter table. This scheme is called "gshare" and it improves the accuracy of the predictions.
Dutta and Franklin, "Block-Level Prediction for Wide-Issue Superscalar Processors," Proc. Ist International Conference on Algorithms and Architectures for Parallel Processing, pp. 143-152, April 1995, and "Control Flow Prediction with Tree-Like Subgraphs for Superscalar Processor," Proc. 28th International Symposium on Microarchitecture, pp. 253-263, 1995 also recognized the need to perform multiple branch predictions in a cycle. They perform "block prediction" in which they determine which branch in a group of fetched instructions will be taken, and predict the target address using "block history". There appear to be two limitations to this scheme. First, although the branch predictor must be able to begin a new access every cycle in order to maintain a fully populated pipeline, two array accesses are required in series which may limit the repeat rate of the predictor. Second, the latency of the predictor is at least two full cycles due to the series array structure.
This research presents a problem in designing a very wide issue superscalar processor. In order to achieve high-performance, many branch predictions must be predicted per cycle. However, the state of the branch predictor must be large to be able to accurately predict the branches of complex programs, and the machine must be buildable and operate at a high frequency. The above schemes suggest either series arrays or arrays with a large number of ports.