1. Technical Field
The present invention relates generally to processing of branch instructions in a microprocessor, and more particularly, to methods and apparatus for implementing polymorphic branch predictors.
2. Description of the Related Art
Modern processors achieve performance by applying prediction techniques to address pipeline disruption events, such as branch operations. In accordance with the prior art, a variety of branch processing techniques have been provided. A branch predictor is the part of a processor that determines whether a conditional branch in the instruction flow of a program is likely to be taken or not; this determination is called branch prediction. Branch predictors are crucial to achieving high performance in modern superscalar processors, because they permit a processor to fetch and execute instructions without waiting for a branch to be resolved.
Early implementations of RISC architectures did trivial branch prediction: e.g., the architectures always predicted that a branch (or unconditional jump) would not be taken, so they always fetched the next sequential instruction. Only when the branch or jump was evaluated did the instruction fetch pointer get set to a nonsequential address. These CPUs evaluated branches in the decode stage and had a single cycle instruction fetch. As a result, the branch target recurrence was two cycles long, and the machine would always fetch the instruction immediately after any taken branch. Some architectures defined branch delay slots in order to utilize these fetched instructions.
Processors that implement “static prediction” predict that backwards pointing branches will be taken (assuming that the backwards branch is the bottom of a program loop), and forwards pointing branches will not be taken (assuming they are early exits from the loop or other processing code). For a loop that executes many times, this only mispredicts the very last branch of the loop. Static prediction is used as a fall-back technique (when there is no information for dynamic predictors to use) in most processors with dynamic branch prediction.
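By way of illustration only, the static rule described above may be sketched as follows. The function names and the example addresses are hypothetical and are not taken from any particular processor; the treatment of a branch that targets its own address as "backward" is likewise an assumption.

```python
def static_predict(branch_pc, target_pc):
    """Predict taken (True) for backward branches, not taken (False) for forward ones.

    Assumption: a branch whose target equals its own address counts as backward.
    """
    return target_pc <= branch_pc


def count_mispredictions(branch_pc, target_pc, outcomes):
    """Count how often the fixed static prediction differs from the actual outcomes."""
    prediction = static_predict(branch_pc, target_pc)
    return sum(1 for taken in outcomes if taken != prediction)


# Hypothetical loop-closing branch at 0x120 jumping back to 0x100:
# taken on nine iterations, then not taken when the loop exits.
loop_outcomes = [True] * 9 + [False]
loop_misses = count_mispredictions(0x120, 0x100, loop_outcomes)
```

In this sketch only the final, loop-exiting execution of the branch is mispredicted, which matches the behavior described for loops that execute many times.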
Some superscalar processors fetch, with each line of instructions, a pointer to the next line. This next line predictor is not directly comparable to the other predictors listed here because the next line predictor handles branch target prediction as well as branch direction prediction. When a next line predictor points to aligned groups of 2, 4 or 8 instructions, the branch target will usually not be the first instruction fetched, and so the initial instructions fetched are wasted. Assuming for simplicity a uniform distribution of branch targets, 0.5, 1.5, and 3.5 instructions fetched are discarded, respectively.
Since the branch itself will generally not be the last instruction in an aligned group, instructions after the taken branch (or its delay slot) will be discarded. Once again, assuming a uniform distribution of branch instruction placements, 0.5, 1.5, and 3.5 instructions fetched are discarded. The discarded instructions at the branch and destination lines add up to nearly a complete fetch cycle, even for a single-cycle next-line predictor.
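The discard figures quoted above follow from averaging over every possible position within an aligned group. A minimal sketch of that arithmetic, assuming (as the text does) a uniform distribution of positions, is given below; the function names are illustrative only.

```python
from fractions import Fraction


def expected_discarded_before_target(group_size):
    """Average number of instructions fetched ahead of the branch target.

    If the target sits at position p (0 .. group_size - 1), the p instructions
    in front of it are discarded.
    """
    return Fraction(sum(range(group_size)), group_size)


def expected_discarded_after_branch(group_size):
    """Average number of instructions fetched behind a taken branch.

    If the branch sits at position p, the group_size - 1 - p instructions
    after it are discarded (delay slots are ignored in this sketch).
    """
    total = sum(group_size - 1 - p for p in range(group_size))
    return Fraction(total, group_size)


for n in (2, 4, 8):
    before = expected_discarded_before_target(n)
    after = expected_discarded_after_branch(n)
    print(n, float(before), float(after), float(before + after))
```

Both averages equal (n - 1) / 2 for a group of n instructions, so the two losses together amount to n - 1 instructions, that is, one instruction short of a full fetch group.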
A bimodal branch predictor has a table of two-bit saturating counters, indexed with the least significant bits of the instruction addresses. Unlike the instruction cache, bimodal predictor entries typically do not have tags, and so a particular counter may be mapped to different branch instructions (this is called branch interference or branch aliasing), in which case it is likely to be less accurate. Each counter has one of four states: 1) Strongly not taken, 2) Weakly not taken, 3) Weakly taken and 4) Strongly taken.
When a branch is evaluated, the corresponding counter is updated. Branches evaluated as not taken decrement the state towards strongly not taken, and branches evaluated as taken increment the state towards strongly taken. The primary benefit of this two-bit saturating counter scheme is that loop closing branches are always predicted taken. A one-bit scheme mispredicts both the first and last branch of a loop, whereas a two-bit scheme mispredicts just the last branch. Similarly, on heavily biased branches which almost always go one way, a one-bit scheme mispredicts twice for each occurrence of the uncommon outcome (once on that occurrence and once on the branch that follows it), and a two-bit scheme mispredicts once.
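The difference between one-bit and two-bit schemes can be made concrete with a small simulation. The following is a sketch under stated assumptions: the class name, the initial counter state (the weakest "taken" state) and the example loop are all chosen for illustration and are not prescribed by the text.

```python
class SaturatingCounter:
    """An n-bit saturating counter; the upper half of its range predicts taken."""

    def __init__(self, bits=2):
        self.max_state = (1 << bits) - 1
        self.threshold = 1 << (bits - 1)
        self.state = self.threshold  # assumption: start in the weakest "taken" state

    def predict(self):
        return self.state >= self.threshold

    def update(self, taken):
        # Move one step towards the observed outcome, saturating at both ends.
        if taken:
            self.state = min(self.state + 1, self.max_state)
        else:
            self.state = max(self.state - 1, 0)


def mispredictions(bits, outcomes):
    """Run one counter over a sequence of branch outcomes and count wrong guesses."""
    counter = SaturatingCounter(bits)
    wrong = 0
    for taken in outcomes:
        if counter.predict() != taken:
            wrong += 1
        counter.update(taken)
    return wrong


# A loop-closing branch: taken four times, then not taken; the loop is entered three times.
loop = ([True] * 4 + [False]) * 3
one_bit_misses = mispredictions(1, loop)
two_bit_misses = mispredictions(2, loop)
```

With these assumptions the two-bit counter mispredicts only the exit of each loop execution, while the one-bit counter additionally mispredicts the first iteration of every re-entry.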
Because the bimodal counter table is indexed with the instruction address bits, a superscalar processor can split the table into separate SRAMs for each instruction fetched, and fetch a prediction for every instruction in parallel with fetching the instruction, so that the branch prediction is available as soon as the branch is decoded. In addition to 2-bit predictors, a variety of similar saturating counter based predictors using n bits are possible.
Bimodal branch prediction mispredicts the exit of every loop. For loops which tend to have the same loop count every time (and for many other branches with repetitive behavior), some predictors can do better. Local branch predictors keep two tables. The first table is the local branch history table. It is indexed by the low-order bits of the branch instruction's address, and it records the taken/not-taken history of the n most recent executions of the branch. The other table is the pattern history table. This table includes the actual predictors; however, its index is generated from the branch history in the first table. To predict a branch, the branch history is looked up, and that history is then used to look up a predictor to make a prediction. This approach can use either a single-bit predictor or an n-bit predictor (such as a bimodal counter).
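The two-table lookup described above may be sketched as follows. This is a simplified illustration, not a description of any specific implementation: the table sizes, the initial counter values, and the use of the raw address as an index (without discarding alignment bits) are assumptions.

```python
class LocalPredictor:
    """Two-level local predictor: per-branch history selects a two-bit counter."""

    def __init__(self, history_bits=4, bht_index_bits=6):
        self.history_mask = (1 << history_bits) - 1
        self.bht_mask = (1 << bht_index_bits) - 1
        self.bht = [0] * (1 << bht_index_bits)  # local branch history table
        self.pht = [2] * (1 << history_bits)    # pattern history table, weakly taken

    def predict(self, pc):
        history = self.bht[pc & self.bht_mask]  # first lookup: history of this branch
        return self.pht[history] >= 2           # second lookup: counter for that history

    def update(self, pc, taken):
        slot = pc & self.bht_mask
        history = self.bht[slot]
        counter = self.pht[history]
        self.pht[history] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        # Shift the newest outcome into this branch's history.
        self.bht[slot] = ((history << 1) | int(taken)) & self.history_mask


def local_mispredictions(pc, outcomes):
    predictor = LocalPredictor()
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


# A loop with a fixed trip count: taken three times, then not taken, executed ten times.
fixed_loop = [True, True, True, False] * 10
local_misses = local_mispredictions(0x40, fixed_loop)
```

Because the four-bit history that precedes the loop exit is distinct from the histories that precede the taken iterations, the predictor in this sketch learns the exit after a single misprediction, whereas a lone bimodal counter would mispredict every exit.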
Local prediction is slower than bimodal prediction because local prediction requires two sequential table lookups for each prediction. A fast implementation would use a separate bimodal counter array for each instruction fetched, so that the second array access can proceed in parallel with instruction fetch. These arrays are not redundant, as each counter is intended to store the behavior of a single branch. Global branch predictors make use of the fact that the behavior of many branches is strongly correlated with the history of other recently taken branches. In one implementation, a predictor can keep a single shift register updated with the recent history of every branch executed, and use this value to index into a table of predictors (e.g., single bit or bimodal counter predictors).
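A global predictor of the kind just described, with one shared shift register indexing a table of counters, might be sketched as below. The two-bit history length, the initial counter values and the example of two perfectly correlated branches are assumptions made for illustration.

```python
class GlobalPredictor:
    """A single global history register indexes a table of two-bit counters."""

    def __init__(self, history_bits=2):
        self.mask = (1 << history_bits) - 1
        self.history = 0
        self.counters = [2] * (1 << history_bits)  # weakly taken

    def predict(self):
        return self.counters[self.history] >= 2

    def update(self, taken):
        counter = self.counters[self.history]
        self.counters[self.history] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        # Every executed branch shifts its outcome into the shared history.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def global_mispredictions(outcomes, history_bits=2):
    predictor = GlobalPredictor(history_bits)
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong


# Hypothetical pair of correlated branches: A alternates, and B always resolves
# the same way as the A that ran immediately before it (stream is A, B, A, B, ...).
a_outcomes = [True, False] * 10
stream = [outcome for a in a_outcomes for outcome in (a, a)]
global_misses = global_mispredictions(stream)
```

In this sketch the outcome of each branch is fully determined by the two preceding outcomes, so after a short warm-up the global history is sufficient to predict both branches.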
A gselect predictor indexes a table of predictors with the recent history concatenated with a few bits of the branch instruction's address. Gselect does better than local prediction for small table sizes, and local prediction is only slightly better for table storage larger than 1 KB. Another implementation offers better prediction accuracy than gselect by XORing the branch instruction address with the global history, rather than concatenating, at the cost of the more expensive XOR in lieu of a simple concatenation. This predictor is referred to as gshare, which is a little better than gselect for tables larger than 256 bytes.
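The two indexing schemes differ only in how address bits and history bits are combined. A minimal sketch is given below; the function names, bit widths and example values are hypothetical.

```python
def gselect_index(pc, history, pc_bits, history_bits):
    """Concatenate the low history bits with the low address bits."""
    pc_part = pc & ((1 << pc_bits) - 1)
    history_part = history & ((1 << history_bits) - 1)
    return (history_part << pc_bits) | pc_part


def gshare_index(pc, history, index_bits):
    """XOR the address with the history, then keep index_bits bits."""
    return (pc ^ history) & ((1 << index_bits) - 1)


# Two hypothetical branches whose addresses differ only in their upper four bits,
# seen under the same eight bits of global history, with a 256-entry table.
pc_a, pc_b, history = 0b10100011, 0b01010011, 0b11001010

gselect_a = gselect_index(pc_a, history, pc_bits=4, history_bits=4)
gselect_b = gselect_index(pc_b, history, pc_bits=4, history_bits=4)
gshare_a = gshare_index(pc_a, history, index_bits=8)
gshare_b = gshare_index(pc_b, history, index_bits=8)
```

For the same table size, gselect can afford only four address bits and four history bits, so the two branches in this example collide; gshare folds all eight bits of each into the index and keeps them apart.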
Gselect and gshare are easier to make fast than local prediction, because they require a single table lookup per branch. As with bimodal prediction, the table can be split so that parallel lookups can be made for each instruction fetched, so that the table lookup can proceed in parallel with instruction load. Scott McFarling proposed combined branch prediction in “Combining Branch Predictors”, WRL Technical Note 36, 1993. Such combined predictors are referred to as multi-component predictors in the descriptions hereinbelow. Combined branch prediction is about as accurate as local prediction, and almost as fast as global prediction.
Combined branch prediction uses three predictors in parallel: e.g., a local bimodal, gshare, and a bimodal-like predictor to pick which of bimodal or gshare to use on a branch-by-branch basis. The choice predictor can be a single bit predictor, or saturating n bit counter, used for choosing the prediction to use. In this case the counter is updated whenever the bimodal and gshare predictions disagree, to select which result to choose. Another way of combining branch predictors is to have, e.g., 3 different branch predictors, and merge their results by a majority vote. Predictors like gshare use multiple table entries to track the behavior of any particular branch. This multiplication of entries makes it much more likely that two branches will map to the same table entry (a situation called aliasing), which in turn makes it much more likely that prediction accuracy will suffer for those branches. Once multiple predictors are employed, it is beneficial to arrange that each predictor will have different aliasing patterns, so that it is more likely that at least one predictor will have no aliasing. Combined predictors with different indexing functions for the different predictors are called gskew predictors, and are analogous to skewed caches used for data and instruction caching.
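By way of example, a combined predictor with a bimodal component, a gshare component and a two-bit choice counter may be sketched as follows, together with a simple majority vote. The table size, initial counter values, the indexing of the choice table by address, and all names are assumptions of this sketch rather than details taken from the cited technical note.

```python
def bump(counter, up):
    """Two-bit saturating increment or decrement."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class CombinedPredictor:
    def __init__(self, index_bits=4):
        size = 1 << index_bits
        self.mask = size - 1
        self.bimodal = [2] * size
        self.gshare = [2] * size
        self.choice = [2] * size  # >= 2 means "use the gshare prediction"
        self.history = 0

    def _lookup(self, pc):
        b_idx = pc & self.mask
        g_idx = (pc ^ self.history) & self.mask
        return b_idx, g_idx, self.bimodal[b_idx] >= 2, self.gshare[g_idx] >= 2

    def predict(self, pc):
        b_idx, _, b_pred, g_pred = self._lookup(pc)
        return g_pred if self.choice[b_idx] >= 2 else b_pred

    def update(self, pc, taken):
        b_idx, g_idx, b_pred, g_pred = self._lookup(pc)
        if b_pred != g_pred:
            # Train the chooser only when the components disagree.
            self.choice[b_idx] = bump(self.choice[b_idx], g_pred == taken)
        self.bimodal[b_idx] = bump(self.bimodal[b_idx], taken)
        self.gshare[g_idx] = bump(self.gshare[g_idx], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def majority_vote(predictions):
    """Merge an odd number of component predictions by majority."""
    return 2 * sum(predictions) > len(predictions)


def combined_mispredictions(pc, outcomes):
    predictor = CombinedPredictor()
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


# A branch that strictly alternates: a bimodal counter alone never learns this pattern.
alternating = [True, False] * 10
combined_misses = combined_mispredictions(0x5, alternating)
```

In this sketch the chooser settles on the gshare component for the alternating branch, so mispredictions are confined to the warm-up period; for a heavily biased branch the two components would agree and the chooser would remain untouched.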
Another technique to reduce destructive aliasing within the pattern history tables is an agree predictor. A method is used to establish a relatively static prediction for the branch, perhaps a bimodal predictor or hint bits within the branch instruction. Another predictor (e.g., a gskew predictor) makes predictions, but rather than predicting taken/not-taken, the predictor predicts agree/disagree with the base prediction. The intention is that if branches covered by the gskew predictor tend to be a bit biased in one direction, perhaps 70%/30%, then all those biases can be aligned so that the gskew pattern history table will tend to have more agree entries than disagree entries. This reduces the likelihood that two aliasing branches would best have opposite values in the pattern history table (PHT).
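The effect of agree prediction on aliasing can be illustrated with a deliberately small sketch. Several simplifications are assumed here and are not part of the source: the base prediction is taken to be the first observed outcome of each branch (standing in for hint bits or a bimodal predictor), and the agree table is indexed by address bits only rather than by a gskew-style function.

```python
def bump(counter, up):
    """Two-bit saturating increment or decrement."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class AgreePredictor:
    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.base = {}                          # per-branch base prediction (assumed)
        self.agree = [2] * (1 << index_bits)    # >= 2 means "agree with the base"

    def predict(self, pc):
        base = self.base.get(pc, True)
        agrees = self.agree[pc & self.mask] >= 2
        return base if agrees else not base

    def update(self, pc, taken):
        base = self.base.setdefault(pc, taken)  # first outcome fixes the base
        idx = pc & self.mask
        self.agree[idx] = bump(self.agree[idx], taken == base)


def agree_mispredictions(stream):
    predictor = AgreePredictor()
    wrong = 0
    for pc, taken in stream:
        wrong += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return wrong


def shared_counter_mispredictions(outcomes):
    """Baseline: the same two branches sharing one ordinary taken/not-taken counter."""
    counter, wrong = 2, 0
    for taken in outcomes:
        wrong += (counter >= 2) != taken
        counter = bump(counter, taken)
    return wrong


# Two hypothetical branches that alias to the same table entry (low four bits are equal):
# one is always taken, the other never taken, executed alternately.
stream = [(0x10, True), (0x20, False)] * 10
agree_misses = agree_mispredictions(stream)
baseline_misses = shared_counter_mispredictions([taken for _, taken in stream])
```

Both branches push the shared entry towards "agree", so their interference becomes constructive; with an ordinary taken/not-taken counter the same two branches pull the entry in opposite directions and one of them is mispredicted every time.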
Agree predictors work well with combined predictors, because the combined predictor usually has a predictor which can be used as the base for the agree predictor. Agree predictors do less well with branches that are not biased in one direction, if that causes the base predictor to give changing predictions. So an agree predictor may work best as part of a three-predictor scheme, with one agree predictor and another non-agree type predictor.
Almost all pipelined processors do branch prediction of some form, because they must guess the address of the next instruction to fetch before the current instruction has been executed. Key parameters in designing branch prediction techniques are the number of branch prediction entries, and the branch prediction algorithm, such as single bit predictors, or saturating n-bit predictors. These decisions have to be applied to a variety of branch prediction methods, for local or global predictors.
While the prior art has allowed a combination of a variety of predictors, a key decision for microprocessor designers has been the choice of branch prediction algorithms. In accordance with prior art, with a fixed memory allocation of k bits, designers have had the ability to implement each prediction table to have either k single bit predictors, or k/2 bimodal predictors, or more generally, k/n predictors with n bit counters. This represents a tradeoff between improving the quality of each individual prediction by using more bits per predictor, and offering a larger number of more simply structured predictors. In another tradeoff, designers have the possibility to opt for longer latency local predictors, or short latency bimodal or single bit predictors.
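The storage tradeoff can be stated as simple arithmetic. The sketch below assumes a hypothetical budget of 8192 bits; the function name and the budget are illustrative only.

```python
def predictor_entries(budget_bits, counter_bits):
    """Number of n-bit counters that fit in a fixed storage budget of k bits."""
    return budget_bits // counter_bits


budget = 8192  # hypothetical table budget, in bits
entries_by_width = {n: predictor_entries(budget, n) for n in (1, 2, 3)}
```

Under this assumed budget, a table of single-bit predictors has twice as many entries as a table of bimodal counters, and therefore fewer branches alias to each entry, at the price of less hysteresis per entry.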
The best prediction quality depends on a variety of factors, such as workload-specific properties, which may differ between programs, or between phases of a single program. Thus, while the state of the art has permitted the combination of predictors, it has not permitted optimization of the prediction to a specific application, or even phase within an application. Instead, the structure (such as tournament predictors), the use of global or local prediction, and the choice of 1 bit or bimodal predictors had to be fixed at design time, requiring an implementer to select a single configuration that was then used for all applications.
While predictor design has permitted good average performance, the prior art has not been able to optimize predictors for specific applications.