The present invention relates generally to the field of processors and in particular to an effective organization for a branch history table in a processor having variable length instruction set execution modes.
Traditional instruction set architectures for processors have a uniform instruction length. That is, every instruction in the instruction set comprises the same number of bits (e.g., 16 or 32). Processors having variable length instruction set execution modes—wherein the processor may execute instructions having different bit lengths—are known in the art. For example, recent versions of the ARM architecture include 16-bit instructions that are executed in a 16-bit instruction set execution mode (Thumb mode) as well as the traditional 32-bit ARM instructions that are executed in a 32-bit instruction set execution mode (ARM mode).
One problem with processors executing variable length instructions is that instructions do not fall on uniform memory boundaries. Accordingly, circuits or operations that increment through, or randomly address, instructions (or ancillary constructs associated with instructions) cannot utilize a uniform incrementing or addressing scheme. Rather, they must alter the addressing scheme based on the length of instructions currently being executed, i.e., the current instruction set execution mode.
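The dependence of the addressing scheme on the execution mode can be illustrated with a minimal sketch. The mode names, the `INSTRUCTION_BYTES` table, and both helper functions are hypothetical and chosen only for illustration; the instruction sizes follow the 16-bit and 32-bit modes mentioned above.

```python
# Illustrative only: instruction size in bytes for each execution mode.
INSTRUCTION_BYTES = {"arm": 4, "thumb": 2}

def next_sequential_address(addr, mode):
    """Address of the instruction following the one at addr."""
    return addr + INSTRUCTION_BYTES[mode]

def instruction_slot(addr, mode):
    """Ordinal position of an instruction, counted in units of the
    current mode's instruction length."""
    return addr // INSTRUCTION_BYTES[mode]
```

The same byte address therefore maps to a different increment and a different slot number in each mode, which is why a single fixed scheme does not serve both.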
Most modern processors employ a pipelined architecture, where sequential instructions, each having multiple execution steps, are overlapped in execution. For maximum performance, the instructions should flow continuously through the pipeline. Any situation that causes instructions to stall in the pipeline detrimentally affects performance. If instructions must be flushed from the pipeline and subsequently re-fetched, both performance and power consumption suffer.
Virtually all real-world programs include conditional branch instructions, the actual branching behavior of which is not known until the instruction is evaluated deep in the pipeline. To avoid the stall that would result from waiting for actual evaluation of the branch instruction, most modern processors employ some form of branch prediction, whereby the branching behavior of conditional branch instructions is predicted early in the pipeline. Based on the predicted branch evaluation, the processor speculatively fetches and executes instructions from a predicted address—either the branch target address (if the branch is predicted taken) or the next sequential address after the branch instruction (if the branch is predicted not taken). When the actual branch behavior is determined, if the branch was mispredicted, the speculatively fetched instructions must be flushed from the pipeline, and new instructions fetched from the correct next address. Speculatively fetching instructions in response to an erroneous branch prediction adversely impacts processor performance and power consumption. Consequently, improving the accuracy of branch prediction is an important design goal.
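The fetch decision described above may be sketched as follows. This is a simplified model, not a description of any particular pipeline; the function names and the default 4-byte instruction size are assumptions made for illustration.

```python
def predicted_fetch_address(branch_addr, target_addr, predicted_taken,
                            instr_bytes=4):
    """Next address to fetch speculatively after a conditional branch.

    instr_bytes is the length of the branch instruction (assumed 4 here).
    """
    if predicted_taken:
        return target_addr              # fetch from the branch target
    return branch_addr + instr_bytes    # fall through to the next instruction

def must_flush(predicted_taken, actual_taken):
    """True when the speculatively fetched instructions must be discarded."""
    return predicted_taken != actual_taken
```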
Several methods of branch prediction are based on the branch evaluation history of the branch instruction being predicted and/or other branch instructions in the same code. Extensive analysis of actual code indicates that recent past branch evaluation patterns may be a good indicator of the evaluation of future branch instructions.
One known form of branch prediction utilizes a Branch History Table (BHT) to store an indication of recent branch evaluations. As one example, the BHT may comprise a plurality of saturation counters, the most significant bits (MSBs) of which serve as bimodal branch predictors. Each counter may be, for instance, a 2-bit counter that assumes one of four states, each assigned a weighted prediction value, such as:
11—Strongly predicted taken
10—Weakly predicted taken
01—Weakly predicted not taken
00—Strongly predicted not taken
The counter increments each time a corresponding branch instruction evaluates “taken” and decrements each time the instruction evaluates “not taken,” saturating at 11 and 00 rather than wrapping around. The MSB of the counter is a bimodal branch predictor; it predicts a branch to be either taken or not taken, regardless of the strength or weight of the underlying prediction. A saturation counter reduces the prediction error of an infrequent branch evaluation, as a single branch evaluation in one direction will not change the prediction of a counter that is saturated in the other direction.
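The counter behavior described above can be modeled with a short sketch. The class name, method names, and the default starting state are hypothetical; only the four states and the increment, decrement, and MSB rules come from the description.

```python
class SaturatingCounter:
    """Model of an n-bit saturation counter; the MSB is the prediction."""

    def __init__(self, bits=2, state=0b01):
        self.max_state = (1 << bits) - 1   # 0b11 for a 2-bit counter
        self.msb_shift = bits - 1
        self.state = state                 # starting state is an assumption

    def update(self, taken):
        """Move one step toward the observed outcome, clamping at the ends."""
        if taken:
            self.state = min(self.state + 1, self.max_state)
        else:
            self.state = max(self.state - 1, 0)

    def predict_taken(self):
        """Bimodal prediction: the most significant bit of the state."""
        return bool(self.state >> self.msb_shift)
```

For example, a counter in state 11 that sees one "not taken" evaluation moves to state 10 and still predicts taken; a second consecutive "not taken" evaluation is needed to flip the prediction.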
In the case of a “local” BHT, each branch instruction whose evaluation is being predicted is associated with a single BHT counter. Accordingly, the BHT is indexed with part of the branch instruction address (BIA). Many modern processors fetch a plurality of instructions in blocks, or fetch groups, in a single fetch operation. In this case, the address associated with the block or fetch group is considered a BIA, as the term is used herein. In the case of a “global” BHT, recent global branch evaluation history may be concatenated with (gselect) or hashed with (gshare) the BIA prior to indexing the BHT counters.
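The three indexing schemes can be sketched as follows. Several details are assumptions made for illustration and are not taken from the text: the function names, the `low_bit` parameter (set to 2 on the assumption of 4-byte-aligned instructions), the ordering of address and history bits in the gselect concatenation, and the use of exclusive-OR as the gshare hash, which is the commonly used choice.

```python
def local_index(bia, index_bits, low_bit=2):
    """Local BHT: index with a slice of the branch instruction address."""
    return (bia >> low_bit) & ((1 << index_bits) - 1)

def gselect_index(bia, history, addr_bits, hist_bits, low_bit=2):
    """Global BHT (gselect): concatenate address bits with history bits."""
    addr = (bia >> low_bit) & ((1 << addr_bits) - 1)
    hist = history & ((1 << hist_bits) - 1)
    return (addr << hist_bits) | hist      # bit ordering is an assumption

def gshare_index(bia, history, index_bits, low_bit=2):
    """Global BHT (gshare): hash address bits with history (XOR assumed)."""
    return ((bia >> low_bit) ^ history) & ((1 << index_bits) - 1)
```

With a 4-bit index, a BIA of 0x1234 selects local entry 0xD; with a global history of 0b0101, the gshare index becomes 0x8, so the same branch can use different counters under different recent histories.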
Instruction sets having different instruction lengths complicate the design of BHTs. In particular, the BHT is ideally indexed differently in each different instruction set execution mode, since each counter is associated with a branch instruction, and the instructions fall on different memory boundaries in different instruction set execution modes. One known solution is to size the BHT based on the largest instruction length but address it based on the smallest instruction length. This solution leaves large portions of the table empty, or filled with duplicate entries associated with longer branch instructions. Another known solution is to multiplex the BHT index addresses, effectively using a different part of the instruction address in each instruction set execution mode. This adds a large number of multiplexers, which increases silicon area and power consumption. More significantly, it adds delay to a critical path, thus increasing the cycle time and adversely impacting processor performance.
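The trade-off between the two known solutions can be illustrated with a small software model. The 16-entry table, the function names, and the address ranges are hypothetical, and the model assumes 2-byte and 4-byte instructions as in the Thumb and ARM modes; a mode-dependent shift stands in for the hardware multiplexers.

```python
INDEX_BITS = 4                      # hypothetical 16-entry table
MASK = (1 << INDEX_BITS) - 1

def index_smallest_length(bia):
    """Fixed scheme: always index at 2-byte (smallest) granularity."""
    return (bia >> 1) & MASK

def index_multiplexed(bia, thumb_mode):
    """Mode-dependent scheme: select different address bits per mode."""
    shift = 1 if thumb_mode else 2
    return (bia >> shift) & MASK

arm_addrs = range(0, 64, 4)         # 4-byte-aligned instruction addresses
thumb_addrs = range(0, 32, 2)       # 2-byte-aligned instruction addresses

# Entries reachable by 32-bit instructions under each scheme.
used_fixed = {index_smallest_length(a) for a in arm_addrs}
used_mux = {index_multiplexed(a, False) for a in arm_addrs}
# Entries reachable by 16-bit instructions under the fixed scheme.
used_thumb = {index_smallest_length(a) for a in thumb_addrs}
```

In this model the fixed scheme reaches only the eight even-numbered entries in 32-bit mode, because address bit 1 is always zero for 4-byte-aligned instructions, while the multiplexed scheme reaches all sixteen at the cost of a mode-dependent selection on the index path.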