Microprocessors (referred to herein simply as “processors”) consume energy/power during their operation. It is advantageous to reduce the amount of energy consumed, particularly in the case of devices that run off limited power supplies.
Various factors affect the amount of energy that a processor consumes. For example, the frequency at which the processor operates, the voltage level that powers the processor, and the load capacitances all affect processor energy consumption. Reducing the frequency of the processor or the supply voltage may decrease processor energy consumption; however, doing so may also adversely affect the performance of the processor.
Other techniques to reduce energy consumption, for example by reducing load capacitances, may include changes to processor architectures and processor circuits. Some other techniques rely on modifying the application itself, or any other system layer, to improve energy efficiency.
Branch prediction relates to activities that determine the target address of an instruction that changes the control-flow of a program at runtime, such as a branch instruction.
As noted in the literature, the energy consumed by branch prediction mechanisms is a significant fraction of a processor's energy consumption. In addition, branch prediction accuracy affects both performance and overall processor energy consumption.
Various factors affect the energy impact of branch prediction. In the following paragraphs, we provide an introduction to various branch prediction mechanisms.
There are different prediction approaches in place today to determine the target of a branch in time, i.e., to avoid the penalty of fetching from the wrong address in the case of a taken branch.
One such approach is static prediction, which assumes that a branch behaves uniformly, i.e., it is either taken or not taken, during the entire execution of a program.
Static predictors typically require no hardware-table lookup; they need only determine the direction of a branch. This can be easily accommodated in many conditional branch instructions.
For forward-pointing branches, such a predictor would typically predict “not taken” as the branch outcome, and for backward-pointing branches it would predict “taken”. This seemingly arbitrary rule is based on statistical branch outcomes in programs. For example, it is known that backward-pointing branches are taken 90% of the time and forward-pointing ones are not taken 56% of the time for integer programs such as those in the SPEC benchmark suite.
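As a rough illustration of how these two statistics combine into an overall accuracy, the sketch below weights them by the share of backward-pointing branches. The function name and the 25% backward share used in the note that follows are hypothetical values chosen for illustration; the text above does not give the actual mix.

```python
def static_accuracy(backward_fraction: float,
                    backward_taken_rate: float = 0.90,
                    forward_not_taken_rate: float = 0.56) -> float:
    """Expected accuracy of the backward-taken / forward-not-taken rule.

    backward_fraction is the share of executed branches that point
    backward; it is an assumed input, not a figure from the text.
    The two default rates are the SPEC integer figures quoted above.
    """
    if not 0.0 <= backward_fraction <= 1.0:
        raise ValueError("backward_fraction must be between 0 and 1")
    forward_fraction = 1.0 - backward_fraction
    return (backward_fraction * backward_taken_rate
            + forward_fraction * forward_not_taken_rate)
```

Under the assumed 25% backward share, `static_accuracy(0.25)` evaluates to about 0.645. With no backward branches the result falls to 0.56, and with only backward branches it rises to 0.90.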
If the prediction is that the branch will be taken, the target address would be calculated at runtime and the next instruction-group fetched from the target address, rather than sequentially. A correctly predicted branch typically executes with no performance penalty. Moreover, in architectures such as ARM9 or ARM10, a branch can be folded (e.g., pulled or removed) from the pipeline, resulting in what are called zero-cycle branches.
A part of the energy cost of a static prediction is related to pre-decoding instructions in a fetch-buffer, or prefetch buffer, to find out if a fetched instruction is a branch. Pre-decoding is required to determine changes in control-flow before a branch instruction enters the pipeline. In addition, it is necessary to check the direction of a branch, i.e., to determine if a branch is forward- or backward-pointing. The direction of a branch can often be determined by decoding the sign bit of the branch offset. Negative offsets represent backward-pointing branches while positive ones correspond to forward-pointing ones.
If a branch is predicted taken, the target address is typically calculated by performing an arithmetic operation. This is often an addition of the program counter and the constant offset decoded from the branch instruction itself.
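The sign-bit check and the target calculation described above can be sketched as follows. This is a minimal illustration, not any particular processor's encoding: the 16-bit two's-complement offset field, the byte offset taken relative to the branch's own address, the 4-byte instruction size, and all names are assumptions.

```python
INSTRUCTION_SIZE = 4  # assumed fixed instruction width in bytes


def sign_extend(field: int, bits: int) -> int:
    """Interpret the low `bits` bits of `field` as a two's-complement value."""
    field &= (1 << bits) - 1
    if field & (1 << (bits - 1)):      # sign bit set: negative offset
        return field - (1 << bits)
    return field


def static_predict(pc: int, offset_field: int, offset_bits: int = 16):
    """Return (predicted_taken, next_fetch_address) for a conditional branch.

    Backward-pointing branches (negative offset) are predicted taken;
    forward-pointing branches are predicted not taken.
    """
    backward = bool(offset_field & (1 << (offset_bits - 1)))
    if backward:
        # Predicted taken: target = program counter + decoded constant offset.
        return True, pc + sign_extend(offset_field, offset_bits)
    # Predicted not taken: continue fetching sequentially.
    return False, pc + INSTRUCTION_SIZE
```

For instance, with these assumptions a branch at address 0x1000 with offset field 0xFFF0 (that is, -16) is predicted taken with target 0x0FF0, while offset field 0x0010 is predicted not taken and fetch continues at 0x1004.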
Static prediction schemes are often limited to conditional branches where the offset is a constant. Branches that use registers to calculate the address, or indirect branches, cannot typically be predicted before the register content is available in the pipeline. Other complementary schemes such as a return stack would be used for branching related to procedure calls and returns.
While the energy consumed by static prediction is relatively low, such prediction is often not preferred or used, due to its low prediction accuracy. Static predictors rarely exceed 60%-65% prediction accuracy, although in some cases, e.g., for some branch instructions, they can do much better.
A branch misprediction incurs a high performance and energy penalty. This is due to the extra, unnecessary instruction fetches and decodes that result from fetching from incorrect execution paths. The performance and energy penalty of a misprediction depends on the pipeline depth of the processor, especially the number of stages between the fetch and execute stages, and on the size of the prefetch buffers, which together determine the extra fetches resulting from a mispredicted branch. Even in relatively short pipelines, the misprediction penalty is significant, e.g., it can be as high as 4-5 cycles in modern processors.
A variant of static prediction is based on compile-time static prediction. In such a scheme, a bit in each branch instruction is set by the compiler and determines the direction of a branch. As no hardware lookup is required in either case, the power costs of a runtime static predictor and a compile-time static predictor are similar.
Due to the increasing performance requirements of emerging applications, e.g., applications that require both wireless and internet capabilities, processor vendors generally prefer dynamic branch predictors.
Dynamic predictors improve the accuracy of branch prediction, and therefore performance, for most applications.
Dynamic predictors are based on runtime mechanisms. Branch history tables or similar auxiliary hardware tables are used to predict the direction of a branch for each branch instruction occurrence. At runtime, these tables are continuously updated to better reflect the likely branch outcomes.
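One common way to organize such a table, shown here only as a sketch, is an array of 2-bit saturating counters indexed by low-order bits of the branch address. The text above does not specify a counter scheme; the 2-bit counters, the table size, the indexing, and the class name are all assumptions made for illustration.

```python
class BranchHistoryTable:
    """Direction predictor built from 2-bit saturating counters (assumed scheme).

    Counter values 0-1 predict not taken, values 2-3 predict taken.
    """

    def __init__(self, entries: int = 1024):
        self.entries = entries
        self.counters = [1] * entries   # start at "weakly not taken"

    def _index(self, pc: int) -> int:
        # Drop the two low bits (assumed 4-byte instructions), then wrap.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Move the counter toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

In this sketch every prediction costs one table read and every resolved branch costs one table write, which is the lookup activity whose energy the text is concerned with.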
A key difference between static and dynamic prediction is that the same branch instruction can be predicted differently depending on its execution context in the dynamic case. Branch target address caches are often used to store the target address, in addition to the prediction outcome information. This can speed up the process as the target address is readily available after the table lookup.
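A branch target address cache can be sketched as a small tagged table that maps a branch address to its last known target. The direct-mapped organization, the table size, and the names below are hypothetical simplifications; real designs are often set-associative.

```python
from typing import Optional


class BranchTargetCache:
    """Direct-mapped cache from branch address to target address (sketch)."""

    def __init__(self, entries: int = 256):
        self.entries = entries
        self.tags = [None] * entries    # tag array: identifies the cached branch
        self.targets = [0] * entries    # data array: stored target addresses

    def _split(self, pc: int):
        word = pc >> 2                  # assumed 4-byte instructions
        return word % self.entries, word // self.entries   # (index, tag)

    def lookup(self, pc: int) -> Optional[int]:
        """Return the cached target, or None when the entry belongs to another branch."""
        index, tag = self._split(pc)
        if self.tags[index] == tag:
            return self.targets[index]
        return None

    def record(self, pc: int, target: int) -> None:
        """Store the resolved target of a taken branch."""
        index, tag = self._split(pc)
        self.tags[index] = tag
        self.targets[index] = target
```

On a hit, the target is available directly from the table without waiting for the offset to be decoded and added to the program counter; on a miss, the target still has to be computed as in the static case.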
There are many different dynamic predictors available. For example, various dynamic predictors can be built depending on whether only local branch-related information is used or whether local branch information is combined with global information about the program context in which the branch occurs at a given time. Many predictors are multi-level, requiring lookups of multiple large hardware tables. These tables are implemented similarly to caches, with search tags and data array segments. The search is done through an associative lookup mechanism.
Dynamic predictors are typically more accurate than static predictors, as they capture the dynamic behavior of branches rather than always predicting the same outcome, taken or not taken, for a given branch. Therefore, they are almost always preferred in processor designs that issue multiple instructions per cycle, where the misprediction penalty is fairly large because of the high instruction fetch rates.
Predicting a branch with a dynamic predictor requires looking up branch history tables to determine the prediction. The hardware complexity, table size, and associativity of these tables, as well as the lookup frequency, are key determining factors in the ultimate energy efficiency achieved.
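One widely described way to combine a branch's own address with global context is to XOR the address with a register of recent branch outcomes and use the result as a table index, often called a gshare-style predictor. The sketch below assumes that scheme along with 2-bit counters and a 4-bit history; none of these specifics come from the text above, and the names are hypothetical.

```python
class GlobalHistoryPredictor:
    """gshare-style predictor: index = branch address XOR global history."""

    def __init__(self, history_bits: int = 4):
        self.mask = (1 << history_bits) - 1
        self.history = 0                        # most recent outcomes, 1 = taken
        self.counters = [1] * (1 << history_bits)

    def _index(self, pc: int) -> int:
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)                     # index uses pre-update history
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Because the index depends on the history, the same branch instruction can map to different counters in different contexts. A branch that strictly alternates between taken and not taken, which a static predictor gets wrong half the time, is predicted correctly by this sketch once the counters have warmed up.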
A key challenge with branch prediction is, however, not only to determine a likely target address or predict a change of control, i.e., if a branch is taken, but to predict it before the branch is decoded in the pipeline. Otherwise, potentially unnecessary instructions are already fetched on the sequential path adding to the energy consumption and imposing a performance penalty.
A typical solution, often used in today's processors, is to predict branches in parallel with the instruction fetch. In this case, however, branch prediction is performed with almost every instruction fetch, regardless of whether a fetched instruction is a branch or not. This is very energy inefficient, as many more table lookups are performed than the actual branch instructions in a program would require. For example, assuming that there is one branch instruction per five regular (non-branch) instructions, this would roughly mean five times more branch table lookups and five times more energy consumed.
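The arithmetic behind this example can be sketched as below. It assumes one table lookup per fetched instruction and a constant energy cost per lookup; the function name and the branch count used in the note that follows are hypothetical.

```python
def lookup_counts(non_branch_per_branch: int, branches: int):
    """Compare lookups when predicting on every fetch vs. only on branches.

    Returns (lookups_with_every_fetch, lookups_actually_needed,
    unnecessary_lookups), assuming one lookup per fetched instruction.
    """
    fetched = branches * (non_branch_per_branch + 1)
    lookups_with_every_fetch = fetched
    lookups_needed = branches
    unnecessary = lookups_with_every_fetch - lookups_needed
    return lookups_with_every_fetch, lookups_needed, unnecessary
```

With five non-branch instructions per branch and 1,000 branches, `lookup_counts(5, 1000)` returns `(6000, 1000, 5000)`: the unnecessary lookups number five times those the branches themselves require.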