Microprocessors often use pipelining to enhance performance. Within a pipelined microprocessor, the functional units that execute the different stages of an instruction operate simultaneously on multiple instructions, achieving a degree of parallelism that increases performance over non-pipelined microprocessors.
As an example, an instruction fetch unit, an instruction decode unit and an instruction execution unit may operate simultaneously. During one clock cycle, the instruction execution unit executes a first instruction while the instruction decode unit decodes a second instruction and the fetch unit fetches a third instruction.
During the next clock cycle, the execution unit executes the newly decoded instruction while the instruction decode unit decodes the newly fetched instruction and the fetch unit fetches yet another instruction. In this manner, neither the fetch unit nor the decode unit needs to wait for the instruction execution unit to execute the last instruction before processing new instructions. In state-of-the-art microprocessors, the steps necessary to fetch and execute an instruction are subdivided into a larger number of stages to achieve a deeper degree of pipelining.
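The overlap among the fetch, decode and execution units described above can be illustrated with a minimal sketch. It assumes an idealized three-stage pipeline with no stalls in which each instruction advances one stage per clock cycle; the function names and the instruction labels I1 through I4 are hypothetical and chosen only for illustration.

```python
# Idealized three-stage pipeline (assumption: no stalls, one stage per cycle).
STAGES = ["fetch", "decode", "execute"]

def pipeline_schedule(instructions):
    """Return, for each clock cycle, which instruction occupies each stage."""
    n_cycles = len(instructions) + len(STAGES) - 1
    schedule = []
    for cycle in range(n_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            # Instruction i enters fetch in cycle i and is `depth` stages
            # further along `depth` cycles later.
            idx = cycle - depth
            row[stage] = instructions[idx] if 0 <= idx < len(instructions) else None
        schedule.append(row)
    return schedule

def non_pipelined_cycles(instructions):
    """Cycles needed if each instruction passes through every stage alone."""
    return len(instructions) * len(STAGES)

program = ["I1", "I2", "I3", "I4"]  # hypothetical instruction labels
schedule = pipeline_schedule(program)
for cycle, row in enumerate(schedule, start=1):
    print(cycle, row)
print("pipelined cycles:", len(schedule))
print("non-pipelined cycles:", non_pipelined_cycles(program))
```

In the third cycle of this sketch the execution unit holds I1 while the decode unit holds I2 and the fetch unit holds I3, which corresponds to the example given above.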
A pipelined CPU operates most efficiently when the instructions are executed in the sequence in which the instructions appear in the program order. Unfortunately, this is typically not the case. Rather, computer programs typically include a large number of branch instructions, which, upon execution, may cause instructions to be executed in a sequence other than as set forth in the program order.
More specifically, when a branch instruction is encountered in the program flow, execution continues either with the next sequential instruction or execution jumps to an instruction specified as the "branch target". Typically the branch instruction is said to be "Taken" if execution jumps to an instruction other than the next sequential instruction, and "Not Taken" if execution continues with the next sequential instruction.
Branch instructions are either unconditional, meaning the branch is taken every time the instruction is executed, or conditional, meaning the branch is dependent upon a condition. Instructions to be executed following a conditional branch are not known with certainty until the condition upon which the branch depends is resolved.
However, rather than wait until the condition is resolved, state-of-the-art microprocessors may perform a branch prediction, whereby the microprocessor predicts whether the branch will be Taken or Not Taken, and if Taken, predicts the target address for the branch. If the branch is predicted to be Taken, the microprocessor fetches and speculatively executes the instructions found at the predicted branch target address. The instructions executed following the branch prediction are "speculative" because the microprocessor does not yet know whether the prediction is correct.
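As a rough illustration of how a prediction can select the next fetch address, the following sketch uses a two-bit saturating counter per branch together with the last observed target address. This scheme is a commonly used technique substituted here for illustration; the text above does not specify a particular prediction mechanism. The class name, the function name, the four-byte instruction size and the addresses are all assumptions.

```python
class TwoBitPredictor:
    """Hypothetical predictor: a 2-bit saturating counter plus the
    last observed target, both indexed by branch address."""

    def __init__(self):
        self.counters = {}  # branch address -> counter in the range 0..3
        self.targets = {}   # branch address -> last observed target address

    def predict(self, pc):
        """Return (predicted_taken, predicted_target or None)."""
        # Predict Taken only if the counter is in a "taken" state (2 or 3)
        # and a target address has been recorded for this branch.
        taken = self.counters.get(pc, 1) >= 2 and pc in self.targets
        return taken, (self.targets[pc] if taken else None)

    def update(self, pc, taken, target=None):
        """Train on the resolved outcome of the branch at address pc."""
        counter = self.counters.get(pc, 1)
        self.counters[pc] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        if taken:
            self.targets[pc] = target

def next_fetch_address(predictor, pc, instruction_size=4):
    """Address to fetch from speculatively after the branch at pc."""
    taken, target = predictor.predict(pc)
    return target if taken else pc + instruction_size

predictor = TwoBitPredictor()
first = next_fetch_address(predictor, 0x100)       # untrained: next sequential
predictor.update(0x100, taken=True, target=0x200)  # branch resolves as Taken
second = next_fetch_address(predictor, 0x100)      # now predicted Taken
print(hex(first), hex(second))
```

Instructions fetched from the returned address remain speculative until the branch condition is resolved and compared against the prediction.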
In addition, as state-of-the-art microprocessors continue to raise the standard of performance for today's computer systems, the microprocessor's clock frequency (i.e., processing speed) continues to increase. As a result, less time is available for actual computation within each pipeline stage. Most microprocessors have compensated by increasing the number of stages in the pipelined core while decreasing the levels of logic in each stage.
The tradeoff for deeply pipelined solutions is an increasing performance penalty for taken branch instructions. More specifically, the microprocessor typically resolves branch instructions at the back end of the pipeline. Therefore, if a branch instruction is determined, by the back end of the pipeline, to be taken, then all instructions presently in the pipeline behind the taken branch instruction are typically flushed (i.e., discarded).
The processor thereafter begins fetching and executing instructions at the branch target address specified by the taken branch instruction. The apparently unnecessary processing of the instructions that are flushed because the branch instruction is taken is therefore considered to be the "penalty" of the branch instruction.
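The growth of this penalty with pipeline depth can be estimated with a simple model. The sketch below assumes one instruction per stage, a branch resolved in the last stage unless stated otherwise, and no penalty source other than flushes on taken branches; the function names and the instruction and branch counts are hypothetical.

```python
def flushed_per_taken_branch(resolve_stage):
    """Instructions behind the branch when it resolves in the given
    stage (1-based), assuming one instruction per earlier stage."""
    return resolve_stage - 1

def total_cycles(n_instructions, depth, taken_branches, resolve_stage=None):
    """Idealized cycle count: pipeline fill, plus one cycle per useful
    instruction, plus the flushed slots for every taken branch."""
    if resolve_stage is None:
        resolve_stage = depth  # assumption: resolved at the back end
    fill = depth - 1
    penalty = taken_branches * flushed_per_taken_branch(resolve_stage)
    return fill + n_instructions + penalty

# Hypothetical workload: 1000 useful instructions, 100 taken branches.
shallow = total_cycles(1000, depth=5, taken_branches=100)
deep = total_cycles(1000, depth=20, taken_branches=100)
print("5-stage pipeline:", shallow, "cycles")
print("20-stage pipeline:", deep, "cycles")
```

Under these assumptions the deeper pipeline spends more than twice as many cycles on the same workload when measured in cycles alone, although each of its cycles may be shorter because of the higher clock frequency.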
Thus, maintaining competitive performance in a microprocessor having a high clock frequency places significant emphasis on branch prediction units to curb the penalty of taken branch instructions.