1. Field of the Invention
The present invention relates to the field of microprocessors and, more particularly, to branch prediction mechanisms within microprocessors.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time accorded to various stages of an instruction processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
An important feature of a superscalar microprocessor (and a superpipelined microprocessor as well) is its branch prediction mechanism. The branch prediction mechanism indicates a predicted direction (taken or not-taken) for a branch instruction, allowing subsequent instruction fetches to continue with the predicted instruction stream indicated by the branch prediction. The predicted instruction stream includes instructions immediately subsequent to the branch instruction in memory if the branch instruction is predicted not-taken, or the instructions at the target address of the branch instruction if the branch instruction is predicted taken. Instructions from the predicted instruction stream may be speculatively executed prior to execution of the branch instruction, and in any case are placed into the instruction processing pipeline prior to execution of the branch instruction. If the predicted instruction stream is correct, then the number of instructions executed per clock cycle is advantageously increased. However, if the predicted instruction stream is incorrect (i.e. one or more branch instructions are predicted incorrectly), then the instructions from the incorrectly predicted instruction stream are discarded from the instruction processing pipeline and the number of instructions executed per clock cycle is decreased.
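By way of illustration only, the choice of predicted instruction stream described above may be sketched as follows. The function name `next_fetch_address` and its parameters are hypothetical and are introduced here solely for explanation; in hardware this selection is made by the fetch logic rather than by software.

```python
def next_fetch_address(branch_address, branch_length, target_address, predicted_taken):
    """Return the address at which fetching continues after a predicted branch.

    All names here are illustrative assumptions, not part of any real design.
    """
    if predicted_taken:
        # Predicted taken: the predicted stream starts at the branch target.
        return target_address
    # Predicted not-taken: the predicted stream is the instruction
    # immediately subsequent to the branch in memory.
    return branch_address + branch_length
```

If the prediction later proves incorrect, the instructions fetched from the returned address are the ones discarded from the pipeline.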
One type of instruction for which branch prediction techniques are employed is the loop instruction. A loop instruction is used to execute a loop, or sequence of instructions, a number of times in succession. The number of executions, or iterations, of the loop is known as the loop count. The loop count is typically set by initializing a specified register prior to execution of the loop. In many cases, the specified register is pre-defined by the loop instruction. As used herein, the specified register used by the loop instruction is referred to as the "counter register".
A loop is delimited by the loop instruction, which executes as the last instruction in the loop. The loop instruction decrements the counter register (previously initialized with the loop count) and branches to a specified target address if the counter register is greater than zero. Since the specified target address is located at the beginning of the loop, branching to the specified target address causes another iteration of the loop to be performed. This sequence continues until the counter register is equal to zero. In this case, the loop instruction does not branch to the specified target address. Instead, execution continues with instructions located subsequent to the loop instruction in memory.
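The behavior of the generic loop instruction described above may be modeled with the following minimal sketch. The function `run_loop` is a hypothetical software model, and it assumes the counter register is initialized with a loop count of at least one.

```python
def run_loop(loop_count):
    """Simulate a loop closed by a generic loop instruction.

    Returns (iterations, outcomes), where outcomes records whether the
    loop instruction branched (True) or fell through (False) each time.
    Assumes loop_count >= 1 (an illustrative restriction).
    """
    counter = loop_count      # counter register initialized before the loop
    iterations = 0
    outcomes = []
    while True:
        iterations += 1       # the loop body executes once per iteration
        counter -= 1          # the loop instruction decrements the counter
        taken = counter > 0   # branch to the target if still greater than zero
        outcomes.append(taken)
        if not taken:
            break             # continue with the instruction after the loop
    return iterations, outcomes
```

Under this model, a loop count of four yields four iterations, with the loop instruction branching three times and falling through on the fourth execution.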
One example of a loop instruction is the "LOOP" instruction defined by the x86 instruction set. This instruction uses the ECX register (or the CX register, if operating in 16-bit mode) as the counter register. Like the generic loop instruction described above, the x86 "LOOP" instruction operates by decrementing the value in the counter register and branching to a target address specified as an operand of the instruction if the new value of the counter register is greater than zero.
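A single execution of the x86 "LOOP" instruction may be sketched as shown below. The function name `x86_loop_step` and the parameter `counter_bits` are hypothetical. The sketch treats the counter register as an unsigned value, so that "greater than zero" is equivalent to "non-zero", and it models only the counter update and branch decision, not any other architectural state.

```python
def x86_loop_step(counter, counter_bits=32):
    """Model one execution of LOOP; returns (new_counter, taken).

    counter_bits is 32 for ECX or 16 for CX (an illustrative parameter).
    """
    mask = (1 << counter_bits) - 1
    # The decrement wraps around as an unsigned value of the given width.
    new_counter = (counter - 1) & mask
    # The branch is taken when the decremented counter is non-zero.
    return new_counter, new_counter != 0
```

One consequence of the unsigned wraparound is that a counter register holding zero before the first execution wraps to its maximum value, and the branch is taken.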
In many cases, loop instructions are simply predicted taken on every execution. Such a prediction is correct for every iteration of the loop except the last one, in which the loop instruction falls through and is therefore mispredicted. For many implementations, this technique represents an acceptable level of accuracy. As the number of pipeline stages in microprocessors increases due to higher clock frequencies, however, the penalty for mispredicted branches increases as well. It thus becomes important to improve branch prediction accuracy as much as possible.
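The accuracy of an always-taken prediction may be estimated with the following sketch. The function `always_taken_stats` and the parameter `mispredict_penalty_cycles` are hypothetical, the penalty value is an assumed input rather than a figure for any particular microprocessor, and the loop count is assumed to be at least one.

```python
def always_taken_stats(loop_count, mispredict_penalty_cycles):
    """Estimate the cost of predicting every loop instruction as taken.

    Returns (accuracy, penalty_per_iteration). Assumes loop_count >= 1.
    Both inputs are illustrative; no real pipeline is modeled.
    """
    # Every execution of the loop instruction except the last is taken,
    # so exactly one prediction per loop is wrong.
    correct = loop_count - 1
    accuracy = correct / loop_count
    # The single misprediction penalty is amortized over all iterations.
    penalty_per_iteration = mispredict_penalty_cycles / loop_count
    return accuracy, penalty_per_iteration
```

Under these assumptions, a loop count of four gives an accuracy of 0.75, and the amortized penalty grows in direct proportion to the assumed misprediction penalty, which is why deeper pipelines make the final mispredicted iteration more costly.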
It would therefore be desirable to increase the accuracy of the branch prediction mechanism used in conjunction with the loop instruction.