1. Field of the Invention
The present invention relates to the field of microprocessors and, more particularly, a method and circuit for exact branch targeting.
2. Description of the Relevant Art
Pipelined microprocessors achieve high performance by executing an instruction per clock cycle and by choosing the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time accorded to various stages of the instruction processing pipeline within the microprocessor. Storage devices (e.g., registers) capture values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively.
The term "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. The pipeline may be divided into any number of stages at which portions of the instruction processing are performed. The pipeline may include (1) a fetch stage for fetching the instruction, (2) a decode stage for decoding the instruction, (3) an execution stage for executing the instruction, and (4) a write back stage for writing the execution results in the destination identified by the instruction.
Many microprocessors employ a branch prediction mechanism. The branch prediction mechanism indicates a predicted direction (taken or not taken) for a conditional branch instruction, allowing subsequent instruction fetch to continue within the predicted instruction stream indicated by the branch prediction prior to actual resolution of the conditional branch instruction. The predicted instruction stream includes instructions immediately subsequent to the branch instruction in program order if the branch instruction is predicted not taken, or the instructions at the target address of the branch instruction if the branch instruction is predicted taken. Instructions from the predicted instruction stream may be speculatively executed prior to execution and resolution of the branch instruction, and in any case, are placed into the instruction processing pipeline prior to execution of the branch instruction. If the predicted instruction stream is correct, then the number of instructions executed per clock cycle is advantageously increased. However, if the predicted instruction stream is incorrect (i.e., one or more branch instructions are predicted incorrectly), then the instructions from the incorrectly predicted instruction stream are discarded or purged from the instruction processing pipeline and the number of instructions executing per clock cycle is decreased.
Clearly, branch misprediction and the stalling that ensues is contrary to the goal of executing the most instructions per clock cycle in addition to the goal of keeping the pipeline full. There are several alternatives to employing a branch prediction mechanism for handling conditional branch instructions. The easiest alternative is to freeze the pipeline, holding any instructions after the branch until the conditional branch instruction is resolved and the branch destination is known. The attractiveness of this solution lies primarily in its simplicity. However, this solution requires that the stages prior to the execution stage remain empty until the conditional branch instruction is executed. A better and only slightly more complex solution is to always fetch the next instruction in program order as if the conditional branch is not taken. If the branch is taken, however, this alternative requires that the pipeline be flushed and the fetch restarted at the address at which the branch is taken. Another alternative, yet similar approach, is to always presume the branch as taken. Again, problems arise if the branch is not taken.
Some microprocessors employ another alternative to branch prediction mechanisms called delay branching. In delay branching, the compiler places the branch instruction earlier in the instruction stream. The instruction slots between the initial position of the branch instruction and the new position provided by the compiler for the branch instruction are provided with instructions which are independent of the branch instruction, valid, and useful. For example, the conditional branch instruction may be compiled so that four independent, valid, and useful instructions execute after the conditional branch instruction is fetched. Presuming the pipeline to be four stages long, the conditional branch instruction can be evaluated ahead of time so that no prediction occurs. In delayed branching, the branching instruction executes in the cycle it would have been fetched in the original instruction stream.
One problem with delayed branching is that delay branching cannot be implemented in existing x86 code. Delayed branching must use a modified instruction set. However, it appears that delay branching can be implemented in existing x86 instruction set if the microcode program is adequately modified. Another problem associated with delayed branching relates to superscalar microprocessors which have several pipelines. If the branch instruction is moved up to an early position within the instruction stream by the compiler, the branch instruction may not execute in the execute stage at the appropriate time since superscalars can execute multiple instructions simultaneously. This is particularly problematic with respect to loops where the entire loop can be executed in one clock cycle. In other words, superscalar microprocessors may limit the positions at which the branch instruction can be moved. If the branch instruction cannot be moved far enough up into the instruction stream, the pipeline may have to be operated with no OPs to compensate for the limitation. Executing no OPs limits microprocessor operating speed.