1. Field of the Invention
The present invention relates to pipelined computers and, in particular, to a branch resolution scheme for in-order pipelined computers.
2. Discussion of the Related Art
In a non-pipelined computer, each instruction executed by the computer is processed until the instruction is completed before processing begins on the next instruction in the sequence. Computer performance can be increased by pipelining instructions to increase the speed of the computers. A pipelined computer divides instruction processing into a series of steps or stages, such as fetch, decode, execute, and write, where each of the stages is executable in a single clock cycle. Because the stages are pipelined, the computer can operate on different instructions simultaneously at different stages of the pipeline. Therefore, once the pipeline is full, an instruction is generally completed every clock cycle, thereby increasing the computer""s throughput.
For example, during a first operation cycle, a first instruction is fetched from memory in the fetch stage. During a second operation cycle, the first instruction is decoded in the decode stage and a second instruction is fetched from memory in the fetch stage. During a third operation cycle, the first instruction is executed in the execute stage, the second instruction is decoded in the decode stage, and a third instruction is fetched from memory in the fetch stage. In a fourth operation cycle, the result of the first instruction is written to registers and memory in the write stage, the second instruction is executed in the execute stage, the third instruction is decoded in the decode stage, and a fourth instruction is fetched in the fetch stage. Processing continues in the pipeline such that a result from each instruction is available every operation cycle. Thus, in the example, a pipelined computer can process four instructions simultaneously, whereas a non-pipelined computer can only process one instruction at a time. Accordingly, the overall speed of computer processing can be significantly increased over a non-pipelined computer.
A pipelined computer operates most efficiently when instructions are executed in the order in which they appear in memory so that each instruction proceeds sequentially through the stages with another instruction proceeding sequentially one stage behind. However, some types of instructions can cause execution to jump to a specified instruction that is different from the next instruction in the sequence, thereby disturbing the sequence of instructions. One such type of instruction is a branch instruction, which either causes processing to jump to a new instruction at a target address designated by the branch instruction (branchxe2x80x9ctakenxe2x80x9d) or allows processing to continue with the next sequential instruction (branch xe2x80x9cnot takenxe2x80x9d).
As with other instructions, a branch instruction proceeds sequentially through the pipeline. Thus, the ranch instruction and each following instruction is processed through succeeding stages for several clock cycles until the branch instruction has completed the execution stage. At this point, the branch is resolved, i.e., taken or not taken. If the branch is taken, the fetch stage fetches the instruction at the target address, and the instructions following the branch instruction are cleared from each stage to xe2x80x9cflushxe2x80x9d the pipeline. Processing then continues with the instruction at the target address. However, the stages that were flushed are inactive until the instruction reaches each stage, which reduces the efficiency of the pipeline. If the branch is resolved as not taken, processing continues along the pipeline. Branches can be conditional (resolved as taken or not taken) or unconditional (always taken).
One way to increase the performance of executing a branch instruction is to predict the outcome of the branch instruction, i.e., in the decode stage, before it is executed. If the branch is predicted as taken, the instruction at the target address is fetched and inserted into the pipeline immediately following the branch instruction. However, if the branch is predicted as taken but resolved as actually not taken, the stages following the branch instruction are flushed and a mispredict penalty is incurred. If the branch is predicted as not taken, then the pipeline continues with normal sequential processing. However, if the branch is predicted as not taken but resolved as actually taken, the stages following the branch instruction are flushed, and the instruction at the target address is fetched and inserted into the pipeline. Again, a mispredict penalty is incurred. Accordingly, processing efficiency is reduced only when the branch is mispredicted.
Several types of branch prediction schemes have been developed to decrease the chances of misprediction and are well known in the art. One type of branch prediction scheme uses a branch target buffer (BTB) that stores a plurality of entries including an index to a branch instruction. In addition to the index, each entry may include an instruction address, an instruction opcode, and history information. With a branch target buffer, each instruction is monitored as it enters into the pipeline. An instruction address matching an entry in the branch target buffer indicates that the instruction is a branch instruction that has been encountered before. After the entry has been located, the history information is tested to determine whether or not the branch will be predicted to be taken.
Typically, the history is determined by a state machine which monitors each branch in the branch target buffer, and allocates bits depending upon whether or not a branch has been taken in the preceding cycles. If the branch is predicted to be taken, then the predicted instructions are inserted into the pipeline. Typically, the branch target entry will have opcodes associated with it for the target instruction, and these instructions are inserted directly into the pipeline.
Branch prediction schemes, however, do not completely eliminate mispredictions. With branch prediction schemes that are not 100% accurate, the computer must wait until the branch is resolved to determine if the branch was correctly predicted, typically in the execute or write stages. Because the branch may take several stages or clock cycles to resolve, the mispredict penalty can be substantial. Therefore, it is desirable to resolve the branck instruction as early as possible to decrease the amount of mispredict penalty incurred by waiting for branch resolution. Therefore, it is desirable to resolve the branch instruction as early as possible to decrease amount of mispredict penalty incurred by waiting for branch resolution.
In accordance with the present invention, branch resolution for an in-order pipelined processor is performed by scanning the stages of the pipeline to determine the oldest conditional branch instruction (i.e., the branch instruction farthest along in the pipeline) having enough condition codes for resolution.
After a branch is predicted as taken or not taken, stages in the pipeline are scanned from the later stages to the earlier stages until a stage is found with the necessary condition codes to resolve a branch, thereby allowing an in-order processor to quickly and simply resolve a branch as soon as enough condition codes are generated in a specific stage. If the branch resolution determines that the branch has been mispredicted, then program control is shifted to an alternate program counter (PC) to fetch the correct target address, insert that address into the pipeline, and clean out the pipeline. By resolving branches as soon as possible, branch mispredict penalties are minimized, thereby increasing the efficiency of the processor.
In one embodiment of the present invention, branch prediction occurs in the convert (C) stage, and branch resolution scans the pipeline from the stage (Z) after the write stage back to the decode (D) stage, i.e., Zxe2x86x92Wxe2x86x92Exe2x86x92Mxe2x86x92Axe2x86x92Rxe2x86x92D, to determine the oldest branch having sufficient condition codes for resolution. W is the write-back stage, E is the execute stage, M is the memory stage, A is the address generation stage, and R is the read stage.