1. Field of the Invention
The present invention relates to the handling of branch instructions in a computer having a multistage instruction pipeline. More specifically, the present invention is a method and apparatus to minimize processing delays caused by branch instructions.
2. Description of the Related Art
Computers are programmable calculating machines that execute algorithms under the control of instructions. All but the simplest algorithms require frequent decisions as to how they are to proceed, and these decisions affect the sequence of control instructions. Because control instructions are obtained in advance of the operations they perform, and because the decision as to how an algorithm is to proceed is a function of the operations, there is an unavoidable lag from the time when the decision is made to the time when operations following the decision begin. This time is often called the "branch penalty." Branch penalties can occur any time there is a break in the control flow of a program, such as occurs with conditional branches, unconditional jumps, indirect jumps, or return instructions.
Reduced instruction set computers, commonly referred to as RISC processors, are one of the more common computer architectures in use today. In a nutshell, RISC processors rely on simple, low level instructions of the same size. Instruction execution is broken up into various segments and processed in a multistage pipeline. The pipeline is structured such that multiple instructions may be processed at any given instant. For example, a five-stage pipeline may include separate stages for fetching an instruction from memory (instruction fetch stage), decoding the instruction (decode stage), fetching operands the instruction needs (operand fetch stage), executing the instruction (execution stage) and writing the results back to the appropriate register or memory location (write back stage). Up to five instructions can be processed at once in such a pipeline, one in each stage. Thus, such a RISC computer can theoretically achieve performance equivalent to executing one instruction each clock cycle.
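The overlap described above can be illustrated with a minimal sketch. It assumes an idealized five-stage pipeline with no stalls and one instruction entering per cycle; the names `STAGES`, `pipeline_cycles` and `occupancy` are hypothetical and are used only for illustration.

```python
# Idealized five-stage pipeline with no stalls (an assumption for illustration).
STAGES = ["fetch", "decode", "operand fetch", "execute", "write back"]


def pipeline_cycles(num_instructions):
    """Cycles needed to complete a run of instructions with no stalls."""
    if num_instructions <= 0:
        return 0
    # The first instruction takes one cycle per stage; each later
    # instruction completes one cycle after its predecessor.
    return len(STAGES) + (num_instructions - 1)


def occupancy(cycle, num_instructions):
    """Return {stage name: instruction index} for a 0-based cycle number."""
    in_flight = {}
    for stage_index, stage_name in enumerate(STAGES):
        # Instruction i enters fetch in cycle i and advances one stage per cycle.
        instr = cycle - stage_index
        if 0 <= instr < num_instructions:
            in_flight[stage_name] = instr
    return in_flight


if __name__ == "__main__":
    print(pipeline_cycles(100))
    print(occupancy(4, 10))
```

Under these assumptions, 100 instructions complete in 104 cycles, which approaches one instruction per clock cycle as the run grows longer.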
However, the existence and frequency of branch instructions limits the ability of a pipelined processor such as the RISC processor described above to achieve such performance. In the absence of special handling of branch instructions, the earliest the processor could possibly recognize that the branch is to be taken is at the instruction decode stage. At this point, however, the next instruction has already been fetched and possibly other actions have been taken. Thus, the fetched instruction and other actions must be discarded and a new instruction (the branch target) must be fetched.
This problem is compounded in current processors for two reasons. First, branches are common occurrences. Studies have shown that branch instructions generally occur about as often as once every five to ten instructions. Second, many current processors employ superscalar architectures that include multiple parallel pipelines capable of fetching and executing four or more instructions concurrently. In superscalar processors, it is more likely that a branch will be encountered, because more instructions are fetched in every cycle.
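The effect of branch frequency on throughput can be estimated with a simple model. The sketch below is an illustration only: the function name `effective_cpi`, the penalty of one cycle, and the assumption that every branch incurs the penalty are hypothetical values chosen for the example, not figures taken from any particular processor.

```python
def effective_cpi(branch_interval, penalty_cycles,
                  penalized_fraction=1.0, base_cpi=1.0):
    """Average cycles per instruction when some branches stall the pipeline.

    branch_interval:    one branch occurs every `branch_interval` instructions.
    penalty_cycles:     cycles lost per penalized branch (assumed value).
    penalized_fraction: share of branches that incur the penalty (assumed).
    base_cpi:           cycles per instruction with no branch penalty.
    """
    if branch_interval <= 0:
        raise ValueError("branch_interval must be positive")
    branches_per_instruction = 1.0 / branch_interval
    return base_cpi + branches_per_instruction * penalized_fraction * penalty_cycles


if __name__ == "__main__":
    # One branch every five instructions, one lost cycle per branch (assumed).
    print(effective_cpi(5, 1))
    # One branch every ten instructions, same assumed penalty.
    print(effective_cpi(10, 1))
```

With a branch every five instructions and an assumed one-cycle penalty on each, the model gives 1.2 cycles per instruction rather than the ideal 1.0. Longer penalties raise the cost proportionally.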
One way that programmers have addressed the branch problem is to implement elaborate schemes to predict whether a branch is likely to be taken and, when it is predicted to be taken, to fetch the instruction at the branch target address rather than the next sequential instruction. If the branch is correctly predicted, no delay in execution occurs. Only when the branch is incorrectly predicted is a throughput penalty suffered.
Predictive techniques that are well known in the art include various branch direction prediction methodologies, either alone or coupled with branch target address prediction. Direction prediction is an attempt to guess which way a branch will go before the condition is resolved. For example, one popular way to predict branch direction is to record a history of the past behavior of the particular branch instruction and then assume that the next time the branch is encountered, the direction selected will be the direction most often selected in the past. Alternatively, some code developers merely make static assumptions regarding the likely direction of a branch, either with hint codes in the branch itself, or simply by assuming that forward branches will not be taken and backward branches will be taken (reflecting the looping nature of many programs).
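The two direction prediction approaches described above can be sketched as follows. This is a simplified software model, not a description of any particular hardware predictor; the class `MajorityHistoryPredictor`, the function `static_predict`, and the use of unbounded per-branch counters are assumptions made for illustration.

```python
class MajorityHistoryPredictor:
    """Predicts the direction a given branch has taken most often so far."""

    def __init__(self):
        # branch address -> [not_taken_count, taken_count]
        self._counts = {}

    def predict(self, branch_addr, default=False):
        """Return True for 'taken'; fall back to `default` with no majority."""
        not_taken, taken = self._counts.get(branch_addr, (0, 0))
        if taken == not_taken:
            return default
        return taken > not_taken

    def update(self, branch_addr, taken):
        """Record the resolved direction of one execution of the branch."""
        counts = self._counts.setdefault(branch_addr, [0, 0])
        counts[1 if taken else 0] += 1


def static_predict(branch_addr, target_addr):
    """Static rule: backward branches taken, forward branches not taken."""
    return target_addr < branch_addr


if __name__ == "__main__":
    predictor = MajorityHistoryPredictor()
    # A loop-closing branch: taken nine times, then falls through once.
    for _ in range(9):
        predictor.update(0x40, True)
    predictor.update(0x40, False)
    print(predictor.predict(0x40))
    print(static_predict(0x100, 0x80))
```

A loop-closing branch that has been taken nine times and not taken once is still predicted taken by the history model, and the static rule predicts the same backward branch taken without any recorded history.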
Target address prediction is more difficult than direction prediction because branches typically have only two directions (taken or not taken), but may have billions of possible target addresses. Developers often include target address caches and/or return address stacks to speed the determination of a branch target address. A target address cache is typically a large RAM that stores the branch address and the likely target address.
Originally, target address caches were used as a mechanism for direction prediction. When an instruction was fetched, the same address was offered to the branch target cache, and when there was a match, the next instruction was fetched using the target address in the branch target cache. More recently, target address caches have also included other information useful in branch prediction, particularly for superscalar architectures. However, target address caches behave well only when the code they are executing has good locality of reference in the prediction caches. They are also unable to provide useful prediction information when a branch is first encountered.
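The lookup described above, together with the two stated weaknesses, can be modeled in a few lines. The sketch below is illustrative only: the class name `BranchTargetCache`, the least-recently-used replacement policy, the capacity, and the four-byte instruction size are all assumptions and do not describe any specific processor.

```python
from collections import OrderedDict


class BranchTargetCache:
    """Small model of a target address cache keyed by branch address."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._entries = OrderedDict()  # branch address -> target address

    def lookup(self, fetch_addr):
        """Return the cached target for this fetch address, or None on a miss."""
        target = self._entries.get(fetch_addr)
        if target is not None:
            # Mark the entry as most recently used (assumed LRU policy).
            self._entries.move_to_end(fetch_addr)
        return target

    def record(self, branch_addr, target_addr):
        """Store the target of a resolved taken branch, evicting if full."""
        self._entries[branch_addr] = target_addr
        self._entries.move_to_end(branch_addr)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop least recently used


def next_fetch_addr(cache, fetch_addr, instr_size=4):
    """On a hit fetch from the cached target, otherwise fetch sequentially."""
    target = cache.lookup(fetch_addr)
    return target if target is not None else fetch_addr + instr_size


if __name__ == "__main__":
    cache = BranchTargetCache(capacity=2)
    print(hex(next_fetch_addr(cache, 0x100)))  # first encounter: no entry yet
    cache.record(0x100, 0x200)
    print(hex(next_fetch_addr(cache, 0x100)))  # later encounter: cache hit
```

The first encounter of a branch misses and falls through to the sequential address. A working set of branches larger than the cache capacity evicts entries before they are reused, which is the locality dependence noted above.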
Given the possibility of mispredicting branches, code developers have taken other approaches to reduce branch penalty. For example, programmers may rely on the compiler to place one or more instructions after the branch that are to be executed regardless of whether or not the branch is evaluated as predicted. Such instructions are referred to as "delay slot" instructions because they are positioned in the slot or slots immediately following the branch instruction. If there is no appropriate instruction to place in the branch delay slots, one or more "no operation" instructions can be placed there instead. This technique is commonly referred to as "delayed branching," because instructions that relate to the branch are delayed by the instructions that appear in the delay slots. The idea behind delayed branching is that useful work can be accomplished during the processor cycles required to load instructions at the branch target into the pipeline. When delayed branching is used in combination with prediction, useful work can be accomplished during the time the processor takes to flush instructions for mispredicted branches and load the proper instructions into the pipeline.
While delayed branching is simple in concept, the implementation is complicated by two serious issues. The first relates to interruptability. If a branch is underway, there are actually two different program counters: the one being branched to and the one from which the delay slot instructions were taken, where instructions are still being executed. If an interrupt is taken, both of these instruction pointers must be saved, and upon return from the interrupt both must be restored and proper sequencing of the operations begun. The second is that the number of instructions migrated after the branch is usually fixed by the processor architecture. In the early MIPS and SPARC architectures, only one instruction could occupy the delay slot. More recent architectures may have four or more delay slots. In every case, however, the delay slots are architectural: they must be accounted for by code developers, even if they are occupied by no operation instructions. Determining appropriate instructions for multiple delay slots without using numerous "no operation" instructions, which impact processor performance, can quickly become a very complex problem, particularly where there may be multiple sequential branches in the space of a few instructions.
Accordingly, it would be highly desirable to process branches by implementing a delayed branching-type technique, combined with branch prediction techniques, but without being hampered by architecturally-dictated delay slot requirements and their accompanying interruptability issues. The present invention comprises a method and apparatus that can eliminate instruction gaps behind branch instructions in a multistage pipelined processor by employing a pre-branch instruction far enough ahead of each actual branch instruction. The pre-branch instruction technique is like delayed branching, in that some number of instructions behind the pre-branch instruction will still load and execute, accomplishing useful work. However, it is unlike delayed branching, because the number of instructions that execute between the pre-branch instruction and its corresponding actual branch is not architecturally fixed but rather is a matter of design implementation. Similarly, because the pre-branch instruction is not an actual branch but rather an upstream "hint" that a branch is coming, the processor does not execute down two separate paths after the pre-branch instruction. Therefore, using a pre-branch instruction does not raise the interrupt and program counter issues that are inherently problematic in delayed branching.
The pre-branch instruction is placed at the point in the instruction stream where it will be at the decode stage in the pipeline while its corresponding branch instruction is at the first fetch stage in the pipeline. In the case of conditional branches, the pre-branch instruction states the condition upon which the branch depends. In a preferred embodiment, the pre-branch instruction also includes one or more prediction bits that indicate whether the branch is predicted to be taken or not taken. The pre-branch instruction is then decoded. If the condition upon which the branch depends is known and dictates that the branch will be taken, or if the condition is not known but the branch is predicted to be taken, the instruction fetch unit begins to fetch instructions at the branch target. If the condition is known and dictates that the branch will not be taken, or if the condition is not known but the branch is predicted to be not taken, then the instruction fetch unit continues to fetch instructions along the main execution path. If the pre-branch instruction has been properly placed in the instruction stream, there will be no gap in the instruction stream behind the branch instruction, for all conditional branches whose conditions are known at the time that the pre-branch is decoded, and for all branches whose direction is correctly predicted. It is only when a branch's conditions are unknown and its direction is mispredicted that a gap in the instruction stream can occur while the mispredicted instructions are cancelled and the instruction fetch unit is redirected to load the correct instructions into the execution pipeline.
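The fetch decision made when the pre-branch instruction is decoded can be summarized in a short sketch. It is a software illustration of the decision rule stated above, not a description of the apparatus itself; the function names `fetch_from_target` and `causes_gap`, and the use of `None` to represent a condition that is not yet known, are assumptions made for the example.

```python
from typing import Optional


def fetch_from_target(resolved_condition: Optional[bool],
                      predict_taken: bool) -> bool:
    """Decide where the instruction fetch unit fetches after the branch.

    resolved_condition: True or False if the branch condition is already
                        known when the pre-branch is decoded, None if not.
    predict_taken:      the prediction bit carried by the pre-branch.
    Returns True to fetch at the branch target, False to continue along
    the main execution path.
    """
    if resolved_condition is not None:
        # Condition known: it dictates the direction, prediction is ignored.
        return resolved_condition
    # Condition not known: follow the prediction bit.
    return predict_taken


def causes_gap(resolved_condition: Optional[bool],
               predict_taken: bool,
               actual_taken: bool) -> bool:
    """True when the chosen fetch path differs from the branch's real outcome."""
    return fetch_from_target(resolved_condition, predict_taken) != actual_taken


if __name__ == "__main__":
    print(fetch_from_target(True, False))   # known taken overrides prediction
    print(fetch_from_target(None, True))    # unknown: follow prediction
    print(causes_gap(None, True, False))    # unknown and mispredicted
```

In this model a gap arises only in the case where the condition is unknown at decode time and the prediction bit disagrees with the eventual outcome, which matches the single failure case identified above.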