The invention relates to pipelined computer architectures, and more particularly to efficient use of branch delay slots and branch prediction in pipelined computer architectures.
Programmable computers comprise processing circuitry and some sort of storage mechanism (“memory”) for storing data and program instructions. In their simplest form, computers operate on a principle in which an instruction is fetched from memory and the processor executes (i.e., performs) the instruction. Execution of instructions may involve fetching one or more data operands from memory and/or register locations, producing some sort of data based on the fetched data, and then storing the result into a memory and/or register location.
A key characteristic of programmable processors is the ability for the processor itself to select which one of a number of sets of instructions will be executed based on the present state of one or more conditions. To take a very simple example, if a particular data item has a value of zero, the program designer may intend for one set of instructions to be performed, whereas if the particular data item has a nonzero value, the program designer may intend for a different set of instructions to be performed. The tested data item may have different values at different times during program execution, so the behavior of the program may change over time.
To enable this type of functionality, instructions are by default designed to be executed in sequence. Each storage location in memory is associated with an address (essentially, a number), and instructions that are intended to be unconditionally executed in sequence are stored in memory locations having sequentially increasing addresses. The processor might, for example, start operating by fetching and then executing the instruction located at memory address 0, followed by fetching and then executing the instruction located at memory address 1, and so on.
In order to change the flow of program execution, branch instructions are introduced. Typically the fetching of a branch instruction causes the processor to test whatever condition(s) is specified by the instruction. If the test outcome is that the condition is not satisfied, then the next instruction is fetched from the memory location that immediately follows the location at which the branch instruction is stored. However, if the test outcome is that the condition is satisfied, then instead of fetching the instruction that immediately follows the branch instruction, an instruction fetch is performed from a non-sequential memory address whose value is in some way specified by the branch instruction.
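By way of non-limiting illustration, the sequential and branching behavior described above can be sketched as a toy fetch-and-execute loop. The instruction names ("op", "branch_if_zero", "halt"), the program contents, and the function name run are hypothetical and are chosen only for this sketch; they are not taken from any particular processor.

```python
def run(program, data):
    """Execute a toy program; return the addresses of executed instructions."""
    pc = 0          # address of the next instruction to fetch
    trace = []
    while pc < len(program):
        op, *args = program[pc]
        trace.append(pc)
        if op == "branch_if_zero":
            name, target = args
            # Condition satisfied: fetch from the non-sequential target address.
            # Condition not satisfied: fall through to the next address.
            pc = target if data[name] == 0 else pc + 1
        elif op == "halt":
            break
        else:
            # Any other instruction: continue with the next sequential address.
            pc += 1
    return trace


# Hypothetical program: the branch at address 1 tests the data item "x".
PROGRAM = [
    ("op",),                     # address 0
    ("branch_if_zero", "x", 4),  # address 1
    ("op",),                     # address 2 (executed if branch not taken)
    ("halt",),                   # address 3
    ("op",),                     # address 4 (executed if branch taken)
    ("halt",),                   # address 5
]
```

In this sketch, the same program follows a different path depending on the value of the tested data item: run(PROGRAM, {"x": 0}) visits addresses 0, 1, 4, 5, whereas run(PROGRAM, {"x": 7}) visits addresses 0, 1, 2, 3.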
Pipelines
Through the years, computer engineers have come up with many ways of enabling computers to execute more instructions in less time. Of course, one way is simply to reduce the amount of time it takes to fetch instructions and execute them. Another way is to introduce parallelism into the architecture; that is, to allow different aspects of processing to take place concurrently. One type of architecture that exploits parallelism is a so-called pipelined architecture, in which each instruction is executed in a sequence of stages. As one instruction moves from one stage in the pipeline to the next, another instruction takes its place. When each of the stages has an instruction in it, and with the stages operating in parallel, the amount of time it takes to execute one instruction is effectively the amount of time it spends in one of the stages because, when each stage of a pipeline has a different instruction in it, a new execution result is produced at the end of the pipeline every time an instruction is shifted from one stage to the next.
More particularly, an instruction pipeline splits up the processing of each instruction into a sequence of dependent steps. Consider an exemplary pipelined processor consisting of the following stages:
Stage 1: Instruction fetch (IF1)
Stage 2: Instruction fetch (IF2)
Stage 3: Instruction decode and register fetch (ID1)
Stage 4: Instruction decode and register fetch (ID2)
Stage 5: Execute (EXE)
One consequence of splitting up instruction processing in this manner is that the effect of an instruction will not be reflected in the architectural state (i.e., performance of the instruction is not yet completed) before the next instruction is fetched. The number of cycles in an unstalled pipeline between the fetching of the instruction and its execution is referred to as the pipeline latency.
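As a minimal sketch of the exemplary five-stage pipeline, the following assumes that instruction number i enters stage IF1 in cycle i and that the pipeline never stalls. The function name stage_of and the constant PIPELINE_LATENCY are hypothetical. Under these assumptions an instruction fetched in a given cycle is executed four cycles later, which is the pipeline latency as that term is used above.

```python
STAGES = ["IF1", "IF2", "ID1", "ID2", "EXE"]


def stage_of(instr_index, cycle):
    """Return the stage holding instruction `instr_index` in `cycle`, or None.

    Assumption: instruction i enters IF1 in cycle i and there are no stalls.
    """
    position = cycle - instr_index
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None  # not yet fetched, or already past the EXE stage


# Cycles between the first fetch stage and the execute stage (unstalled).
PIPELINE_LATENCY = len(STAGES) - 1
```

For example, in cycle 4 instruction 0 occupies EXE while instruction 3 occupies IF2, so later instructions have already been fetched before instruction 0 has changed the architectural state.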
If the correct execution of an instruction depends on the result of a previous not yet completed instruction, a pipeline hazard occurs. Hazards can be avoided in both software (by properly re-scheduling instructions) and hardware (by stalling or forwarding).
Branches
Program code is rarely linear and thus, as explained earlier, contains jumps (branches) from one position in the code (branch source) to another position (branch target). Also as explained above, branches can be conditional: the branch is taken when the condition holds, otherwise it is not taken. The branch target and branch condition can be data dependent or data independent. These potential dependencies put constraints on the scheduling freedom of branches.
A branch target can only be fetched once it is known (i.e. the branch has been executed, at which time the branch condition is resolved and the branch target address computed).
The instructions to be executed following execution of the branch (i.e., the target instructions) are dependent on the correct execution of that branch. This means that the fetching of the target instructions to be executed following the branch can only reliably start after the branch has been executed.
FIG. 1a illustrates a code segment that includes a branch instruction. In this example, instructions (Instr) are numbered sequentially. A branch instruction has been placed just after Instr6. In this document, with respect to branches, the following notation is used:
“(NT)” means that a branch execution will result in the branch not being taken (i.e., the next sequentially occurring instruction following the branch will be executed).
“(T)→Instr#” means that the branch condition has been satisfied so that a branch will be taken, and that the target is Instr# (where “#” represents an instruction number).
It can therefore be seen that, in the example of FIG. 1a, the illustrated branch instruction has target 101 if the branch is not taken (NT), and target 103 if the branch is taken (T). In this example, branch execution results in the branch being taken, with the target instruction being Instr11. This means that the NT target instructions 101 will not be executed.
FIG. 1b is a processing sequence diagram 150 that illustrates how the branch instruction of FIG. 1a would be processed in the exemplary pipelined processor mentioned above. Each rectangle shows what instruction is contained in a given stage (IF1, IF2, ID1, ID2, EXE) of the pipeline. Time proceeds from left to right in the figure, and is denoted in terms of cycle number.
The example starts in Cycle 0, at which point the contents of stages EXE, ID2, ID1, IF2, and IF1 are Instr6, Branch, Instr7, Instr8, and Instr9, respectively. It will be understood that instructions Instr7, Instr8, and Instr9 have been fetched under the assumption that instructions should be fetched from sequential memory addresses unless an executed Branch instruction requires a different action.
In Cycle 1, Instr6 is no longer in the pipeline and each of the remaining instructions has advanced one stage in the pipeline. Although not shown in the Figure, the next sequential instruction, Instr10, has been fetched and loaded into the first stage of the pipeline (i.e., the IF1 stage). The Branch instruction has reached the EXE stage of the pipeline and the contents of stages ID2, ID1, IF2, and IF1 are Instr7, Instr8, Instr9, and Instr10, respectively.
However, as mentioned above, in this example the branch is to be taken, with the target being Instr11. The pipelined execution of the taken branch in the EXE stage during Cycle 1 means that the already fetched and partially-processed instructions contained in the earlier stages of the pipeline (i.e., stages IF1, IF2, ID1, and ID2) are the NT target instructions 101, and these should not be allowed to change the state of the computer. For this reason, the exemplary pipelined processor is configured not to execute (and thereby not to change the state of the processor) when each of these already fetched instructions reaches the EXE stage of the pipeline. This type of non-execution is called an “idle cycle”. In the example of FIG. 1b, it can be seen that the branch instruction in Cycle 1 causes the next four cycles to be idle cycles in the EXE stage, with the next instruction to be executed (Instr11) not reaching the EXE stage until Cycle 6.
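The sequence just described can be reproduced with a small simulation, offered here only as a sketch. It assumes the Cycle 0 pipeline contents given above, models a squashed (non-executed) instruction simply as the string "idle", and treats every instruction named "Branch" as a taken branch. The function name exe_trace and its parameters are hypothetical.

```python
def exe_trace(initial, branch_target, first_fetch, cycles):
    """Return what the EXE stage holds in each cycle, starting at Cycle 0.

    initial:       stage contents in Cycle 0, ordered IF1, IF2, ID1, ID2, EXE.
    branch_target: instruction number fetched after the taken branch resolves.
    first_fetch:   instruction number fetched into IF1 in Cycle 1.
    """
    pipe = list(initial)  # index 0 is IF1, last index is EXE
    next_fetch = first_fetch
    trace = []
    for _ in range(cycles):
        exe = pipe[-1]
        trace.append(exe)
        if exe == "Branch":
            # Taken branch resolved in EXE: the younger instructions already
            # in flight must not change state, so mark them as idle, and
            # redirect fetching to the branch target.
            pipe = ["idle"] * (len(pipe) - 1) + [exe]
            next_fetch = branch_target
        # Advance every instruction one stage and fetch a new one into IF1.
        pipe = [f"Instr{next_fetch}"] + pipe[:-1]
        next_fetch += 1
    return trace


# Cycle 0 contents taken from the example above (IF1 ... EXE).
CYCLE0 = ["Instr9", "Instr8", "Instr7", "Branch", "Instr6"]
TRACE = exe_trace(CYCLE0, branch_target=11, first_fetch=10, cycles=7)
```

Under these assumptions TRACE lists Instr6 in Cycle 0, Branch in Cycle 1, four idle entries for Cycles 2 through 5, and Instr11 in Cycle 6, in agreement with the description of FIG. 1b.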
This means that 4 cycles of processor time are essentially wasted, which is an undesirable effect. There are two commonly used ways to prevent the functional units of the processor from becoming idle due to the pipeline having to wait for the execution of the branch: branch delay slots and branch prediction. These are discussed in the following:
Branch Delay Slots
One way of reducing the number of idle cycles associated with a branch taken condition in a pipelined processor is to position the branch instruction within the set of instructions such that the sequentially next instructions immediately following the branch instruction are instructions that need to be executed regardless of whether the outcome of branch execution is “taken” or “not taken”. This technique is illustrated in FIG. 2. Three similar program segments are shown: “Original program” 250; “Branch with 4 branch delay slots filled” 260; and “Branch with 2 of 4 branch delay slots filled” 270. For each of these, it is assumed that the program is executed by a 5-stage pipelined processor as discussed earlier. For each of the examples, instructions Instr6 through Instr10 are the target instructions if the branch is not taken 201, and the instructions starting with Instr11 are the target instructions if the branch is taken 203.
The Original program 250 is very much like the one shown in FIG. 1a: the illustrated portion begins with five instructions (Instr1, . . . , Instr5) followed by a conditional branch instruction. Following the conditional branch instruction, another seven instructions are depicted (Instr6, . . . , Instr12). In this example, the condition tested by the branch instruction is satisfied, so the branch will be taken, with the target starting at Instr11 (i.e., instructions Instr6 through Instr10 are not to be executed). When this program segment is executed in the pipelined processor, the effect is as shown in FIG. 1b: there will be four idle cycles before the target instruction, Instr11, is executed.
It will be observed that if the compiler were to advance the placement of the branch by 4 instructions, as depicted in the example called “Branch with 4 branch delay slots filled” 260, the pipeline latency would be completely hidden for this branch. This is because when the branch instruction is in the EXE stage of the pipeline, the remaining four stages of the pipeline will be working on instructions Instr2 through Instr5 which, according to the Original program 250, are required to be executed regardless of the outcome of the branch.
When this technique is used, the instruction positions that fill the pipeline stages when the branch instruction is in the EXE stage of the pipeline are called “branch delay slots”. This technique of repositioning the branch instruction to an earlier position within the program code separates the location that the branch instruction occupies in the program code from the branch source (i.e., the position in the code from which execution jumps to another location based on the outcome of the branch instruction). That is, the last branch delay slot is now the branch source.
Thus, branch delay slots are scheduling slots for instructions representing the pipeline latency of their associated branch. The number of branch delay slots is therefore conventionally fixed and equal to roughly the pipeline depth. The branch delay slots are positioned directly after the branch instruction and are always executed, irrespective of the outcome of the branch instruction.
The branch delay slot strategy is not perfect, however, because branches can only be advanced up to the slot immediately following the last instruction that determines the branch behavior. If, in the Original program 250 shown in FIG. 2, execution of the instruction “Instr3” determines the branch behavior (i.e., the state that will be tested in the branch condition), the branch cannot be advanced any earlier than “Instr3” because it would be evaluating the state of a condition that had not yet been determined. In this case, only instructions “Instr4” and “Instr5” can be used to fill the branch delay slots, leaving two unfilled branch delay slots. The unfilled branch delay slots will contain so-called NOPs (“No Operation” instructions—instructions that do not change the state of the computer). This is illustrated in the program segment called “Branch with 2 of 4 branch delay slots filled” 270. Every NOP will lead to an idle functional unit and thus to performance loss.
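The placement rule described above can be sketched as follows. This is an illustration only, not a description of any particular compiler: the function name schedule_with_delay_slots and its parameters are hypothetical, and the sketch assumes that the branch may be advanced past any instruction that does not determine the branch behavior.

```python
def schedule_with_delay_slots(before_branch, num_slots, last_dependency=None):
    """Advance the branch as far as dependencies allow; pad with NOPs.

    before_branch:   instructions preceding the branch in the original order.
    num_slots:       number of branch delay slots (4 in the 5-stage example).
    last_dependency: last instruction that determines the branch behavior,
                     or None if none of the preceding instructions does.
    """
    if last_dependency is None:
        earliest = 0
    else:
        # The branch cannot be placed before the instruction it depends on.
        earliest = before_branch.index(last_dependency) + 1
    # The branch is advanced by at most `num_slots` instructions.
    position = max(earliest, len(before_branch) - num_slots)
    slots = before_branch[position:]
    nops = ["NOP"] * (num_slots - len(slots))  # unfilled delay slots
    return before_branch[:position] + ["Branch"] + slots + nops


ORIGINAL = ["Instr1", "Instr2", "Instr3", "Instr4", "Instr5"]
ALL_FILLED = schedule_with_delay_slots(ORIGINAL, 4)
TWO_FILLED = schedule_with_delay_slots(ORIGINAL, 4, last_dependency="Instr3")
```

With no dependency, ALL_FILLED places the branch directly after Instr1 so that Instr2 through Instr5 fill all four delay slots. With Instr3 determining the branch behavior, TWO_FILLED places the branch after Instr3, followed by Instr4, Instr5, and two NOPs.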
Branch Prediction
Another way to mitigate the performance loss due to pipeline latency when a branch is performed is to predict the outcome of a branch in advance of the branch's actual time of execution. Ideally, for the exemplary pipelined processor, the target of a branch would be predicted when the branch source is in the IF1 stage. This would allow the branch target to be fetched during the next cycle, and no performance loss would occur.
The prediction of branches is done by a specialized unit: the branch prediction unit (BPU). A BPU contains memories to keep track of the branch information that becomes available once a branch has been executed. When a branch is fetched, the BPU's internal algorithms predict the branch information (what the target instruction is, whether the branch will be taken (i.e., the “branch direction”), etc.) based on historical and/or contextual information with respect to this branch. Branch prediction techniques are described in, for example, Scott McFarling, “Combining Branch Predictors”, WRL Technical Note TN-36, Jun. 1993, pp. 1-25, Digital Western Research Laboratory, Palo Alto, Calif., USA.
Having predicted the target and direction of the branch in IF1, the predicted target is fetched in the very next cycle, so that there need not be any idle cycles regardless of branch outcome if the prediction is correct. FIG. 3 is a processing sequence diagram 300 that illustrates the performance improvement that can be achieved when branch target and direction can be correctly predicted. Using the same exemplary code segment shown as “Original program” in FIG. 2 and an exemplary 5-stage pipelined processor with branch prediction, the processing sequence diagram 300 shows the pipeline in Cycle 1, at which point instructions Instr2, Instr3, Instr4, and Instr5 are in pipeline stages EXE, ID2, ID1, and IF2, respectively. Further, the branch instruction has just been loaded into pipeline stage IF1.
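Purely as an illustration of how a predictor might keep track of branch history, the following sketch uses a table of two-bit saturating counters, which is one well-known prediction scheme. It is not asserted to be the scheme used by any particular BPU. The class name, table size, initial counter value, and the use of a dictionary for remembered targets are all assumptions made for this sketch.

```python
class TwoBitPredictor:
    """Toy predictor: one two-bit saturating counter per table entry."""

    def __init__(self, table_size=16):
        # Counter values 0 and 1 predict "not taken"; 2 and 3 predict "taken".
        # Assumption: every counter starts at 1 (weakly not taken).
        self.counters = [1] * table_size
        self.targets = {}  # last observed target for each branch address

    def predict(self, branch_address):
        """Return (predicted_taken, predicted_target_or_None)."""
        index = branch_address % len(self.counters)
        taken = self.counters[index] >= 2
        return taken, self.targets.get(branch_address)

    def update(self, branch_address, taken, target=None):
        """Record the actual outcome once the branch has been executed."""
        index = branch_address % len(self.counters)
        if taken:
            self.counters[index] = min(3, self.counters[index] + 1)
            self.targets[branch_address] = target
        else:
            self.counters[index] = max(0, self.counters[index] - 1)
```

Because each counter saturates, a branch that has been taken twice in a row continues to be predicted as taken after a single not-taken outcome, which illustrates how historical information can influence the predicted direction.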
In this example, the prediction made in the IF1 stage is that the branch will be “taken”, and the target is Instr11 (denoted in the figure as “T→Instr11”). Accordingly, in Cycle 2, the predicted target instruction (Instr11) is fetched and loaded into the IF1 stage when the instructions in each of the other stages advance one stage in the pipeline. Since there are no other branch instructions in this example, instructions are fetched from sequential memory locations in each of cycles 3, 4, 5, and 6.
The actual evaluation of the branch instruction takes place when the branch instruction reaches the EXE stage in Cycle 5. Assuming that the prediction was correct (i.e., that the branch is taken with Instr11 being the target), the target instruction reaches the EXE stage in the very next cycle (Cycle 6). In this manner, the need for idle cycles has been avoided.
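The timing described above can be checked with a short calculation, given here as a sketch. It assumes one instruction fetch per cycle, no stalls, and a correct prediction made while the branch is in IF1. The function name exe_schedule and the variable names are hypothetical.

```python
STAGES = ["IF1", "IF2", "ID1", "ID2", "EXE"]


def exe_schedule(fetch_order, first_cycle):
    """Map each fetched instruction to the cycle in which it reaches EXE.

    Assumptions: the first instruction enters IF1 in `first_cycle`, one
    instruction is fetched per cycle, and the pipeline never stalls.
    """
    depth = len(STAGES) - 1  # cycles from IF1 to EXE
    return {
        instr: first_cycle + offset + depth
        for offset, instr in enumerate(fetch_order)
    }


# The branch enters IF1 in Cycle 1 and is predicted taken with target
# Instr11, so the predicted target is fetched in the very next cycle.
FETCH_ORDER = ["Branch", "Instr11", "Instr12", "Instr13"]
SCHEDULE = exe_schedule(FETCH_ORDER, first_cycle=1)
```

Under these assumptions the branch reaches EXE in Cycle 5 and Instr11 reaches EXE in Cycle 6, so no cycle of the EXE stage is left idle between them.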
It is noted that technology has not yet advanced to the point at which it is always possible to make perfect branch predictions. For this reason, pipelined architectures continue to suffer from idle cycles even when branch prediction technology is employed.
The inventors have determined that each of the conventional techniques for dealing with idle cycles that result from execution of branch instructions in a pipelined processor falls short in a number of aspects. It is therefore desired to provide improved technology for avoiding the occurrence of idle cycles.