Computers known as pipeline processors process instructions in stages. Instruction processing on a pipeline processor is analogous to a physical pipeline, with each consecutive step of processing an instruction performed in an adjacent stage of the pipeline. Pipeline processors generally send each instruction into the pipeline while the instructions preceding it are still in the pipeline.
The capability of a pipeline processor to execute instructions in parallel is exploited fully when all of its stages are kept active. A set of one or more adjacent inactive stages in a pipeline is called a bubble. An operation called a conditional branch can create bubbles. A conditional branch operation comprises an instruction to jump, upon the existence of a specified condition, to the instruction immediately following a block of instructions which physically follows the jump instruction. The block of instructions is called a conditional block. The instruction to jump is called a branch instruction. The condition depends on the results of one or more instructions (called dependent instructions) preceding the conditional branch.
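In C terms, this structure can be sketched with a simple if statement: the compiler typically emits a branch instruction that jumps past the body of the if (the conditional block) when the tested condition fails. The function below is a hypothetical illustration of the pattern, not taken from any figure:

```c
/* Hypothetical sketch of a conditional branch operation.
 * The subtraction is the dependent (compute) instruction; the if
 * compiles to a comparison plus a branch instruction that jumps
 * past the conditional block; the assignment inside the braces is
 * the conditional block. */
int clamp_to_limit(int value, int limit)
{
    int excess = value - limit;   /* dependent instruction           */
    if (excess > 0) {             /* comparison + branch instruction */
        value = limit;            /* conditional block               */
    }
    return value;
}
```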
Conditional branches create bubbles as follows. The pipeline processor must finish processing the branch instruction before it can determine the next instruction to enter the pipeline. Therefore, no new instructions enter the pipeline until the branch instruction leaves the pipeline. Accordingly, a bubble the length of the entire pipeline can form following the conditional branch.
Many pipeline processors have bubble-reducing hardware for reducing or eliminating bubbles. Such pipeline processors are capable of overlapping the processing associated with a branch instruction and other instructions independent of the branch (called independent instructions).
FIG. 1 shows a block diagram of an example pipeline processor with bubble-reducing hardware. Specifically, the pipeline processor of FIG. 1 reduces bubbles by employing parallel execution units comprising a branch processor 114 and two arithmetic processors 115.
Referring to FIG. 1, an instruction cache 110 retrieves instructions from a main memory 112 and sends them to a branch processor 114. The branch processor 114 executes branch instructions and dispatches arithmetic instructions to arithmetic processors 115 such as a fixed point processor 116 and a floating point processor 118. The fixed point processor 116 executes arithmetic instructions involving fixed point data. The floating point processor 118 executes arithmetic instructions involving floating point data. The arithmetic instructions could be computation instructions or comparison instructions.
The arithmetic processors 115 store results of computation instructions to a data cache 122, from which the data is stored into the main memory 112. The arithmetic processors 115 set a condition register 120 in the branch processor 114 to indicate the results of comparison instructions.
The separate execution units for processing branch and arithmetic instructions enable the pipeline processor of FIG. 1 to overlap the processing associated with a branch instruction and arithmetic instructions as follows. The branch processor 114 dispatches arithmetic instructions and processes some branch instructions without waiting for the arithmetic processors 115 to execute instructions which it has previously dispatched to them. Because the branch processor 114 dispatches instructions faster than the arithmetic processors 115 execute them, there are generally several instructions which have been dispatched but not yet executed. Accordingly, if the branch processor 114 temporarily stops dispatching instructions, the arithmetic processors 115 execute previously dispatched instructions.
However, the architecture of the pipeline processor of FIG. 1 introduces another possible type of delay when processing conditional branches. Specifically, the branch processor 114 cannot process a branch instruction until the arithmetic processor 115 has set the condition register 120 for the comparison on which the branch depends. Therefore, if a branch instruction closely follows the comparison on which it depends, a bubble can form in front of the conditional branch.
The bubble-reducing hardware for reducing or eliminating bubbles following conditional branches is ineffective unless instructions are provided whose processing can be overlapped with the processing associated with the conditional branch. Various optimization techniques provide such instructions. A first such optimization technique involves performing a dependency analysis on instructions and rearranging the order of the instructions so that they can be overlapped. On the pipeline processor of FIG. 1, for example, the rearranged instructions could execute while the branch processor 114 is waiting for the condition register 120 to be set. The rearranged instructions could also be executed while the branch processor 114 is processing the branch instruction.
The amount by which the first optimization technique reduces bubbles before and after a conditional branch instruction is proportional to the processing time of the independent instructions. Therefore, the first optimization technique will not completely eliminate the bubbles if there are insufficient independent instructions that it can place between the dependent instructions and the conditional branch. The latter is often the case in a compute-compare-branch loop. A compute-compare-branch loop is a loop which contains a compute-compare-branch sequence, that is, a sequence of instructions to execute one or more computations, execute a comparison dependent on the computations, and execute a conditional branch conditioned on the comparison. Many computers spend a significant amount of time processing compute-compare-branch loops.
Execution of the compute-compare-branch loop is as follows:
______________________________________
         compute_1
         compare_1
         condition_1
         compute_2
         compare_2
         condition_2
         . . .
         compute_n
         compare_n
         condition_n
______________________________________
In the above representation, and in the representations below, only the compute-compare-branch sequences of the compute-compare-branch loop are shown. The loop could contain additional instructions. Also, compute_i represents the computation operation of the ith iteration of the sequence, compare_i represents the comparison operation of the ith iteration, and condition_i represents the conditional branch operation of the ith iteration. As used in this document, operation means a task which is performed by the execution of one or more machine instructions.
An example of a compute-compare-branch loop would be a loop that computes a value of a variable x, compares x to a variable Xmax which represents the greatest value of x computed so far, and assigns x to Xmax if x is greater than Xmax. Pseudocode for the example loop is as follows:
______________________________________
begin loop
  compute x
  compare x to Xmax
  if (x > Xmax) then
    Xmax <- x
  end if
end loop
______________________________________
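The pseudocode above can be rendered in C as follows. The compute step, (i * 7) % 11, is a hypothetical function of the loop index chosen only so the sketch is self-contained and runnable:

```c
/* C sketch of the example compute-compare-branch loop.  Each pass
 * computes x, compares x to Xmax, and conditionally executes the
 * conditional block that updates Xmax. */
int max_computed(int n)
{
    int Xmax = 0;
    for (int i = 1; i <= n; i++) {
        int x = (i * 7) % 11;     /* compute x          */
        if (x > Xmax) {           /* compare x to Xmax  */
            Xmax = x;             /* conditional block  */
        }
    }
    return Xmax;
}
```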
FIG. 2A shows two iterations of a compute-compare-branch sequence on the pipeline processor of FIG. 1. In FIG. 2A, as well as in FIG. 2B, blocks in the right-hand column represent the activities of the branch processor 114, and blocks in the left-hand column represent the activities of one of the arithmetic processors 115. Time proceeds from the top of the columns to the bottom. Each arrow indicates a dependency of the operation to which it points on the operation from which it points. Each arrow pointing from right to left represents a delay between the dispatch of an operation from the branch processor 114 and the execution of that operation by the arithmetic processor 115. Each arrow pointing from left to right represents a delay between the execution of a comparison instruction by the arithmetic processor 115 and the setting of the condition register 120 in the branch processor 114. Each bubble indicates a period of inactivity in the arithmetic processor 115.
Looking at FIG. 2A, the branch processor 114 dispatches compute_1 in block 202. Immediately thereafter, the branch processor 114 dispatches compare_1 in block 204. The branch processor 114 then attempts to execute condition_1. However, it cannot do so until the condition register is set for compare_1. The condition register will be set after the delay indicated by arrow 214 from block 212.
Meanwhile, the arithmetic processor 115 processes the first iteration of the sequence as follows. After the delay indicated by arrow 206 from block 202, it receives compute_1 from the branch processor 114 and executes it in block 208. After the delay indicated by arrow 210 from block 204, the arithmetic processor 115 receives compare_1 from the branch processor 114 and executes it in block 212.
Although the operations of blocks 208 and 212 depend on the operations of blocks 202 and 204, there is no bubble in the arithmetic processor 115 between block 208 and block 212 for the following reasons. First, the branch processor 114 dispatches compare_1 immediately after dispatching compute_1. Second, dispatching an operation takes no longer than executing it. Third, the amount of time it takes for a dispatched instruction to reach the arithmetic processor 115 is constant. Therefore, the arithmetic processor 115 will have received compare_1 by the time it has finished executing compute_1.
After the delay indicated by arrow 214 from block 212, the branch processor 114 executes branch_1 in block 216. Branch_i represents the branch instruction of the ith iteration. Assuming the condition of condition_1 is not met, the branch processor 114 next dispatches compute_2 in block 218. Immediately thereafter, it dispatches compare_2 in block 220. The branch processor 114 cannot execute condition_2 until the condition register is set for compare_2.
After the sum of the delay indicated by arrow 214 from block 212, the execution time of block 216, and the delay indicated by arrow 222 from block 218, the arithmetic processor 115 receives compute_2 from the branch processor 114 and executes it in block 224. This sum is the amount of time between the execution of blocks 212 and 224 and is represented by bubble 225.
After the delay indicated by arrow 226 from block 220, the arithmetic processor 115 receives compare_2 from the branch processor 114 and executes it in block 228. Although the operations of blocks 224 and 228 depend on the operations of blocks 218 and 220, there is no bubble between block 224 and block 228 for the same reason there is no bubble between blocks 208 and 212.
After the delay indicated by arrow 230 from block 228, the branch processor 114 executes branch_2 in block 232. Assuming the condition of branch_2 is met, the branch processor 114 next dispatches conditional_block in block 234. Conditional_block represents the conditional block associated with branch_2.
After the sum of the delay indicated by arrow 230 from block 228, the execution time of block 232, and the delay indicated by arrow 236 from block 234, the arithmetic processor 115 receives conditional_block from the branch processor 114 and executes it in block 238. This sum is the amount of time between the execution of blocks 228 and 238 and is represented by bubble 240.
A second optimization technique, called loop unrolling, can reduce bubbles resulting from a conditional branch in a compute-compare-branch loop. Loop unrolling is performed by combining the instructions that would have been executed in two or more iterations of the original loop into each iteration of an unrolled loop. The number of iterations of the unrolled loop is reduced accordingly. An unrolled loop which contains the instructions of i iterations of the original loop in each iteration is said to be unrolled to a level of i. Unrolling the loop provides additional independent instructions. The method of the first optimization technique can then be performed to rearrange the unrolled loop so that execution of the independent instructions overlaps the processing of the instructions associated with the conditional branch.
A compute-compare-branch loop unrolled to a level of two executes essentially as follows:
______________________________________
         compute_1
         compare_1
         compute_2
         compare_2
         condition_1
         condition_2
         compute_3
         compare_3
         compute_4
         compare_4
         condition_3
         condition_4
         . . .
         compute_(n-1)
         compare_(n-1)
         compute_n
         compare_n
         condition_(n-1)
         condition_n
______________________________________
The above representation shows only the compute-compare-branch sequences of a loop which could contain additional instructions. The subscripts in the above representation indicate the iteration of the original loop with which the computations, comparisons and conditional blocks are associated.
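The unrolled sequence can be sketched in C using the earlier max-finding example. The compute step, (i * 7) % 11, is a hypothetical function of the index chosen only for illustration, and an even trip count n is assumed. Note that the second comparison must be adjusted to account for the conditional block of condition_1, since Xmax may change between the two conditional branches:

```c
/* Compute-compare-branch loop unrolled to a level of two (sketch).
 * Assumes n is even.  Each pass performs both computes and both
 * compares before either conditional branch, providing independent
 * instructions that can overlap the pending comparisons. */
int max_computed_unrolled(int n)
{
    int Xmax = 0;
    for (int i = 1; i <= n; i += 2) {
        int x1 = (i * 7) % 11;            /* compute_1 */
        int x2 = ((i + 1) * 7) % 11;      /* compute_2 */
        int c1 = x1 > Xmax;               /* compare_1 */
        int c2 = x2 > Xmax && x2 > x1;    /* compare_2, against whichever
                                             of Xmax and x1 survives
                                             condition_1 */
        if (c1)                           /* condition_1 */
            Xmax = x1;
        if (c2)                           /* condition_2 */
            Xmax = x2;
    }
    return Xmax;
}
```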
Note that if the original number of iterations is odd, execution is slightly different. For example, the first iteration of the original loop could be executed explicitly before the loop. The loop would then execute essentially as follows:
______________________________________
         compute_1
         compare_1
         condition_1
         compute_2
         compare_2
         compute_3
         compare_3
         condition_2
         condition_3
         compute_4
         compare_4
         compute_5
         compare_5
         condition_4
         condition_5
         . . .
         compute_(n-1)
         compare_(n-1)
         compute_n
         compare_n
         condition_(n-1)
         condition_n
______________________________________
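The peeled arrangement can be sketched in C, again with the hypothetical compute step (i * 7) % 11 and the max-finding example, and with n assumed odd:

```c
/* Odd trip count: execute iteration 1 explicitly before the loop,
 * then run the remaining even number of iterations two at a time
 * (sketch). */
int max_computed_peeled(int n)    /* n assumed odd */
{
    int Xmax = 0;
    int x = (1 * 7) % 11;         /* iteration 1, peeled out of the loop */
    if (x > Xmax)
        Xmax = x;
    for (int i = 2; i <= n; i += 2) {   /* iterations 2..n, two at a time */
        int x1 = (i * 7) % 11;
        if (x1 > Xmax)
            Xmax = x1;
        int x2 = ((i + 1) * 7) % 11;
        if (x2 > Xmax)
            Xmax = x2;
    }
    return Xmax;
}
```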
Whether loop unrolling optimization provides enough independent instructions to eliminate bubbles depends on the nature of the specific loop unrolled. For example, in processing an iteration of an unrolled loop having the operations of original loop iterations i-1 and i on the pipeline processor of FIG. 1, loop unrolling would reduce the bubble preceding condition_(i-1) by the amount of time it would take the arithmetic processor 115 to process compute_i and compare_i. It would reduce the bubble preceding condition_i by the amount of time it would take to process condition_(i-1).
Loop unrolling optimization has several potential weaknesses. First, if the conditional block associated with condition_(i-1) of the compute-compare-branch sequence is either rarely executed or requires little time to execute, loop unrolling would not substantially reduce the bubble associated with condition_i. Because condition_i represents half of the conditional branches in the loop, the performance benefits of unrolling such a loop would be limited.
Furthermore, loop unrolling optimization can be performed only on a compute-compare-branch sequence which executes on every iteration of the compute-compare-branch loop. Such a sequence is called an unconditional compute-compare-branch sequence. Loop unrolling therefore does not reduce bubbles associated with a compute-compare-branch sequence within a conditional block in a compute-compare-branch loop. Such a sequence is called a conditional compute-compare-branch sequence.
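A conditional compute-compare-branch sequence can be illustrated with the following hypothetical C sketch, in which the inner compute, compare, and conditional block execute only on iterations where the outer condition holds, and so cannot be targeted by loop unrolling:

```c
/* Hypothetical conditional compute-compare-branch sequence: the
 * inner sequence (compute y, compare y to Ymax, conditionally
 * update Ymax) sits inside the conditional block of the outer
 * branch on x, so it does not execute on every iteration. */
int max_even_half(int n)
{
    int Ymax = 0;
    for (int i = 1; i <= n; i++) {
        int x = (i * 7) % 11;      /* unconditional compute          */
        if (x % 2 == 0) {          /* outer conditional block:       */
            int y = x / 2;         /*   compute                      */
            if (y > Ymax)          /*   compare + conditional branch */
                Ymax = y;          /*   conditional block            */
        }
    }
    return Ymax;
}
```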
Although the problem of delays associated with conditional branches has been explained in the context of the pipeline processor of FIG. 1, the problem extends to all pipeline processors.
What is needed, therefore, is an optimization technique which improves pipeline processor efficiency in processing a compute-compare-branch loop in which the average computation time of the conditional block associated with the compute-compare-branch sequence is small or in which the compute-compare-branch sequence is conditional.