Many computer programs spend a significant percentage of their total execution time executing loops. Therefore, reducing the time required to execute the loops in a program can dramatically reduce the time required to execute the entire program. One such loop optimization technique, called software pipelining, aims to reduce loop execution time by overlapping the execution of several iterations of the loop, thus increasing instruction-level parallelism. Specifically, modulo scheduling algorithms periodically begin the execution of a loop iteration before previous loop iterations have completed execution, starting a new loop iteration once every initiation interval (II) processor cycles. Thus, after some initial startup cycles, the throughput is one iteration every II cycles.
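By way of illustration only, the throughput relationship described above can be sketched in Python. The function names and the parameter iter_length (the number of cycles a single iteration takes from its first to its last instruction) are hypothetical and are introduced solely for this sketch, which assumes that every iteration has the same length and that a new iteration starts exactly every II cycles.

```python
def sequential_cycles(n_iters, iter_length):
    """Cycles needed when each iteration finishes before the next one starts."""
    return n_iters * iter_length


def pipelined_cycles(n_iters, ii, iter_length):
    """Cycles needed when a new iteration starts every `ii` cycles.

    The last iteration starts at cycle (n_iters - 1) * ii and then needs
    iter_length cycles to drain.
    """
    if n_iters <= 0:
        return 0
    return (n_iters - 1) * ii + iter_length


if __name__ == "__main__":
    # Four single-cycle instructions per iteration, II of one cycle.
    for n in (4, 5, 100):
        print(n, sequential_cycles(n, 4), pipelined_cycles(n, 1, 4))
```

For large iteration counts the iter_length term becomes negligible, which is why the steady-state cost approaches II cycles per iteration.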
Consider the examples depicted in FIGS. 1A and 1B. In both examples, the loop body 102 consists of four single-cycle latency instructions I1, I2, I3 and I4. In FIG. 1A, there are four iterations of the loop (Iter1 to Iter4), with a new iteration beginning each cycle. In cycle zero, I1 of the first loop iteration executes. In cycle one, I2 of the first loop iteration executes along with I1 of the second loop iteration, and so on. Notice that in cycle three, each of the four instructions in the loop is executed, although on behalf of four different iterations of the loop. The code in which every instruction in the loop is executed in the same cycle (or in the same stage, as will be discussed later), each on behalf of a different loop iteration, is called the kernel 104. In cycle four, the first loop iteration is complete, and I4, I3, and I2 of the second, third, and fourth loop iterations execute, respectively.
In FIG. 1B, five loop iterations need to be executed, beginning a new iteration each cycle. Notice that the code executed in cycle four is identical to the code executed in cycle three. Thus, to execute more loop iterations, the kernel 104 need only be executed additional times. In particular, notice that the code leading up to the kernel 104 and the code leading away from the kernel 104 are the same for both FIG. 1A and FIG. 1B.
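The cycle-by-cycle behavior described for FIGS. 1A and 1B can be reproduced with a short Python sketch. The function name overlapped_schedule is hypothetical, and the sketch assumes single-cycle instructions issued one per cycle within each iteration, as in the examples above.

```python
def overlapped_schedule(instrs, n_iters, ii=1):
    """Map each cycle to the (iteration, instruction) pairs executing in it.

    Iteration numbers are 1-based to match Iter1, Iter2, ... in the text.
    """
    table = {}
    for it in range(n_iters):
        start = it * ii  # cycle in which this iteration issues its first instruction
        for k, name in enumerate(instrs):
            table.setdefault(start + k, []).append((it + 1, name))
    return table


if __name__ == "__main__":
    body = ["I1", "I2", "I3", "I4"]
    for cycle, work in sorted(overlapped_schedule(body, 5).items()):
        print(cycle, work)
```

Running the sketch with five iterations shows cycles three and four executing the same four instructions, differing only in the iterations on whose behalf they execute.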
FIG. 2 depicts the code for any loop with four instructions (I1, I2, I3 and I4) and an iteration count N greater than three. The code that builds up to the kernel 104 is called the prologue 202 and the code that finishes the iterations after the kernel 104 is called the epilogue 204. While the kernel itself contains a single copy of each instruction in the loop, the epilogue and prologue contain multiple copies, thus dramatically increasing the total size of the code.
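The code growth caused by the prologue and epilogue can be made concrete with the following Python sketch. It assumes an II of one cycle and single-cycle instructions, and the name split_schedule is hypothetical. Each inner list is one cycle of generated code, listed from the oldest in-flight iteration to the newest.

```python
def split_schedule(instrs):
    """Return (prologue, kernel, epilogue) for a loop pipelined with II = 1."""
    s = len(instrs)
    # Prologue cycle c runs the first c + 1 instructions, one per in-flight iteration.
    prologue = [list(reversed(instrs[: c + 1])) for c in range(s - 1)]
    # The kernel runs every instruction once, each for a different iteration.
    kernel = list(reversed(instrs))
    # Epilogue cycle c drains the pipeline: the first c + 1 instructions are gone.
    epilogue = [list(reversed(instrs[c + 1:])) for c in range(s - 1)]
    return prologue, kernel, epilogue


def code_size(prologue, kernel, epilogue):
    """Total number of instruction copies emitted."""
    return sum(len(row) for row in prologue) + len(kernel) + sum(len(row) for row in epilogue)


if __name__ == "__main__":
    pro, ker, epi = split_schedule(["I1", "I2", "I3", "I4"])
    print(pro, ker, epi, code_size(pro, ker, epi))
```

For the four-instruction loop, the kernel holds four instructions while the prologue and epilogue together hold twelve more copies.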
Conventional approaches to reducing the code size problems associated with the prologue and epilogue are unsatisfactory. For example, the prologue and epilogue may be omitted in certain architectures by the heavy use of predication. However, extensive support for predication often dramatically increases hardware cost and design complexity of the architecture. Alternatively, with some clever data layout and instruction scheduling (specifically of the backward branch instruction), portions of the prologue and epilogue may be eliminated, but this is not always possible.
FIG. 3 depicts the loop instructions from the view of the functional units, rather than from the view of the iterations. As shown in FIG. 3, instruction I1 of the first loop iteration (surrounded by line 302) is executed in cycle zero by functional unit 0 (FU0). In cycle one, I2 of the first iteration is executed not in FU0 but in FU1, while FU0 begins processing I1 for the second loop iteration. Execution of the loop occurs in steps, where the resulting data from one functional unit is handed off to another, each performing a little more work on the iteration. Hence, the name “Software Pipelining” is given to this type of instruction scheduling.
Often, an initiation interval (II) of one cycle is not achievable due to scheduling constraints. FIGS. 4A and 4B depict a loop that has an II of two cycles, so that a new iteration of the loop is started once every two cycles. The loop is broken up into stages of II cycles each (the stages for the first loop iteration 402 are labeled in the figure). For example, while stage 2 of the first loop iteration 402 is executing, stage 1 of the second loop iteration 404 begins executing. Furthermore, it is not a requirement that only one instruction be executed in each cycle for a given loop iteration. For example, as shown in FIG. 4A, two instructions 406 are executed by functional units 3 and 4 in the second cycle of the first stage, both on behalf of the same loop iteration. Likewise, no instructions are executed by any functional unit in the second cycle of the second stage 408. However, as can be seen, even though no resources are used in the second cycle of the second stage 408 by the first loop iteration 402, the functional units are executing instructions associated with the second cycle of the first stage of the second loop iteration 404.
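The division of an iteration into stages of II cycles can be sketched in Python as follows. The per-iteration schedule used here is hypothetical and is not a transcription of FIG. 4A; the instruction names A through E, their cycle offsets, and the function name build_kernel are assumptions made only for illustration. An instruction at cycle offset t within its iteration belongs to stage t // II (counted from one below) and occupies kernel row t % II.

```python
def build_kernel(iteration_schedule, ii):
    """Fold a one-iteration schedule into `ii` kernel rows.

    iteration_schedule: list of (name, cycle_offset) pairs for one iteration.
    Returns (rows, n_stages); rows[r] lists (name, stage) pairs with 1-based stages.
    """
    rows = [[] for _ in range(ii)]
    n_stages = 0
    for name, offset in iteration_schedule:
        stage = offset // ii + 1          # which II-cycle slice of the iteration
        rows[offset % ii].append((name, stage))
        n_stages = max(n_stages, stage)
    return rows, n_stages


if __name__ == "__main__":
    # Hypothetical iteration: two instructions share offset 1, and offset 3
    # (the second cycle of stage 2) is left empty.
    one_iteration = [("A", 0), ("B", 1), ("C", 1), ("D", 2), ("E", 4)]
    rows, stages = build_kernel(one_iteration, ii=2)
    print(stages, rows)
```

In this hypothetical schedule the second cycle of stage 2 is empty for a given iteration, yet the corresponding kernel row is still occupied by stage 1 instructions of the following iteration.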
The compiler physically generates the prologue, kernel, and epilogue code for the loop to be executed by the functional units, as shown in FIG. 4B. The total size of the code for executing the loop is roughly equivalent to the number of overlapped loop iterations in the kernel 104 times the number of instructions in a single loop iteration 402, or alternatively the number of instructions in the kernel 104 times the number of stages. Furthermore, if the iteration count of the loop is ever less than the number of concurrent iterations in the kernel, then for traditional modulo scheduling, the compiler needs to generate extra code to skip over the kernel from the prologue into the epilogue.
Kernel-only modulo scheduling is a technique that eliminates the need for an explicitly generated prologue and epilogue through the use of predication and a rotating register file in the processor architecture. With predicated execution, each instruction is conditionally executed based on the value of a Boolean predicate. Just prior to execution of an instruction, the register that contains the Boolean predicate is read, and if it is true, the instruction executes. Otherwise, the instruction is nullified and turned into a no operation (nop). In kernel-only modulo scheduling, each stage of execution is assigned a single predicate register that can be set to true or false, depending on whether the stage should be executed in the cycle or not. Specifically, the instructions associated with the stage are conditionally executed based on their stage's predicate. See, for example, B. R. Rau and C. D. Glaeser, “Some Scheduling Techniques and An Easily Schedulable Horizontal Architecture for High Performance Scientific Computing,” in Proceedings of the 20th Annual Workshop on Microprogramming and Microarchitecture, pp. 183-198, October 1981; and J. C. Dehnert, P. Y. Hsu, and J. P. Bratt, “Overlapped Loop Support in the Cydra 5,” in Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 26-38, April 1989.
Considering the first loop iteration 102 in FIGS. 1A and 1B, suppose that instructions I1 through I4 are predicated on Boolean predicate variables P1 through P4 respectively. Then cycle 0 of the prologue can be executed by actually executing the kernel 104 with P1 set to true, while the others are set to false. Cycle 1 of the prologue can be executed by executing the kernel 104 with P1 and P2 set to true, while the others are set to false. Likewise, the rest of the prologue and the epilogue can be executed by executing only the kernel 104 with the appropriate Boolean predicates set to true.
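The predicate settings described above can be sketched in Python. The sketch assumes an II of one cycle, one instruction per stage, and zero-based cycle and stage indices; the names predicates_for_cycle and run_kernel_only are hypothetical. Stage k is active in kernel execution t exactly when the iteration that started at cycle t - k exists.

```python
def predicates_for_cycle(t, n_iters, n_stages):
    """Boolean predicates P1..Pn for the t-th execution of the kernel."""
    return [0 <= t - k < n_iters for k in range(n_stages)]


def run_kernel_only(instrs, n_iters):
    """Execute only the kernel, nullifying instructions whose predicate is false."""
    if n_iters <= 0:
        return []
    trace = []
    for t in range(n_iters + len(instrs) - 1):
        preds = predicates_for_cycle(t, n_iters, len(instrs))
        trace.append([name for name, p in zip(instrs, preds) if p])
    return trace


if __name__ == "__main__":
    for row in run_kernel_only(["I1", "I2", "I3", "I4"], 4):
        print(row)
```

In this model an iteration count smaller than the number of stages needs no special handling, because the kernel simply never runs with all predicates true.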
In kernel-only modulo scheduling, either the modulo scheduled software loop itself can maintain and update the appropriate predicates to execute the prologue and epilogue from the kernel, or the hardware can do so through the use of a rotating register file. With such hardware, the Boolean predicate for the first stage must be set by the software, but the hardware automatically sets the predicate to true for the second stage for the next kernel iteration. In other words, by setting the predicate to true for the first stage, the hardware automatically enables the remaining stages to complete the loop iteration during successive executions of the kernel. This method relies heavily upon the architectural technique of predication and, for efficient implementation, may additionally require a rotating predicate register file. However, only the kernel with appropriately predicated instructions must be present in the program, meaning the code size of the loop is roughly equal to that of a single iteration plus the cost of encoding the Boolean predicate identifiers with each instruction. In addition, code size expansion may result from encoding the rotating register file specifier extensions or explicit instructions for manipulating the predicates.
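The hardware-assisted variant can be modeled, in greatly simplified form, as a shift of the stage predicates on every kernel execution. The following Python sketch is an assumption-laden model rather than a description of any particular processor: it assumes an II of one cycle and one instruction per stage, and the name run_with_rotating_predicates is hypothetical.

```python
def run_with_rotating_predicates(instrs, n_iters):
    """Model a rotating predicate file driving a kernel-only loop."""
    preds = [False] * len(instrs)
    remaining = max(n_iters, 0)   # iterations the software still has to start
    trace = []
    while True:
        # Rotation: each stage's predicate moves to the next stage, and the
        # software supplies the first-stage predicate for a new iteration.
        preds = [remaining > 0] + preds[:-1]
        if remaining > 0:
            remaining -= 1
        if not any(preds):
            break                 # pipeline fully drained
        trace.append([name for name, p in zip(instrs, preds) if p])
    return trace


if __name__ == "__main__":
    for row in run_with_rotating_predicates(["I1", "I2", "I3", "I4"], 4):
        print(row)
```

Only the first-stage predicate is written explicitly in this model; the remaining stages are enabled by the rotation alone.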
Unroll-based scheduling is also a technique for overlapping the execution of loop iterations. In this method, several iterations of the loop are unrolled (or duplicated) and placed after one another in the program. This new, longer sequence of instructions forms a new loop (called an unrolled loop) in which multiple iterations of the original loop are executed in each iteration of the unrolled loop. Consequently, the number of iterations of the unrolled loop is reduced in comparison with the number of iterations of the original loop. In order to reduce the time required to execute the loop, the instructions from the various original iterations now in the unrolled loop can be mixed and scheduled together, effectively overlapping iterations.
This technique differs from modulo scheduling in that loop iterations are not periodically initiated. Rather, a number of loop iterations are started at the beginning of the unrolled loop, and all are then allowed to complete before a new set of iterations is allowed to begin (a new iteration of the unrolled loop). Typically, the various loop iterations placed in the unrolled loop cannot be perfectly mixed together, resulting in idle functional units toward the beginning and end of the unrolled loop body. Essentially, no original loop iterations are executing across the back edge of the unrolled loop. By waiting for all of the original iterations in the unrolled loop to complete before starting a new set of iterations, cycles are wasted that could be used to begin new iterations. The modulo scheduling technique is capable of starting new iterations when functional units are available, albeit at the cost of increased analysis and scheduling complexity within the compiler. Unroll-based scheduling is a competing approach to modulo scheduling.
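The cycles lost at the back edge of the unrolled loop can be estimated with a simple Python model. The model is an idealization introduced only for this sketch: it assumes that the original iterations inside one unrolled body are staggered by II cycles, that nothing overlaps across the back edge, and that a partially filled final body costs as much as a full one. The function names are hypothetical.

```python
import math


def modulo_cycles(n_iters, ii, iter_length):
    """Cycles when a new iteration starts every `ii` cycles without interruption."""
    if n_iters <= 0:
        return 0
    return (n_iters - 1) * ii + iter_length


def unrolled_cycles(n_iters, unroll, ii, iter_length):
    """Cycles when `unroll` iterations overlap inside a body that must fully drain."""
    body_length = (unroll - 1) * ii + iter_length
    return math.ceil(n_iters / unroll) * body_length


if __name__ == "__main__":
    for n in (8, 64, 512):
        print(n, unrolled_cycles(n, 4, 1, 4), modulo_cycles(n, 1, 4))
```

Under these assumptions the unrolled loop pays the drain cost once per unrolled iteration, whereas the modulo scheduled loop pays it only once for the entire loop.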
Zero-overhead loop buffers are a hardware technique for executing loop iterations without having to fetch the loop body from the memory system for each iteration. A copy of the loop body is stored in a dedicated buffer, thus reducing power and simplifying the instruction fetch process. The technique is also combined with special branch instructions that manage the remaining loop iteration count in the hardware (called hardware loop semantics), without requiring instructions in the loop body to maintain the count. This method is capable of supporting loops generated by the unroll-based scheduling technique, and possibly modulo scheduled loops if predication support is available.
What continues to be needed in the art, therefore, is a method and apparatus that enables overlapping execution of loop iterations in a processor architecture without expanding the size of the code beyond the size of the kernel for executing the prologue and epilogue or for accounting for loop iterations less than the number of concurrent iterations in the kernel, and without dramatically increasing hardware cost and design complexity of the architecture. The present invention fulfills this need, among others.