The ability to software pipeline computer program loops is well known in the art and is essential to achieving good performance on Very Long Instruction Word (VLIW) computer architectures. In VLIW architectures, a compiler packs a number of single, independent operations into the same instruction word. When fetched from cache or memory into a processor, these words are easily broken up and the operations are dispatched to independent execution units. VLIW can perhaps best be described as a software- or compiler-based superscalar technology. A program loop consists of multiple iterations of the same instructions in a software program. Without software pipelining, the first iteration of a loop is completed before the second iteration is begun, the second iteration is completed before the third iteration is begun, and so on. The following is an example of a typical FOR loop, where n represents the number of desired iterations when the loop begins:
loop:                    ; FOR loop
        ins1
        ins2
        ins3
        dec n            ; n = n−1
        [n] br loop      ; branch to loop if n>0
In the absence of software pipelining, and assuming dependence constraints are met, a possible “schedule” for these instructions (ins1, ins2, ins3) on a VLIW processor might be as follows:
loop:
        ins1
        ins2 ∥ dec n             ; n = n−1
        ins3 ∥ [n] br loop       ; branch to loop if n>0

(Note: The ∥ operator denotes instructions that execute in parallel.)
To be most efficient, the source code corresponding to program loops should be compiled to take advantage of the parallelism of VLIW architectures. The software pipelining optimization has been used extensively to exploit this parallel processing capability by generating code instructions for multiple operations per clock cycle.
With software pipelining, iterations of a loop in a source program are compiled in such a way that when the program is run the iterations are continuously initiated at constant intervals without having to wait for preceding iterations to complete. Thus, multiple iterations, in different stages of their computations, are in progress simultaneously across multiple parallel processors.
Software pipelining thus addresses the problem of scheduling the operations within an iteration, such that the iterations can be pipelined to yield optimal throughput. See Monica Lam, “Software Pipelining: An Effective Scheduling Technique for VLIW Machines,” Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation (1988). Care must be taken that additional iterations are not initiated once the end-condition of the loop is met. With the FOR loop above, extra future iterations may be prevented because it is easy to anticipate when the loop will end (i.e., the loop will end when n=0, and because n is consistently decremented we can anticipate the value of n for a given iteration).
A software pipelining of the FOR loop is listed below. The set of parallel instructions immediately following the label “kernel” is executed repeatedly until the final iteration is started.
loop:   sub n,2,n                        ; execute kernel n−2 times
        ins1                             ; prolog stage 1
        ins2 ∥ ins1                      ; prolog stage 2
;------------------------------------------
kernel: ins3 ∥ ins2 ∥ ins1 ∥ [n] dec n ∥ [n] br kernel
;------------------------------------------
        ins3 ∥ ins2                      ; epilog stage 1
        ins3                             ; epilog stage 2
In the pipelined code above, the three-cycle loop becomes a one-cycle loop by executing consecutive iterations of the loop in parallel. The kernel of the loop acts as a pipeline, processing one “stage” of each of the iterations in parallel. The pipeline is primed through the prolog code and drained through the epilog code, which together surround the kernel. The size of the kernel may be referred to as the “iteration interval” (II). In the example above, the II is 1.
In some cases, each stage, including the kernel, consists of multiple cycles. For example, this may be due to hardware restrictions such as the need to perform three multiplication operations when there are only two multipliers available. To accomplish this, two multiplications would be performed in parallel in one cycle of the kernel, and the third multiplication would be performed during the other cycle.
The kernel size may also be increased because of loop-carried data dependences in the loop being software pipelined. In general, an instruction in a future iteration cannot be issued until all results that it needs from previous iterations have been computed.
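The two constraints just described can be turned into lower bounds on the kernel size. The following is a minimal sketch only, using the standard resource and recurrence bounds from the modulo scheduling literature rather than anything stated in this text. The function names, the dictionary layout, and the (latency, distance) description of a loop-carried dependence are all assumptions made for illustration.

```python
from math import ceil


def resource_bound(uses_per_iteration, units_available):
    """Fewest kernel cycles the hardware allows (hypothetical helper).

    uses_per_iteration maps a resource name to the number of operations
    per iteration that need it; units_available maps the same name to
    the number of such units in the machine.
    """
    return max(
        (ceil(uses / units_available[name])
         for name, uses in uses_per_iteration.items()),
        default=1,
    )


def recurrence_bound(recurrences):
    """Fewest kernel cycles the loop-carried dependences allow.

    Each recurrence is (latency, distance): the dependence cycle takes
    `latency` cycles in total and spans `distance` iterations.
    """
    return max((ceil(lat / dist) for lat, dist in recurrences), default=1)


def min_kernel_size(uses_per_iteration, units_available, recurrences=()):
    """The kernel can be no shorter than the larger of the two bounds."""
    return max(resource_bound(uses_per_iteration, units_available),
               recurrence_bound(recurrences))


# Three multiplications per iteration on two multipliers need a 2-cycle kernel.
print(min_kernel_size({"mul": 3}, {"mul": 2}))            # 2
# A 4-cycle dependence on the previous iteration raises the bound to 4.
print(min_kernel_size({"mul": 3}, {"mul": 2}, [(4, 1)]))  # 4
```

These are lower bounds only; an actual scheduler may need a larger kernel once all constraints are considered together.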
In the example above, a given iteration of the FOR loop begins in the kernel while the previous two iterations are still being executed. Since two iterations start before the kernel is reached, the kernel only needs to be executed n-2 times, so at the beginning of the loop code n is set equal to n-2. Specifically, if m represents the number of iterations started, or the trip count, ins1 begins a new iteration m while simultaneously ins2 executes in the middle of iteration m-1 and ins3 executes to finish iteration m-2. However, once the final desired iteration begins (i.e., when n=0), care must be taken so that no new iterations are initiated in the following two clock cycles while the m-1 and m iterations complete. In other words, ins1 must not execute again. As shown above, this can be accomplished by unrolling the last two iterations of the pipelined loop and emitting only the instructions necessary to complete the m-1 and m iterations already in progress.
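The bookkeeping described above can be checked with a small cycle-by-cycle model. This is a sketch under stated assumptions: each instruction takes one cycle, the trip count is at least three, and the function names and trace format are hypothetical. Each inner list holds the (instruction, iteration) pairs issued in one cycle.

```python
def run_sequential(n):
    """Unpipelined loop: each iteration finishes before the next begins."""
    trace = []
    for i in range(n):
        trace.append([("ins1", i)])
        trace.append([("ins2", i)])
        trace.append([("ins3", i)])
    return trace


def run_pipelined(n):
    """Prolog, kernel and epilog for a three-stage loop with II = 1."""
    if n < 3:
        raise ValueError("this model assumes a trip count of at least 3")
    trace = [
        [("ins1", 0)],                # prolog stage 1
        [("ins2", 0), ("ins1", 1)],   # prolog stage 2
    ]
    for m in range(2, n):             # kernel executes n-2 times
        trace.append([("ins3", m - 2), ("ins2", m - 1), ("ins1", m)])
    trace.append([("ins3", n - 2), ("ins2", n - 1)])  # epilog stage 1
    trace.append([("ins3", n - 1)])                   # epilog stage 2
    return trace


def per_iteration_order(trace):
    """Order in which each iteration saw its instructions."""
    order = {}
    for cycle in trace:
        for name, iteration in cycle:
            order.setdefault(iteration, []).append(name)
    return order


n = 5
print(len(run_sequential(n)), len(run_pipelined(n)))  # 15 7
# Both schedules run ins1, ins2, ins3 in order for every iteration.
print(per_iteration_order(run_sequential(n)) == per_iteration_order(run_pipelined(n)))
```

In this model the pipelined schedule takes n+2 cycles instead of 3n, and ins1 is never issued in the epilog, so no iteration beyond the n-th is started.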
In contrast, for arbitrary condition loops such as WHILE loops or REPEAT-UNTIL loops there is no way to anticipate that the loop has begun its last iteration until the condition changes. As a result, using the software pipelining technique as described above may result in the initiation of additional iterations after the cycle in which the loop's end-condition was met. Accordingly, there is a danger that instructions executed in any additional iterations will, for example, change values that should have been finalized within the proper number of iterations.
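The hazard can be illustrated with a small model. This is a sketch only, assuming a three-stage loop with II = 1 whose exit test is evaluated in the last stage of each iteration (as in a REPEAT-UNTIL loop). The function name, the list of values, and the sentinel are hypothetical.

```python
STAGES = 3  # ins1, ins2, ins3, as in the FOR loop example


def naive_pipelined_while(values, stop):
    """Count iterations wanted by the source loop and iterations started.

    Iteration i exits the loop when values[i] == stop, but that test is
    only evaluated in the last stage, STAGES - 1 cycles after the
    iteration was started.
    """
    if stop not in values:
        raise ValueError("the loop would never terminate")
    started = 0
    wanted = None
    cycle = 0
    while wanted is None:
        started += 1                    # ins1 of a new iteration issues
        oldest = cycle - (STAGES - 1)   # iteration now in its last stage
        if oldest >= 0 and values[oldest] == stop:
            wanted = oldest + 1         # the source loop ends here
        cycle += 1
    return wanted, started


wanted, started = naive_pipelined_while([5, 7, 0, 9, 9], stop=0)
print(wanted, started)  # 3 5
```

In this model two extra iterations have already issued ins1 by the time the exit condition is known. If ins1 writes to memory or to a register that is live after the loop, those writes are the unwanted changes described above.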
Traditionally, only a restricted set of regular FOR loops could be pipelined. The reason is that code must be generated (or hardware must be used) to pipe down the loop (empty out the pipeline) near the end of the loop. To do so, it must be possible for the compiler or hardware to determine how many iterations in the loop remain.
More recently, it has become known in the art how to use special-purpose hardware to support pipelining of a more general class of loops known as WHILE loops. A WHILE loop is defined to be a loop which is entered at the top and exited at the bottom, or which can be transformed into such a loop. Moreover, the sequence of execution of the loop body must match the static ordering of the instructions within the body. WHILE loops cover a very general class of loops which subsumes FOR loops. See Tirumalai, et al., “Parallelization of WHILE Loops on Pipelined Architectures” in The Journal of Supercomputing, Vol. 5, 119-136, Kluwer Academic Publishers (1991). For many applications, however, this hardware is expensive in terms of cost or power, or is simply not available.
Software pipelined loops, such as the one depicted previously, generally have a minimum trip count requirement. In particular, they can only be safely executed if the trip count is greater than or equal to the number of concurrently executed iterations in the steady state. For the previous example, that number is 3. The reason is that the shortest path through this loop would cause three iterations to be executed.
However, it is known in the literature how to handle the case where the compiler has insufficient knowledge to guarantee this safety criterion for a software pipelined loop. This problem is handled at compile time by generating two versions of the loop and using a run-time trip count check to choose between them:
if (n >= min required trip count)
        pipelined version
else
        original version
endif
This technique is referred to as multiversion code generation. Unfortunately, it has the negative side effects of increasing code size and adding run-time overhead. See Monica Lam, “Software Pipelining: An Effective Scheduling Technique for VLIW Machines,” Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation (1988).
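The run-time check can be modeled as follows. This is an illustrative sketch rather than compiler output: all names are hypothetical, and the pipelined version is assumed to test its branch at the bottom of the kernel, so the kernel always executes at least once.

```python
MIN_TRIP_COUNT = 3  # iterations in flight in the steady state of the example


def original_version(n, log):
    """Unpipelined loop: safe for any trip count, including zero."""
    for i in range(n):
        log += [("ins1", i), ("ins2", i), ("ins3", i)]


def pipelined_version(n, log):
    """Prolog, kernel, epilog; the shortest path runs three iterations."""
    log += [("ins1", 0)]                                     # prolog stage 1
    log += [("ins2", 0), ("ins1", 1)]                        # prolog stage 2
    m = 2
    while True:                                              # kernel
        log += [("ins3", m - 2), ("ins2", m - 1), ("ins1", m)]
        m += 1
        if m >= n:                                           # bottom-tested exit
            break
    log += [("ins3", m - 2), ("ins2", m - 1)]                # epilog stage 1
    log += [("ins3", m - 1)]                                 # epilog stage 2


def run_loop(n):
    """Multiversion dispatch: choose a version with a trip count check."""
    log = []
    if n >= MIN_TRIP_COUNT:
        pipelined_version(n, log)
    else:
        original_version(n, log)
    return log


def iterations_started(log):
    return sum(1 for name, _ in log if name == "ins1")


unsafe = []
pipelined_version(1, unsafe)           # bypasses the check on purpose
print(iterations_started(unsafe))      # 3: two iterations too many
print(iterations_started(run_loop(1))) # 1: the check selects the original loop
```

Both loop bodies are present in the generated program, which is the source of the code size increase, and the comparison against MIN_TRIP_COUNT is the run-time overhead.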