Software pipelining for programmable, very long instruction word (VLIW) computers is a technique for introducing parallelism into machine computation of software loops. If different parts of the software loop use different hardware resources, the computation of one iteration of the loop may be started before the prior iteration has finished, thus reducing the total computation time. In this way several iterations of the loop may be in progress at any one time. In machines controlled by VLIW instructions, the instructions in the middle of the loop (where the pipeline is full) are different from the instructions at the start of the loop (the prolog) and the instructions at the end of the loop (the epilog). If a computation requires a number of different loops, a relatively large amount of memory is required to store instructions for the epilog and prolog portions of the loops.
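The prolog, steady-state, and epilog phases described above can be sketched in C. This is only an illustration under stated assumptions: the function name `pipelined_scale_bias` and the two-stage loop body (stage A loads and doubles a value, stage B adds one and stores it) are hypothetical and are not taken from any particular machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-stage loop body computing out[i] = in[i] * 2 + 1.
 * Stage A: load and multiply. Stage B: add and store.
 * The stages are overlapped by hand to mimic a software pipeline. */
void pipelined_scale_bias(const int *in, int *out, size_t n)
{
    assert(in != NULL && out != NULL);
    if (n == 0)
        return;

    /* Prolog: the pipeline fills; only stage A of iteration 0 runs. */
    int in_flight = in[0] * 2;

    /* Steady state: stage A of iteration i overlaps stage B of iteration i-1. */
    for (size_t i = 1; i < n; i++) {
        int next = in[i] * 2;        /* stage A, iteration i     */
        out[i - 1] = in_flight + 1;  /* stage B, iteration i - 1 */
        in_flight = next;
    }

    /* Epilog: the pipeline drains; only stage B of the last iteration runs. */
    out[n - 1] = in_flight + 1;
}
```

The three phases use three different instruction sequences, which is the source of the instruction-memory cost noted above.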
Software pipelining for programmable VLIW machines, such as the IA-64, is accomplished by predicating instructions and executing them conditionally as the software pipeline fills and drains. The predication mechanism tags each instruction with a predicate that conditions both its execution and the committing of its results to the register file in a general-purpose processor. This approach is generalized in these processors because the predication mechanism is also used for general conditional execution. A disadvantage of this technique is the requirement for a centralized predicate register file.
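As a rough software model of predicated pipelining, a single loop body can serve for the fill, steady-state, and drain phases if each stage is guarded by its own predicate. The function name `predicated_scale_bias`, the predicate variables `p_a` and `p_b`, and the two-stage loop body are hypothetical. The `if` statements only model predication in software; on a predicated machine each guarded operation would be issued every cycle and its result discarded when its predicate is false.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-stage loop computing out[i] = in[i] * 2 + 1, written as
 * one kernel whose stages are enabled by predicates instead of separate
 * prolog and epilog code. */
void predicated_scale_bias(const int *in, int *out, size_t n)
{
    assert(in != NULL && out != NULL);
    int in_flight = 0;  /* value passed from stage A to stage B */

    /* n iterations plus one extra pass to drain the second stage. */
    for (size_t k = 0; k < n + 1; k++) {
        int p_a = (k < n);   /* stage A active while filling and in steady state  */
        int p_b = (k >= 1);  /* stage B active in steady state and while draining */

        if (p_b)
            out[k - 1] = in_flight + 1;  /* stage B, iteration k - 1 */
        if (p_a)
            in_flight = in[k] * 2;       /* stage A, iteration k     */
    }
}
```

In this sketch the predicates are recomputed from the loop counter on every pass; a hardware implementation would instead hold them in the predicate register file mentioned above.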
Loop-unrolling is a common technique to improve the throughput of inner loops. Unrolling increases efficiency in processors having multiple functional units and also allows various operational latencies to be overlapped. However, loop-unrolling has shortcomings when the number of input data items is not a multiple of the unrolling factor. The problem is exacerbated if one iteration of the calculation uses a value calculated in a different iteration, which is called a cross-iteration dependency.
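The residual problem can be seen in a minimal sketch of a loop unrolled by a factor of four. The function name `scale_unrolled4`, the scale factor of 3, and the unrolling factor are assumptions chosen only for illustration. The iterations here are independent, so there is no cross-iteration dependency.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element-wise loop out[i] = in[i] * 3, unrolled by four. */
void scale_unrolled4(const int *in, int *out, size_t n)
{
    assert(in != NULL && out != NULL);
    size_t i = 0;
    size_t main_end = n - (n % 4);  /* largest multiple of 4 not exceeding n */

    /* Unrolled body: four independent operations per pass. */
    for (; i < main_end; i += 4) {
        out[i]     = in[i]     * 3;
        out[i + 1] = in[i + 1] * 3;
        out[i + 2] = in[i + 2] * 3;
        out[i + 3] = in[i + 3] * 3;
    }

    /* Residual items: needed whenever n is not a multiple of 4. */
    for (; i < n; i++)
        out[i] = in[i] * 3;
}
```

With six input items, the unrolled body handles the first four and the residual loop handles the remaining two.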
Loops that accumulate (e.g. dot products) cannot be unrolled for higher concurrency unless the multiple inputs to the accumulator can be managed. For example, if the calculation of a dot product is broken into two parts, the code will not work when an odd number of iterations is required. Previous loop-unrolling techniques use epilog instructions to deal with the residual computations, as described above.
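A dot product broken into two parts might look like the following sketch, which keeps two partial sums and uses an explicit epilog step for an odd element count. The function name `dot_unrolled2` and the choice of two accumulators are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dot product unrolled by two with separate partial sums. */
long dot_unrolled2(const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    long acc0 = 0;  /* partial sum over even-indexed elements */
    long acc1 = 0;  /* partial sum over odd-indexed elements  */
    size_t i = 0;

    /* Unrolled body: two independent multiply-accumulates per pass. */
    for (; i + 1 < n; i += 2) {
        acc0 += (long)a[i]     * b[i];
        acc1 += (long)a[i + 1] * b[i + 1];
    }

    /* Epilog: one leftover element when n is odd. */
    if (i < n)
        acc0 += (long)a[i] * b[i];

    /* Combine the two partial sums into the final result. */
    return acc0 + acc1;
}
```

Without the epilog step, the final element of an odd-length input would be silently dropped, which is the failure described above.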