1. Technical Field
The present invention relates to mechanisms for optimizing computer code, and in particular, to mechanisms for improving the performance of software-pipelined loops.
2. Background Art
Software pipelining is a method for scheduling non-dependent instructions from different logical iterations of a program loop to execute concurrently. Overlapping instructions from different logical iterations of the loop increases the amount of instruction level parallelism (ILP) in the program code. Code having high levels of ILP uses the execution resources available on modern, superscalar processors more effectively.
A loop is software-pipelined by organizing the instructions of the loop body into stages of one or more instructions each. These stages form a software-pipeline having a pipeline depth equal to the number of stages (the “stage count” or “SC”) of the loop body. The instructions for a given loop iteration enter the software-pipeline stage by stage, on successive initiation intervals (II), and new loop iterations begin on successive initiation intervals until all iterations of the loop have been started. Each loop iteration is thus processed in stages through the software-pipeline in much the same way that an instruction is processed in stages through a processor pipeline. When the software-pipeline is full, stages from SC sequential loop iterations are in process concurrently, and one loop iteration completes every initiation interval. Various methods for implementing software-pipelined loops are discussed, for example, in B. R. Rau, M. S. Schlansker, P. P. Tirumalai, Code Generation Schema for Modulo Scheduled Loops, IEEE MICRO Conference 1992 (Portland, Oreg.), and in B. R. Rau, M. Lee, P. P. Tirumalai, M. S. Schlansker, Register Allocation for Software-pipelined Loops, Proceedings of the SIGPLAN '92 Conference on Programming Language Design and Implementation (San Francisco, 1992).
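The way iterations overlap in a software-pipeline can be illustrated with a minimal sketch. It assumes that iteration i enters the pipeline at initiation interval i and advances one stage per interval; the function name pipeline_occupancy and its parameters are hypothetical and are not taken from any cited reference.

```python
def pipeline_occupancy(num_iterations, stage_count):
    """Return, per initiation interval, the (iteration, stage) pairs in flight.

    Assumption for illustration: iteration i starts at interval i and moves
    through stages 0 .. stage_count - 1, one stage per interval.
    """
    total_intervals = num_iterations + stage_count - 1
    timeline = []
    for t in range(total_intervals):
        active = []
        for iteration in range(num_iterations):
            stage = t - iteration  # stage this iteration occupies at interval t
            if 0 <= stage < stage_count:
                active.append((iteration, stage))
        timeline.append(active)
    return timeline


timeline = pipeline_occupancy(num_iterations=5, stage_count=3)
for t, active in enumerate(timeline):
    print(t, active)
```

With five iterations and SC equal to 3, the pipeline in this sketch is full (three iterations in flight) during intervals 2 through 4, and one iteration leaves its last stage in each of those intervals.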
The initiation interval (II) represents the number of processor clock cycles (“cycles”) between the start of successive iterations in a software-pipelined loop. The minimum II for a loop is the larger of a resource II (RSII) and a recurrence II (RCII) for the loop. The RSII is determined by the availability of execution units for the different instructions of the loop. For example, a loop that includes three integer instructions has an RSII of at least two cycles on a processor that provides only two integer execution units. The RCII reflects cross-iteration or loop-carried dependencies among the instructions of the loop and their execution latencies. If the three integer instructions of the above example have one cycle latencies and depend on each other as follows, inst1→inst2→inst3→inst1, the RCII is at least three cycles.
RSII and RCII are illustrated for the following code segment, which includes instructions from the IA64™ instruction set architecture (ISA) of Intel® Corporation of Santa Clara, Calif.:
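The arithmetic in the preceding example can be sketched as follows. The helper names resource_ii and recurrence_ii are hypothetical, and the distance parameter (the number of iterations spanned by the dependence cycle) is an assumption added for illustration; the example in the text corresponds to a distance of one.

```python
import math


def resource_ii(instruction_counts, unit_counts):
    """Lower bound on II from execution resources.

    Each unit class needs ceil(instructions / units) cycles per iteration;
    the most heavily loaded class sets the bound.
    """
    return max(
        math.ceil(instruction_counts[kind] / unit_counts[kind])
        for kind in instruction_counts
    )


def recurrence_ii(latencies, distance=1):
    """Lower bound on II from one dependence cycle.

    latencies: latency of each edge around the cycle, in cycles.
    distance: iterations spanned by the cycle (assumed to be 1 here).
    """
    return math.ceil(sum(latencies) / distance)


# Three integer instructions, two integer execution units.
rsii = resource_ii({"integer": 3}, {"integer": 2})

# inst1 -> inst2 -> inst3 -> inst1, one-cycle latency on each edge.
rcii = recurrence_ii([1, 1, 1])

# The minimum II is the larger of the two bounds.
min_ii = max(rsii, rcii)
print(rsii, rcii, min_ii)
```

Under these assumptions the sketch gives an RSII of 2 and an RCII of 3, so the recurrence, not the execution resources, sets the minimum II of 3 cycles.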
Here, (V19) and (V17) operate as predicates to gate the instructions that follow on and off.
Code segment (I) has an RSII of 3 cycles and an RCII of 5 cycles on an Itanium™ processor of Intel® Corporation. The RCII is determined by the chain of dependence edges (9)→(5)→(6)→(7)→(9), assuming a 2 cycle latency for instruction (6) (ld1) and a one cycle latency for the remaining instructions. The RSII is determined by the execution resources provided by the Itanium™ processor.
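The RCII stated for code segment (I) can be checked with a short calculation. The sketch assumes that each edge in the chain carries the latency of the instruction at its tail and that the chain spans a single iteration; the variable names are illustrative only.

```python
# Latencies from the text: instruction (6) (ld1) takes 2 cycles,
# the remaining instructions in the chain take 1 cycle each.
latency = {9: 1, 5: 1, 6: 2, 7: 1}

# Dependence chain (9) -> (5) -> (6) -> (7) -> (9).
chain = [9, 5, 6, 7, 9]

# Assumption: each edge costs the latency of its source instruction.
rcii = sum(latency[src] for src in chain[:-1])

rsii = 3  # stated in the text for the Itanium processor
min_ii = max(rsii, rcii)
print(rcii, min_ii)
```

The sum 1 + 1 + 2 + 1 reproduces the RCII of 5 cycles, which exceeds the RSII of 3 cycles and therefore determines the minimum II for the loop.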
A software-pipelined loop has its maximum ILP when its RCII is less than or equal to its RSII. This is difficult to achieve for loops that include control flow operations within the loop body. Control flow operations are often implemented through predicates that are evaluated by compare instructions (CMPs), and available compilers do not allow these CMPs to be speculated. An instruction is speculated when it is executed before the processor determines that the instruction needs to be executed. In software-pipelined loops, instructions from multiple loop iterations execute in parallel, and instructions from later iterations may be executed unnecessarily if the loop terminates at an earlier iteration. Speculating a CMP within a software-pipelined loop entails significant overhead to ensure that any non-speculative operations gated by a speculated CMP are canceled if the iteration containing the speculated CMP is not reached.
In code segment (I), for example, instruction (5) (CMP) determines a predicate value, V19, which activates/deactivates instructions (6) through (9), and instruction (9) (CCMP) determines whether the loop repeats or terminates. A conventional compiler includes loop-carried dependence edge (9)→(5) in the data dependence graph (DDG) for code segment (I). The loop-carried edge ensures that when code segment (I) is modulo-scheduled, CMP for the nth loop iteration does not execute until CCMP for the (n−1)st iteration determines that the nth iteration is reached. This strategy simplifies bookkeeping for software-pipelined loops, but it may also lead to unnecessarily large RCIIs for the loops, which can reduce performance.
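How the loop-carried edge closes a recurrence can be sketched with a small dependence-graph model. The sketch considers only the four edges named in the text; the full DDG for code segment (I) is not reproduced here and may contain other recurrences. The tuple layout (source, destination, latency, iteration distance), the function name rcii_of, and the assignment of each edge's latency from its source instruction are assumptions made for illustration.

```python
import math


def rcii_of(edges):
    """Largest ceil(latency / distance) over all simple cycles, or 0 if none.

    edges: (src, dst, latency, distance) tuples, where distance is the number
    of iterations the dependence crosses (0 for intra-iteration edges).
    """
    graph = {}
    for src, dst, lat, dist in edges:
        graph.setdefault(src, []).append((dst, lat, dist))
    best = 0

    def walk(start, node, lat_sum, dist_sum, visited):
        nonlocal best
        for dst, lat, dist in graph.get(node, []):
            if dst == start:
                total_dist = dist_sum + dist
                if total_dist > 0:  # only loop-carried cycles constrain II
                    bound = math.ceil((lat_sum + lat) / total_dist)
                    best = max(best, bound)
            elif dst not in visited:
                walk(start, dst, lat_sum + lat, dist_sum + dist, visited | {dst})

    for start in graph:
        walk(start, start, 0, 0, {start})
    return best


# Intra-iteration edges named in the text: (5) -> (6) -> (7) -> (9).
intra_edges = [(5, 6, 1, 0), (6, 7, 2, 0), (7, 9, 1, 0)]

# Loop-carried edge (9) -> (5) added by a conventional compiler.
loop_carried = (9, 5, 1, 1)

with_edge = rcii_of(intra_edges + [loop_carried])
without_edge = rcii_of(intra_edges)
print(with_edge, without_edge)
```

In this restricted model the edge (9)→(5) is what turns the chain into a five-cycle recurrence; without it, these four instructions impose no recurrence bound of their own.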
The present invention addresses these and other problems associated with software-pipelined loops.