1. Technical Field
The present invention relates to mechanisms for optimizing computer code, and in particular, to mechanisms for improving the performance of software pipelined loops.
2. Background Art
Software pipelining is a method for scheduling non-dependent instructions from different logical iterations of a program loop to execute concurrently. Overlapping instructions from different logical iterations of the loop increases the amount of instruction level parallelism (ILP) in the program code. Code having high levels of ILP uses the execution resources available on modern, superscalar processors more effectively.
A loop is software pipelined by organizing the instructions of the loop body into stages of one or more instructions each. These stages form a software pipeline having a pipeline depth equal to the number of stages (the xe2x80x9cstage countxe2x80x9d or xe2x80x9cSCxe2x80x9d) of the loop body. The instructions for a given loop iteration enter the software pipeline stage by stage, on successive initiation intervals (II), and new loop iterations begin on successive initiation intervals until all iterations of the loop have been started. Each loop iteration is thus processed in stages through the software pipeline in much the same way that an instruction is processed in stages through a processor pipeline. When the software pipeline is full, stages from SC sequential loop iterations are in process concurrently, and one loop iteration completes every initiation interval. Various methods for software pipelining loops are discussed, for example, in B.R. Rau, M.S. Schlansker, P.P. Tirumalai, Code Generation Schema for Modulo Scheduled Loops IEEE MICRO Conference 1992 (Portland, Oregon).
FIG. 1A is a schematic representation of a software pipeline as it processes an exemplary loop through 100 iterations (trip count=100). The instructions of the loop are organized into five stages, labeled A through E (SC=5), and each stage is indexed by the logical iteration to which it corresponds. For example, A(1) is the first iteration of the loop body instruction(s) in stage A, and D(98) is the 98th iteration of the loop body instruction(s) in stage D. The software pipeline fills during a prolog phase 120, as stages from the first iterations of the loop begin on successive initiation intervals, II(1)-II(4). The software pipeline is full during a kernel phase 130 from II(5) through II(100), and it empties during an epilog phase 140, as the last four iterations complete during II(101)-II(104).
In practice, the different stages of the loop body do not physically traverse a pipeline, as indicated in FIG. 1A. For example, a software pipelined loop that is implemented through predication provides all stages of the loop on each initiation interval. The pipeline filling/emptying behavior is effected by predicates associated with the stages (xe2x80x9cstage predicatesxe2x80x9d).
FIG. 1B represents the software pipeline of FIG. 1A in compact, predicated form. Here, stages A through E are associated with predicates p1 through p5, respectively. Setting the predicates p1 through p5 true on successive initiation intervals activates the associated stages and replicates the pipeline filling behavior of prolog phase 120. Similarly, setting predicates p1 through p5 false on successive initiation intervals, deactivates the associated stages and replicates the pipeline emptying behavior of epilog phase 140. During kernel phase 130, all predicates are true, and an instruction stage from each of five sequential loop iterations is active.
Predication thus simplifies the scheduling issues associated with software pipelined loops. All stages of the software pipeline are present in each initiation interval. Only instructions in those stages that are activated by their associated stage predicates update the architectural state of the processor (xe2x80x9cprocessor statexe2x80x9d) when they retire. Instructions in stages that are deactivated by their associated stage predicates are treated as no-operations (NOPs), and any results they generate do not update the processor state.
The dynamics of a software pipelined loop depend, in part, on the relative sizes of its trip count (TC) and its stage count (SC). A kernel phase, in which a new iteration begins and one iteration ends on each iteration interval, is present only for loops in which the trip count is greater than the stage count. For loops having relatively short trip counts, e.g. those in which the number of iterations is less than SC, the software pipeline never fills because there are more stages than there are loop iterations to fill the stages. Consequently, a short trip count loop does not have a kernel phase.
Independent of the number of iterations and stages, prolog and epilog phases 120, 140, respectively, represent overhead associated with filling and emptying the software pipelined loop. The empty stages in FIG. 1A during prolog and epilog phases 120 and 140, respectively, or the deactivated stages in FIG. 1B, represent unused processor resources. These unused resources can lead to significant efficiency losses when the software pipelined loop is nested within an outer loop, because the overhead is incurred on each iteration of the outer loop.
FIG. 1C represents a nested loop 170 that includes the software pipelined loop of FIG. 1A within a counted outer loop. FIG. 1D represents a nested loop 180 that includes a short trip count loop within a counted outer loop. In both cases, overhead associated with inner loop prolog and epilog phases 120 and 140, respectively, is incurred on each iteration of the outer loop. Between each iteration of the inner loop, the outer loop implements any other instructions in its loop body, including those necessary to set up the inner loop for execution.
The present invention addresses these and other inefficiencies related to implementing nested loops using software pipelining techniques.
The present invention supports efficient processing of nested loops that include a software pipelined inner loop within an outer loop. The inner loop is executed without draining its software pipeline until the final iteration of the outer loop is reached. This eliminates the overhead associated with prolog and epilog stages of the inner loop for all but the first and last iterations, respectively, of the outer loop.
In accordance with the present invention, the software pipelined inner loop of the nested loop is executed in a xe2x80x9cno-drainxe2x80x9d mode for the first Nxe2x88x921 iterations of an N-iteration outer loop. For the last iteration of the outer loop, the software pipelined inner loop is executed in a xe2x80x9cdrainxe2x80x9d mode.
For one embodiment of the present invention, no-drain mode is established by initializing an epilog counter for the inner loop to a first value for the first Nxe2x88x921 iterations of the outer loop. When the final iteration of the outer loop is detected, the epilog counter for the inner loop is set to a second value. The first value indicates that the pipeline should not be drained. The second value indicates the actual number of stages in the software pipelined inner loop to be drained. For other embodiments, the values written to the epilog counter for the drain and no-drain modes may be adjusted to account for instruction dependencies between the inner and outer loops.