The present invention relates to a method and apparatus for executing instructions in a computer. More specifically, the present invention relates to a method and apparatus for improving the performance of assembly code generated by an optimizing compiler. This is done by integrating data prefetching and modulo scheduling, and by inserting prefetch instructions generated after modulo scheduling has been performed.
Modern processors employ numerous techniques to provide the performance that today's applications require. One such technique is instruction pipelining. Instruction pipelining is a processing technique whereby multiple instructions are overlapped in execution. This simultaneous execution of instructions is sometimes referred to as instruction-level parallelism. It is a goal of pipelined architectures to maximize instruction-level parallelism.
In a processor having a pipelined architecture, one means of increasing instruction-level parallelism is the use of multiple pipelines, in which instructions are issued using a scheduler or similar hardware construct. Processors employing such constructs are commonly known as superscalar processors. Instructions may be scheduled for issue to the pipelines based on numerous factors, such as pipeline availability, op-code type, operand availability, data dependencies, and other factors. In such architectures, the processor's pipelines must be used efficiently to maximize instruction-level parallelism. This means that the pipelines are fed instructions as quickly as possible, while pipeline stalls are minimized.
To ensure that the various pipelines in superscalar processors are efficiently used, complex compilers have been created to generate code that takes maximum advantage of superscalar processors' capabilities. Such compilers orchestrate the issue and execution of instructions to maximize instruction-level parallelism and, thus, throughput. One method employed in such compilers is modulo scheduling.
FIG. 1 illustrates the execution of seven iterations of a pipelined loop as scheduled for execution by a compiler having modulo scheduling capabilities. Modulo scheduling achieves instruction level parallelism by beginning the execution of one or more iterations of a loop before the previous iteration has completed. This is done by issuing instructions in the subsequent iteration(s) to available pipelines within the superscalar processor. A concept fundamental to modulo scheduling is initiating new iterations at fixed intervals. The interval employed is referred to as the initiation interval or iteration interval (II), and is exemplified in FIG. 1 by an iteration interval 100.
The scheduled length of a single iteration (trip length, or TL) is divided into stages, each one of a length equal to iteration interval 100. The number of stages that each iteration requires may be defined as:
$$SC = \frac{TL}{II} \qquad (1)$$
where SC is stage count. The three phases of loop execution (after modulo-scheduling) are shown in FIG. 1 as a prologue 102, a kernel 104, and an epilogue 106. During prologue 102 and epilogue 106, not all stages of successive iterations execute. Only during kernel 104 are all stages of the loop being executed. Prologue 102 and epilogue 106 last for (SC-1)*II machine cycles. The number of times the loop is to be iterated is known as the trip count. If the trip count is relatively large, kernel 104 will last much longer than prologue 102 or epilogue 106. The primary performance metric for a modulo-scheduled loop is the iteration interval. It is a measure of the steady state throughput for loop iterations. Smaller iteration interval values imply higher throughput. Therefore, a modulo scheduler preferably attempts to derive a schedule that minimizes the iteration interval. The time required to execute n iterations is:
$$T(n) = (n + SC - 1) \cdot II \qquad (2)$$
The average time to execute one of these iterations is effectively:
$$T(n)_{\text{iteration}} = \frac{(n + SC - 1) \cdot II}{n} \qquad (3)$$
As can be seen from Equation 3, $T(n)_{\text{iteration}}$ approaches II as n approaches infinity.
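By way of illustration only, Equations 1 through 3 may be evaluated as in the following Python sketch. The function names are hypothetical, and the sketch assumes that TL is a whole multiple of II, so that Equation 1 yields an integer stage count.

```python
from fractions import Fraction


def stage_count(tl, ii):
    """Equation 1: SC = TL / II (assumes TL is a whole multiple of II)."""
    if ii <= 0 or tl % ii != 0:
        raise ValueError("this sketch assumes II > 0 and TL divisible by II")
    return tl // ii


def total_time(n, sc, ii):
    """Equation 2: machine cycles needed to execute n iterations."""
    return (n + sc - 1) * ii


def average_time(n, sc, ii):
    """Equation 3: average machine cycles per iteration, kept exact."""
    return Fraction(total_time(n, sc, ii), n)


if __name__ == "__main__":
    sc = stage_count(12, 3)
    for n in (7, 100, 1000):
        print(n, total_time(n, sc, 3), float(average_time(n, sc, 3)))
```

With TL = 12 and II = 3, the stage count is 4, seven iterations take 30 machine cycles, and the average for 1000 iterations is 3.009 machine cycles, which is already close to II.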
The execution of the loop scheduled in FIG. 1 begins with stage 0 of a first iteration 110. During the first II machine cycles, no other iteration executes concurrently. This exemplifies iteration interval 100. This also marks the beginning of prologue 102. After the first II machine cycles, first iteration 110 enters stage 1 and a second iteration 112 enters stage 0. New iterations (e.g., a third iteration 114) join every II machine cycles until a state is reached when all stages of different iterations are executing. This marks the beginning of kernel 104, and is exemplified by the execution of a fourth iteration 116, a fifth iteration 118, a sixth iteration 120, and a seventh iteration 122. Toward the end of loop execution, during epilogue 106, no new iterations are initiated and those that are in various stages of progress gradually complete.
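The progression through prologue, kernel, and epilogue may be sketched as follows. This is an illustrative Python model only: time is counted in whole intervals of II machine cycles, iterations and stages are numbered from zero, the trip count is assumed to be at least SC, and the stage count of 4 used in the example is an assumption rather than a value stated for FIG. 1.

```python
def active_stages(n, sc, t):
    """Return (iteration, stage) pairs executing during interval t.

    Iteration i starts at interval i and is in stage t - i while
    0 <= t - i < sc.
    """
    return [(i, t - i) for i in range(n) if 0 <= t - i < sc]


def phase(n, sc, t):
    """Classify interval t, assuming n >= sc."""
    if t < sc - 1:
        return "prologue"   # pipeline still filling
    if t < n:
        return "kernel"     # all sc stages are busy
    return "epilogue"       # no new iterations are started


if __name__ == "__main__":
    n, sc = 7, 4
    for t in range(n + sc - 1):
        print(t, phase(n, sc, t), active_stages(n, sc, t))
```

Under these assumptions, the prologue and the epilogue each span SC − 1 intervals, which agrees with the (SC-1)*II machine cycles stated above.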
Scheduling in a compiler employing modulo scheduling may proceed in a manner similar to the following. The data dependence graph (DDG), a directed graph, is constructed for the loop being scheduled. The nodes in this graph correspond to instructions, with the arcs corresponding to dependencies between them. Two attributes the arcs possess are latency and the dependence distance (also referred to as omega or "Ω"). Latency is the number of machine cycles of separation required between a source instruction and a sink (or destination) instruction. A source instruction usually provides some or all of the data used by a destination instruction. Omega represents the iteration distance between the two nodes (instructions). In other words, omega represents the number of loop iterations from the source instruction to the destination instruction. For example, data generated by a source instruction in the first iteration may not be needed until the third iteration, equating to an omega of two.
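A DDG arc and its two attributes may be represented as in the following Python sketch. The class and function names are hypothetical. The inequality used to test an arc is the formulation commonly used in modulo scheduling and is an assumption here, since the text above does not state it explicitly: the sink, which executes omega iterations (omega × II machine cycles) later than the source's iteration, must issue at least `latency` machine cycles after the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    source: str   # instruction producing the value
    sink: str     # instruction consuming the value
    latency: int  # required separation, in machine cycles
    omega: int    # dependence distance, in loop iterations


def arc_satisfied(arc, times, ii):
    """Check one arc against issue times given within a single iteration.

    Assumed condition: t(sink) + omega * II >= t(source) + latency.
    """
    return times[arc.sink] + arc.omega * ii >= times[arc.source] + arc.latency


if __name__ == "__main__":
    arc = Arc(source="mul", sink="add", latency=4, omega=2)
    print(arc_satisfied(arc, {"mul": 5, "add": 3}, ii=3))
```

An omega of two, as in the example above, relaxes the constraint by two iteration intervals, which is why the sink may be placed earlier in the schedule than the source.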
Prior to scheduling, two bounds on the maximum throughput are derived: the minimum iteration interval (MII) and the recurrence minimum iteration interval (RMII). The MII is a bound on the minimum number of machine cycles needed to complete one iteration and is based only on processor resources. For example, if a loop has ten add operations and the processor is capable of executing at most two add operations per machine cycle, then the add unit resource would limit throughput to at most one iteration every five machine cycles. The MII is computed by determining the maximum throughput for each resource in terms of iterations per machine cycle, in turn, and taking the minimum of those maximum throughput values as the processor's maximum guaranteed throughput.
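The resource-based bound may be computed as in the following Python sketch, which is illustrative only. The function name and the dictionary layout are hypothetical. The sketch expresses each resource's limit in machine cycles per iteration rather than iterations per machine cycle, so the minimum throughput described above corresponds to the maximum over resources here, and it rounds up to a whole number of machine cycles.

```python
import math


def resource_mii(uses_per_iteration, units_per_cycle):
    """Return the resource-constrained bound on II, in machine cycles.

    uses_per_iteration: resource name -> operations of that kind in the loop
    units_per_cycle:    resource name -> operations issued per machine cycle
    """
    bounds = [
        math.ceil(uses / units_per_cycle[resource])
        for resource, uses in uses_per_iteration.items()
        if uses > 0
    ]
    # A loop that uses no resources still needs at least one cycle.
    return max(bounds, default=1)


if __name__ == "__main__":
    # Ten adds on two add units, as in the example above.
    print(resource_mii({"add": 10, "load": 3}, {"add": 2, "load": 1}))
```

For the example in the text, ten add operations on two add units give a bound of five machine cycles, and the add unit rather than the hypothetical load unit is the limiting resource.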
The RMII is a bound on the minimum number of machine cycles needed to complete one iteration and is based only on dependencies between nodes. DDG cycles imply that a value $x_i$ computed in some iteration i is used in a future iteration j and is needed to compute a similarly propagated value in iteration j. These circular dependencies place a limit on how rapidly iterations can execute because computing the values needed in the DDG cycle takes time. For each elementary DDG cycle, the ratio of the sum of the latencies (l) to the sum of the omegas (d) is computed. This value limits the iteration throughput because it takes l machine cycles to compute values in a DDG cycle that spans d iterations.
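The recurrence-based bound may be sketched in Python as follows. The sketch assumes that the elementary DDG cycles have already been enumerated and are supplied as lists of (latency, omega) pairs, one pair per arc; the function name is hypothetical. Taking the largest ratio over all cycles and rounding it up to a whole machine cycle are assumptions of the sketch.

```python
import math


def recurrence_mii(cycles):
    """Return the dependence-constrained bound on II, in machine cycles.

    cycles: iterable of elementary DDG cycles, each a list of
            (latency, omega) pairs for the arcs on that cycle.
    """
    bound = 0
    for arcs in cycles:
        l = sum(latency for latency, _ in arcs)  # total latency
        d = sum(omega for _, omega in arcs)      # iterations spanned
        if d <= 0:
            raise ValueError("a DDG cycle must span at least one iteration")
        bound = max(bound, math.ceil(l / d))
    return bound


if __name__ == "__main__":
    cycles = [
        [(2, 0), (1, 1)],  # l = 3, d = 1
        [(4, 0), (3, 2)],  # l = 7, d = 2
    ]
    print(recurrence_mii(cycles))
```

In this example the first cycle requires 3 machine cycles per iteration and the second requires 3.5, so the bound is 4 machine cycles.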
The fixed spacing between overlapped iterations forces a constraint on the scheduler other than the normal constraints imposed by the arcs in the DDG. Note that placing an operation at a time t implies that there exists a corresponding operation in the kth future iteration at (t+k*II). Operations that use the same resource must be scheduled such that they are executed at different times, modulo the II, to avoid stalls caused by a resource being in use. This is referred to as the modulo constraint. The modulo constraint implies that if an operation uses a resource at time $t_1$ and another operation uses that same resource at time $t_2$, then $t_1$ and $t_2$ must satisfy:
$$t_1 \bmod II \neq t_2 \bmod II \qquad (4)$$
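The modulo constraint of Equation 4 is often enforced with a table indexed by time modulo II. The following Python sketch is illustrative only: the class name is hypothetical, and it assumes that each resource has a single unit, so that two operations conflict exactly when Equation 4 is violated.

```python
class ModuloReservationTable:
    """Tracks which (resource, t mod II) slots are already in use."""

    def __init__(self, ii):
        if ii <= 0:
            raise ValueError("II must be positive")
        self.ii = ii
        self.taken = set()

    def try_reserve(self, resource, t):
        """Reserve the resource at time t; return False on a conflict."""
        slot = (resource, t % self.ii)
        if slot in self.taken:
            return False  # some other operation has the same t mod II
        self.taken.add(slot)
        return True


if __name__ == "__main__":
    table = ModuloReservationTable(ii=3)
    print(table.try_reserve("add", 1))  # slot 1 is free
    print(table.try_reserve("add", 4))  # 4 mod 3 == 1, so this conflicts
    print(table.try_reserve("add", 5))  # slot 2 is free
```

With II = 3, times 1 and 4 fall in the same slot and therefore cannot both use the add unit, whereas times 1 and 5 can.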
A modulo scheduler begins by attempting to derive a schedule using II=max (MII, RMII). If a schedule is not found, the II is incremented. The process repeats until a schedule is found or an upper limit is reached. After scheduling, the kernel has to be unrolled and definitions renamed to prevent values from successive iterations from overwriting each other, for example, by writing to the same registers. The minimum kernel unroll factor (KUF) necessary is determined by the longest lifetime divided by the II because corresponding new lifetimes begin every II machine cycles. Remaining iterations (up to KUF−1) are executed in a cleanup loop.
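The search over candidate II values and the computation of the KUF may be sketched as follows. This Python sketch is illustrative only. The callback `try_schedule` is a hypothetical stand-in for the scheduler proper, returning a schedule or `None`, and rounding the KUF up to a whole number is an assumption consistent with its being the minimum factor necessary.

```python
import math


def find_schedule(mii, rmii, max_ii, try_schedule):
    """Try II = max(MII, RMII), then II + 1, and so on up to max_ii.

    Returns (ii, schedule) on success or None if the limit is reached.
    """
    ii = max(mii, rmii)
    while ii <= max_ii:
        schedule = try_schedule(ii)
        if schedule is not None:
            return ii, schedule
        ii += 1  # no schedule found at this II; relax it by one cycle
    return None


def kernel_unroll_factor(longest_lifetime, ii):
    """Minimum KUF: longest value lifetime divided by II, rounded up."""
    return math.ceil(longest_lifetime / ii)


if __name__ == "__main__":
    # Hypothetical scheduler that only succeeds once II reaches 7.
    result = find_schedule(3, 5, 10, lambda ii: "ok" if ii >= 7 else None)
    print(result, kernel_unroll_factor(7, 3))
```

In the example, the search starts at II = 5, succeeds at II = 7, and a longest lifetime of 7 machine cycles at II = 3 would call for unrolling the kernel three times.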
Compilers employing modulo-scheduling thus provide efficient utilization of superscalar architectures. However, in order to make efficient use of such architectures, instructions and data must be made available in a timely manner. To maintain the high data rates required, designers commonly employ multi-level memory architectures including, for example, main memory units and cache memory units. Additional memory levels are normally added in the form of multiple cache levels (e.g., on-chip and off-chip cache memory).
Some of today's microprocessor architectures extend this construct by delineating between the caching of instructions and data. Recently, specialized data caches have been included in certain microprocessor architectures to allow for the storage of certain information related on the basis of various characteristics, such as repetitive use in floating point or graphics calculations. To make the best use of such caches, it is often desirable to load (or fetch) the requisite data prior to its being needed. In this manner, data that is likely to be needed in the future can be loaded while other operations are performed. This technique is known as data prefetching.
As can be seen, modulo scheduling and data prefetching are complementary techniques. Modulo scheduling maximizes instruction-level parallelism by making efficient use of multiple pipelines, and data prefetching is an efficient way to make data available at the high data rates mandated by superscalar architectures. Thus, it is desirable to employ both techniques to maximize performance.
One approach to combining data prefetches and modulo-scheduling would be to first unroll the loop body and then simply precede the unrolled body in toto with the requisite data prefetch instructions. In such a case, the entire loop body is unrolled a certain number of times, and the requisite number of prefetches inserted prior to scheduling the loop.
Unfortunately, this is a simplistic solution to a complex problem. In effect, this approach yields a very large loop, resulting in the generation of cumbersome assembly code that would fail to make efficient use of a superscalar processor's resources (e.g., the schedule obtained for the larger loop body can be poor, causing stalls in one or more of the pipelines). Moreover, the resulting assembly code would be inefficient with respect to simply unrolling the loop in question, without the data prefetches, because data fetches in the loop would at least load the requisite data at the proper time (i.e., the problem of data wait stalls would not be encountered).
Thus, it is desirable to effectively and efficiently marry the technique of modulo scheduling with data prefetching in order to increase instruction-level parallelism while effectively using data prefetching to maintain efficient use of cache memory and data throughput. Preferably, such a technique should not require major modifications to present compiler technology to effect support of such capabilities. Further, the number of data prefetches should be minimized to maximize the efficiency with which the architecture's memory hierarchy is utilized.
The present invention solves the problems associated with the prior art by efficiently and effectively marrying modulo scheduling techniques with data prefetching techniques. The present invention does so by taking into consideration parameters related to the pending insertion of prefetch instructions in the code generated by modulo scheduling. Subsequently, prefetch instructions are inserted at even intervals in the assembly code generated. This allows the assembly code generated to maximize the efficiency with which a loop is executed. The insertion of prefetch instructions by the present invention is performed in a manner which minimizes spills and reloads, thus maximizing the efficiency with which the architecture's resources are utilized.
According to one aspect of the present invention, a computer-implemented method for compiling source code, having a loop therein, into output code is disclosed. The method begins by calculating a prefetch-based kernel unroll factor for the loop. This entails calculating a prefetch-based unroll factor for the loop, calculating a kernel unroll factor for the loop, and calculating the prefetch-based kernel unroll factor by adjusting the kernel unroll factor using the prefetch-based unroll factor. Next, the output code is generated. Finally, one or more prefetch instructions are inserted into the output code. This entails determining a prefetch ordering, inserting a group of prefetch instructions into the output code, and determining an address displacement for at least one prefetch instruction in the group of prefetch instructions. It is expected that the group of instructions includes at least one prefetch instruction.
A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.