A compiler is a computer program that transforms a source computer program written in one language, such as Java, C or C++, into a target computer program that has the same meaning but is written in another language, such as assembly language or machine language. Compiler tasks are described in further detail in, for example, Compilers: Principles, Techniques, and Tools by A. Aho et al. (Addison Wesley, 1998), which is hereby fully incorporated herein by reference.
A compiler that is particularly well suited to one or more aspects of the code optimization task may be referred to as an optimizing compiler. One strategy that an optimizing compiler may pursue focuses on optimizing transformations, which are described in D. Bacon, et al., “Compiler Transformations for High-Performance Computing,” in ACM Computing Surveys, Vol. 26, No. 4 (Dec. 1994), which is hereby fully incorporated herein by reference. Such transformations typically involve high-level, machine-independent, programming operations (i.e., “high level optimizations”) including, for example, removing redundant operations, simplifying arithmetic expressions, removing code that will never be executed, moving invariant computations out of loops, and storing values of common sub-expressions rather than repeatedly computing them.
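By way of illustration only, two of the high level optimizations named above (moving an invariant computation out of a loop, and storing the value of a common sub-expression) may be sketched as follows. The function names `before` and `after` and the arithmetic are hypothetical and are not taken from the cited references; the sketch shows the transformation as if applied by hand to source code.

```python
def before(values, x, y):
    """Untransformed loop."""
    out = []
    for v in values:
        scale = x * y                        # invariant: same result on every iteration
        out.append(v * scale + (v * scale) // 2)   # v * scale is computed twice
    return out


def after(values, x, y):
    """Same meaning after the two transformations."""
    scale = x * y                            # invariant computation moved out of the loop
    out = []
    for v in values:
        t = v * scale                        # common sub-expression computed once and stored
        out.append(t + t // 2)
    return out
```

Both functions return the same list for the same arguments; only the number of multiplications performed inside the loop differs.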
Other strategies that an optimizing compiler may pursue focus on machine-dependent transformations (i.e., “low level optimizations”), and include instruction scheduling and register allocation.
A principal goal of some instruction scheduling strategies is to permit two or more operations to be executed in parallel, a process referred to as instruction level parallel (ILP) processing, which is typically implemented in processors with multiple execution units. One way of communicating such parallel operations to the central processing unit (CPU) of the computer system is to create very long instruction words (VLIWs), each of which specifies the multiple operations that are to be issued in a single machine cycle. For example, a VLIW may instruct one execution unit to begin a memory load and a second execution unit to begin a memory store, while a third execution unit is processing a floating point multiplication. Each execution task has a latency period (i.e., the task may take one, two, or more cycles to complete). The objective of ILP processing is to optimize the use of the execution units by minimizing the instances in which an execution unit is idle during an execution cycle. ILP processing may be implemented by the CPU and/or by an optimizing compiler.
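The packing of operations into instruction words described above may be sketched, under simplifying assumptions, as a greedy scheduler. The unit names, latencies, and operation names below are hypothetical, each execution unit is assumed to accept one new operation per cycle, and the operations are assumed to be listed in dependency order. This is an illustrative sketch and not the scheduling method of any particular compiler or processor.

```python
def list_schedule(ops):
    """ops maps name -> (unit, latency, deps), listed in dependency order."""
    issue = {}      # operation name -> cycle in which it is issued
    taken = set()   # (cycle, unit) issue slots that are already occupied
    for name, (unit, latency, deps) in ops.items():
        # Earliest cycle at which every operand has completed its latency period.
        cycle = max((issue[d] + ops[d][1] for d in deps), default=0)
        while (cycle, unit) in taken:       # the unit is busy, so try the next cycle
            cycle += 1
        taken.add((cycle, unit))
        issue[name] = cycle
    return issue


def to_words(issue):
    """Group operations by issue cycle; each group stands for one instruction word."""
    words = [[] for _ in range(max(issue.values()) + 1)]
    for name, cycle in issue.items():
        words[cycle].append(name)
    return words


OPS = {
    "load_x":     ("load",  2, []),
    "load_y":     ("load",  2, []),
    "fmul":       ("fmul",  3, ["load_x", "load_y"]),
    "store_prev": ("store", 1, []),          # independent of the multiplication
    "store_z":    ("store", 1, ["fmul"]),
}
ISSUE = list_schedule(OPS)
WORDS = to_words(ISSUE)
```

In this sketch the first word holds two operations, because the load and the independent store use different execution units. The empty words correspond to cycles in which every unit is idle while a latency period elapses, which is the condition that ILP processing seeks to minimize.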
In many applications, the majority of execution time is spent in loops. One known technique for improving the instruction level parallelism (ILP) in loops is referred to as “software pipelining”. The operations of a single loop iteration are separated into s stages. After transformation, which may require the insertion of startup code to fill the pipeline for the first s−1 iterations and cleanup code to drain the pipeline for the last s−1 iterations, a single iteration of the transformed code will perform stage 1 from pre-transformation iteration i, stage 2 from pre-transformation iteration i−1, and so on. Such a single iteration is known as the kernel of the transformed code.
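The overlap of stages described above may be illustrated by the following simulation, offered as a sketch only. The function name `run_pipelined`, the representation of stages as Python callables, and the example stages are hypothetical. Each step of the transformed loop performs stage 1 of iteration k, stage 2 of iteration k−1, and so on; stages that would refer to an iteration outside the loop bounds are skipped, which corresponds to the startup and cleanup code.

```python
def run_pipelined(stages, inputs):
    """Simulate a software-pipelined loop; returns (results, trace)."""
    s, n = len(stages), len(inputs)
    in_flight = {}            # iteration index -> value from its latest completed stage
    results = [None] * n
    trace = []                # per step: list of (stage number, original iteration)
    for k in range(n + s - 1):
        step = []
        for stage in range(s):
            i = k - stage     # stage j+1 works on original iteration k - j
            if 0 <= i < n:    # skipped while the pipeline is filling or draining
                src = inputs[i] if stage == 0 else in_flight[i]
                in_flight[i] = stages[stage](src)
                step.append((stage + 1, i))
                if stage == s - 1:
                    results[i] = in_flight.pop(i)
        trace.append(step)
    return results, trace


# Three hypothetical stages applied to four iterations.
STAGES = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
RESULTS, TRACE = run_pipelined(STAGES, [1, 2, 3, 4])
```

With s = 3 and four iterations, the first two steps fill the pipeline, the next two steps each execute all three stages and so correspond to the kernel, and the last two steps drain the pipeline.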
A particular known class of algorithms for achieving software pipelining is commonly referred to as “modulo scheduling”, as described in James C. Dehnert and Ross A. Towle, “Compiling for the Cydra 5,” in The Journal of Supercomputing, volume 7, (Kluwer Academic Publishers, Boston 1993), which is hereby fully incorporated herein by reference. Modulo scheduling is also described in the following reference, which is hereby fully incorporated herein by reference: B. R. Rau, “Iterative Modulo Scheduling,” in The International Journal of Parallel Programming, volume 24, no. 1 (February 1996). Modulo scheduling initiates successive loop iterations at a constant interval, measured in machine cycles, that is called the initiation interval (II).
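The role of the initiation interval may be illustrated by the following sketch, which assumes that each operation has already been assigned an execution unit and a start cycle relative to the beginning of its own iteration. The function names, unit names, and cycle values are hypothetical and are not taken from the cited references. Because iteration i begins at cycle i × II, an operation with relative start cycle t occupies kernel slot t mod II, and two operations that need the same unit cannot share a slot.

```python
def issue_cycle(iteration, offset, ii):
    """Absolute cycle at which an operation of a given iteration is issued."""
    return iteration * ii + offset


def kernel_layout(offsets, ii):
    """offsets maps op -> (unit, start cycle within one iteration).

    Returns op -> (kernel slot, stage); raises ValueError on a resource conflict.
    """
    reserved = {}   # (slot, unit) -> op; plays the part of a modulo reservation table
    layout = {}
    for op, (unit, t) in offsets.items():
        slot, stage = t % ii, t // ii
        if (slot, unit) in reserved:
            raise ValueError(
                f"{op} and {reserved[(slot, unit)]} both need {unit} in slot {slot}"
            )
        reserved[(slot, unit)] = op
        layout[op] = (slot, stage)
    return layout


# Hypothetical schedule of one iteration, checked against II = 3.
OFFSETS = {
    "load_x": ("mem", 0),
    "load_y": ("mem", 1),
    "fmul":   ("fpu", 3),
    "store":  ("mem", 8),
}
LAYOUT = kernel_layout(OFFSETS, 3)
```

In this sketch, moving the store to relative cycle 6 would place it in slot 0 on the same unit as the first load, so the schedule would be rejected for II = 3. A scheduler would then have to choose a different cycle for the store or a larger II.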
However, it would be desirable to further optimize the machine code that is generated by use of modulo scheduling techniques.