1. Field of the Invention
The present invention relates to an optimization apparatus for use in a compiler that compiles a source program including loops with arithmetic expressions into an object program including instruction sequences, and a computer-readable storage medium that stores an optimization program. The present invention particularly relates to improvements in optimization techniques to generate codes for a processor that executes instructions in parallel.
2. Related Art
How to improve the execution efficiency of loop structures that use “for statements”, “while statements”, or the like has long been one of the subjects in the field of language processing.
Generally, a loop structure (hereafter referred to as a loop) is composed of (a) a control statement, such as a “for statement” or a “while statement”, and (b) a body made up of at least one arithmetic expression. In loop processing, the body is executed repeatedly for as long as the repeat condition prescribed by the control statement holds. The run unit in the loop processing is called an iteration, and as many iterations as the number of repetitions prescribed by the control statement are executed. For example, when the control statement indicates that the body of the loop is to be repeated 100 times, 100 iterations of the body of the loop are executed. Executing some or all of the iterations in parallel can improve the execution efficiency of the loop. Optimization techniques, such as loop unrolling and software pipelining, are conventionally known to be effective in realizing parallelism within a loop.
Loop unrolling is an optimization technique that improves the execution efficiency of a loop, by converting an arithmetic expression included in the body of the loop into a plurality of arithmetic expressions. FIG. 1A shows a loop in which the arithmetic expression “a[i]=b[i]*(x+10)” that defines the element of the array “a” using the element of the array “b” is repeated until a variable “i” reaches 100. When loop unrolling is applied to the above loop, the arithmetic expression “a[i]=b[i]*(x+10)” in the loop body is transformed into two arithmetic expressions “a[i]=b[i]*(x+10)” and “a[i+1]=b[i+1]*(x+10)”, as shown in FIG. 1B. The expression “i++” in FIG. 1A indicates that the induction variable “i” is incremented by 1 each time the loop is repeated, whereas the expression “i+=2” in FIG. 1B indicates that the induction variable “i” is incremented by 2 each time the loop is repeated.
In FIG. 1A, one array element of the array “a” is determined each time the loop is repeated. In FIG. 1B, however, each time the loop is repeated, the two arithmetic expressions “a[i]=b[i]*(x+10)” and “a[i+1]=b[i+1]*(x+10)” are executed in parallel, defining two array elements of the array “a”.
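For illustration, the transformation from FIG. 1A to FIG. 1B can be sketched in C as follows. This is a minimal sketch rather than the code of the figures: the function names `scale_rolled` and `scale_unrolled2`, the `int` element type, and the parameter `n` (100 in the example above) are assumptions introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form, corresponding to FIG. 1A: one element of a[] per iteration. */
void scale_rolled(int *a, const int *b, int x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * (x + 10);
}

/* Unrolled by two, corresponding to FIG. 1B: two expressions per iteration
 * that do not depend on each other, with i incremented by 2.
 * Assumption: n is even (as with n = 100); an odd n would need an extra
 * remainder step, which is omitted from this sketch. */
void scale_unrolled2(int *a, const int *b, int x, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        a[i]     = b[i]     * (x + 10);
        a[i + 1] = b[i + 1] * (x + 10);
    }
}
```

Both functions produce the same contents in “a”. The unrolled form only exposes two independent expressions per iteration; whether they actually execute in parallel depends on the compiler and the target processor.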
Software pipelining is another optimization technique that improves the execution efficiency of a loop: the compiler compiles the body of the loop into machine instructions suitable for pipeline processing.
The following is an explanation of how software pipelining is applied, with reference to FIG. 2. FIG. 2A shows an example of a loop body which is composed of an instruction A, an instruction B, and an instruction C. It is assumed that these instructions cannot be executed in parallel as they have data dependency in the body of the loop. FIG. 2B shows an example where the instruction sequence shown in FIG. 2A is repeated five times in pipeline processing. In the figure, the vertical axis shows cycles, and the horizontal axis shows iteration numbers. The horizontal axis shows the numbers 1 to 5, which means that five iterations are generated (hereafter referred to as a first iteration, a second iteration, a third iteration, a fourth iteration, and a fifth iteration).
During cycles 1 and 2, the instructions A and B in the first iteration and the instruction A in the second iteration are put in the pipeline. At this stage, no instruction in the third to fifth iterations is yet put in the pipeline. This stage is referred to as “prolog”, where there is at least one iteration whose instruction is not put in the pipeline. During cycles 3 to 5, the instruction C in the first iteration, the instructions B and C in the second iteration, the instructions A to C in the third iteration, the instructions A and B in the fourth iteration, and the instruction A in the fifth iteration are put in the pipeline. This stage is referred to as “steady state”. During cycles 6 and 7, the instruction C in the fourth iteration and the instructions B and C in the fifth iteration are put in the pipeline. This stage is referred to as “epilog”, where the iterations of the loop are completed.
To execute the instructions in parallel as shown in FIG. 2B, the compiler outputs the sequence of machine instructions shown in FIG. 2C (machine instruction sequences are expressed by assembler codes in this specification). In the figure, the code “E” denotes an end bit, which indicates that instructions preceding the code “E” are executed in parallel. In the prolog, the first iteration is compiled into the instruction A and the end bit, and the second iteration is compiled into the instructions A and B and the end bit. In the epilog, the fifth iteration is compiled into the instruction C and the end bit, and the fourth iteration is compiled into the instructions B and C and the end bit.
In the steady state, the first through fifth iterations are compiled into the instructions A, B, C, the branch instruction “bt L1”, and the end bit, describing that the instructions A, B, and C are repeated a predetermined number of times.
By generating such instructions through software pipelining, the performance of the loop processing can be enhanced.
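The schedule of FIG. 2B, as described above, can be reproduced with a short calculation. The following C sketch assumes that each instruction occupies one cycle and that one new iteration starts every cycle; the names `issue_cycle`, `total_cycles`, `classify`, and `enum phase` are hypothetical and do not appear in the figures.

```c
#include <assert.h>

/* Cycle (1-based) in which stage `stage` (1 = A, 2 = B, 3 = C) of
 * iteration `iter` (1-based) is put in the pipeline.
 * Assumption: one new iteration starts per cycle, one cycle per instruction. */
int issue_cycle(int iter, int stage)
{
    assert(iter >= 1 && stage >= 1);
    return iter + stage - 1;
}

/* Number of cycles needed for n_iter iterations of n_stage instructions. */
int total_cycles(int n_iter, int n_stage)
{
    return issue_cycle(n_iter, n_stage);
}

enum phase { PROLOG, STEADY, EPILOG };

/* Classifies a cycle, assuming n_iter >= n_stage:
 *   prolog: the pipeline is still filling    (cycle < n_stage)
 *   steady: every stage is occupied          (n_stage <= cycle <= n_iter)
 *   epilog: no new iteration is started      (cycle > n_iter)          */
enum phase classify(int cycle, int n_iter, int n_stage)
{
    assert(n_iter >= n_stage && cycle >= 1);
    if (cycle < n_stage)
        return PROLOG;
    if (cycle <= n_iter)
        return STEADY;
    return EPILOG;
}
```

For five iterations of three instructions, `total_cycles(5, 3)` is 7, and `classify` assigns cycles 1 and 2 to the prolog, cycles 3 to 5 to the steady state, and cycles 6 and 7 to the epilog, which matches the cycle ranges given in the description of FIG. 2B.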
However, the conventional techniques described above suffer from the following problems. Loop unrolling cannot be applied to a loop whose iterations cannot be executed in parallel because of a carry dependency between the iterations, so that the loop processing cannot be sped up. In software pipelining, the execution efficiency of loop processing cannot be improved when a carry dependency exists between close iterations, such as when a value resulting from one iteration is used in the following iteration.
Suppose software pipelining is applied to the source program shown in FIG. 3A. When the loop body in the source program in FIG. 3A is compiled into a sequence of instructions, the assembler codes shown in FIG. 3B are obtained. In the figure, the assembler code “load a[i], r0” is an instruction A to load the array element “a[i]” into “r0”, the assembler code “mul 3, r0” is an instruction B to multiply the value of “r0” by 3, the assembler code “add 2, r0” is an instruction C to add 2 to the value of “r0”, and the assembler code “store r0, a[i+1]” is an instruction D to store the value of “r0” into the array element “a[i+1]”. Following this, the variable “i” is updated, and a conditional branch to “L1” is performed using the value of “i” as the repeat condition (it should be noted here that the array element “a[i]” in the load instruction and the array element “a[i+1]” in the store instruction are expressed using variables in the source program for ease of explanation).
In this case, the load instruction (the instruction A) reads the value from the address to which the store instruction (the instruction D) wrote in the immediately preceding iteration. Therefore, the instruction A cannot be executed until the instruction D in the immediately preceding iteration is completed. Thus, even if software pipelining is applied to the source program in FIG. 3A, there is an execution delay of 4 cycles between the start of one iteration and the start of the following iteration, as shown in FIG. 3C.
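The effect of this dependency can be sketched in C. FIG. 3A itself is not reproduced here, so the loop below is an assumed form inferred from the assembler codes described above (load “a[i]”, multiply by 3, add 2, store to “a[i+1]”); the function names `recurrence` and `start_cycle` and the loop bound `n` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed form of the loop in FIG. 3A: each iteration reads the element
 * written by the immediately preceding iteration.
 * Writes a[1] .. a[n], so `a` must hold at least n + 1 elements. */
void recurrence(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i + 1] = a[i] * 3 + 2;
}

/* Start cycle (1-based) of iteration `iter` (1-based) when consecutive
 * iterations must start `delay` cycles apart.
 * delay = 1 corresponds to the overlap of FIG. 2B; delay = 4 corresponds
 * to the 4-cycle delay described for FIG. 3C. */
int start_cycle(int iter, int delay)
{
    assert(iter >= 1 && delay >= 1);
    return 1 + (iter - 1) * delay;
}
```

Under these assumptions, the fifth iteration starts in cycle 5 when iterations can overlap fully (`start_cycle(5, 1)`), but only in cycle 17 with the 4-cycle delay (`start_cycle(5, 4)`), so the iterations execute one after another with essentially no overlap.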