1. Field of the Invention
The present invention relates to a method and system for loop conversion. More particularly, it concerns a method and system for loop conversion for generating object code to execute a loop for accessing a multi-dimensional array at high speed for use in a processor having an instruction for loading or storing two data at a time.
2. Description of the Prior Art
Many optimization techniques for increasing the efficiency of object code generated by a compiler have been proposed. Of these techniques, optimization of loops is particularly important because a loop is executed at high frequency in a program and occupies a high percentage of the execution time. Prior loop optimization techniques are described, for example, by Masataka Sasaki in "Programming Language Processing System," Iwanami Lecture Software Science, Vol. 5, pp. 459-493, 1989, and by Hirotoku Kasahara in "Parallel Processing Techniques," Corona Inc., pp. 113-128, 1991.
On the other hand, with respect to hardware, many methods have been proposed for speeding up execution of object code. One of these methods uses a super scalar processor. The super scalar processor fetches a plurality of, for example two or four, consecutive instructions at a time and, if these instructions are executable in parallel, executes them in parallel.
For the super scalar method, the object program to be executed must have enough parallelism to make the processor execute at high speed. The reason for this is that the super scalar processor executes instructions in parallel only when possible. For example, if a value calculated by one instruction is used by the next instruction, the two instructions cannot be executed in parallel. When a program written in a high-level language, such as FORTRAN or C, is compiled for execution on the super scalar processor and the source program does not express parallelism explicitly, the compiler may optimize the program so that parallelism is achieved when translating the source program to machine language. Parallelism is particularly important in the loops contained in the program.
As an example, let us discuss the following triply nested loop in FORTRAN. ##EQU1##
In the execution of this program, statement (6), the innermost statement of the nest, is executed at the highest frequency. It is therefore most effective to optimize, or parallelize, statement (6) of the program. Statement (6) is executed by the following five instructions, with array address calculation omitted:

Load R(J, K)                           (8)
Load R(I, K)                           (9)
Calculate R(J, I)*R(I, K)              (10)
Calculate R(J, K)-(Result of (10))     (11)
Store (Result of (11)) into R(J, K)    (12)
R(J, I) is not loaded here because its value is loaded once outside the loop. Since the value does not change within the loop, it is not necessary to load it again inside the loop.
Of the above five instructions, only instructions (8) and (9) can be executed in parallel. The others cannot be executed in parallel, since each uses the result of the preceding instruction. This means that the above program does not make the most of the characteristic of the super scalar processor, namely its ability to issue a plurality of instructions at one time.
Loop unrolling is one of the techniques for drawing parallelism from such a program. It is an optimizing technique in which the body of the loop is copied plural times, which reduces the number of loop iterations and at the same time increases the number of independent instructions. As an example, the innermost loop in the program cited above can be unrolled two times as follows. (Only the innermost loop is shown. Note that the value of "K" is increased by two.)

DO 8 K=I+1, 256, 2                         (13)
R(J, K)=R(J, K)-R(J, I)*R(I, K)            (14)
R(J, K+1)=R(J, K+1)-R(J, I)*R(I, K+1)      (15)
8 CONTINUE                                 (16)
After the unrolling, the instruction strings executed by statements (14) and (15) are as follows:

Load R(J, K)                           (17)
Load R(I, K)                           (18)
Calculate R(J, I)*R(I, K)              (19)
Calculate R(J, K)-(Result of (19))     (20)
Store (Result of (20)) into R(J, K)    (21)
Load R(J, K+1)                         (22)
Load R(I, K+1)                         (23)
Calculate R(J, I)*R(I, K+1)            (24)
Calculate R(J, K+1)-(Result of (24))   (25)
Store (Result of (25)) into R(J, K+1)  (26)
Of these instructions, (17) and (22), (18) and (23), (19) and (24), and so on are independent of each other. They can therefore be (logically) executed in parallel. That is, loop unrolling can increase the parallelism of the program.
However, even if instructions are logically executable in parallel, hardware restrictions may prevent such execution. In other words, if two instructions compete for the same execution unit, such as an adder, multiplier, or memory port, the two instructions cannot be executed in parallel. Such instruction pairs include, for example, two add instructions, two load instructions, or a load instruction and a store instruction. Of the above-mentioned instruction strings, (17) and (22), (18) and (23), and so on cannot be physically executed in parallel because both are load instructions. Since six of the ten instructions (17) to (26) are load or store instructions, the program cannot obtain sufficient parallelism. In general, it frequently occurs that programs have a bottleneck in such load and store instructions.
In order to solve this problem, we may use a processor having an instruction for loading or storing two data at a time. (Note that the data to be loaded or stored must be arranged consecutively in the memory, and that loading and storing cannot both be executed at the same time.) With this instruction, the number of load and store instructions can be reduced. This, in turn, reduces the possibility of competition for the execution unit (memory port).
For the instruction mentioned above, which loads or stores two data at a time, the data to be loaded or stored must be arranged in a consecutive area of the memory. Even after loop unrolling, however, the data to be loaded or stored are not always arranged consecutively.
For example, of the instruction strings (17) to (26) after the loop unrolling, the four data to be loaded are R(J, K), R(J, K+1), R(I, K), and R(I, K+1). These data are not consecutive in the memory. This is because, in each pair, it is the right-hand subscript that changes consecutively, whereas FORTRAN stores multi-dimensional arrays so that the leftmost subscript varies fastest in memory: R(1, 1), R(2, 1), R(3, 1), and so on. (In the C language, by contrast, the rightmost subscript varies fastest.) In addition, the two data to be stored, R(J, K) and R(J, K+1), are likewise not located consecutively in memory. Under these circumstances, the example program cannot effectively use the instruction for loading or storing two data at a time. Therefore, the bottleneck in the load and store instructions is still not solved.
Another method for solving the bottleneck problem is an optimization technique called loop interchange, which is described in the documents mentioned above. Loop interchange is a technique in which an inner and an outer loop of a nested loop are interchanged. However, this technique cannot be applied to all loops: it cannot be applied if the interchange changes the meaning (execution results) of the program. For the loop in the example, the meaning of the program is changed when the inner and outer loops are interchanged.