1. Field of the Invention
The present invention relates to a compiler which converts a source program into a program for parallel computer use, and in particular to a parallel compiler device and compilation method for optimizing data transmission between the processing elements of a parallel computer.
2. Description of the Related Art
In recent years, the development of parallel computer systems which execute programs in parallel has been advancing. A parallel computer has a plurality of processing elements (PE) which execute a program, and achieves parallel processing by having each PE execute the part of the program it has been assigned. FIG. 1 is a block diagram which shows an example construction of a parallel computer. The parallel computer in this diagram is comprised of processing elements PE1 through PE8, with each of these PE having a processor and a memory. Each memory stores the program to be executed by the corresponding processor. This program must be compiled so that it can be processed in parallel by the PE.
Normally, source programs written in high-level languages such as FORTRAN are written on the premise of serial execution. In order to have these source programs executed by a parallel computer, the compiler which generates the object program from the source program must be equipped with a function for generating an object program for parallel execution.
The following is a description of a compiler constructed according to the prior art which generates an object program for parallel execution from a source program. Such a compiler achieves parallel execution by extracting the parallelism from the repetitive processes contained in the source program, such as do loops, and allocating the iterations of the loops to the PE. Here, the extraction of the parallelism is executed for every multiple loop separately. The above technique is described in [David A. Padua et al.: Advanced compiler optimizations for supercomputers, Communications of the ACM, pp. 1184-1201 (1986)]. In the following explanation, groups of instructions in the program which are usually executed repetitively will be called loops (or loop processes), while the process executed by one cycle of one of these loops will be referred to as an iteration.
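The allocation of loop iterations to the PE described above can be illustrated by the following minimal sketch. It is not taken from the cited reference; the function name allocate_iterations and the block-wise allocation policy (contiguous ranges of iterations per PE) are assumptions made only for illustration, and Python is used here purely as a notation.

```python
def allocate_iterations(n_iterations, n_pe):
    """Split loop indices 1..n_iterations into contiguous blocks, one per PE.

    Hypothetical helper: block-wise allocation is an assumed policy,
    not one stated in the source text.
    """
    base, extra = divmod(n_iterations, n_pe)
    allocation = []
    start = 1
    for pe in range(n_pe):
        # The first `extra` PE take one additional iteration each.
        count = base + (1 if pe < extra else 0)
        allocation.append(list(range(start, start + count)))
        start += count
    return allocation


# With 8 iterations and 8 PE, each PE executes exactly one iteration.
one_per_pe = allocate_iterations(8, 8)

# With 10 iterations and 4 PE, the blocks differ in length by at most one.
uneven = allocate_iterations(10, 4)
```

Under this assumed policy, entry k of the returned list holds the iterations executed by PE(k+1).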
FIG. 2 gives an example of a source program written in FORTRAN. In this example program, there is a first multiple loop consisting of the loops 301 and 302, and a second multiple loop consisting of the loops 303 and 304. A compiler constructed according to the prior art first determines that parallelization is possible for the first multiple loop and for the second multiple loop, and then parallelizes the respective multiple loops. FIG. 3A shows part of the program once the first multiple loop has been parallelized, while FIG. 3B shows part of the program once the second multiple loop has been parallelized. In FIG. 3A, the programs are shown as being executed by PE1-PE8 with regard to i in loop 301. In FIG. 3B, the programs are shown as being executed by PE1-PE8 with regard to j in loop 303.
In general, in the parallel computer shown in FIG. 1, data which is necessary for the calculations is frequently transmitted between the processing elements. After the program in FIG. 3A has been executed, the array a(i,j) is stored in a distributed manner in the memory of every PE, as shown in FIG. 4. In order to then execute the program in FIG. 3B, it is necessary to store the array a(i,j) in the memory of every PE in the way shown in FIG. 5. Consequently, after the program shown in FIG. 3A has been executed, data transmission is executed between all of the PE, and only then is the program shown in FIG. 3B executed.
However, according to the above prior art, since the parallelization is executed for every multiple loop separately, there is the problem that the number of data transmissions resulting from parallelization is not necessarily the lowest possible number. For the above example, FIG. 6 shows only the data transmitted to PE1 from every PE. Since 8 array elements are transmitted to each of the other PE in the same way as to PE1, there are 64 sets of array elements to be transmitted in total for PE1-PE8.
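The counting above can be reproduced with the following minimal sketch. FIG. 2 through FIG. 6 are not reproduced here, so the layout is an assumption: an 8 x 8 array a(i,j), with row i held by PE i after the first multiple loop and column j required by PE j for the second multiple loop. All function and variable names are hypothetical.

```python
N_PE = 8  # PE1..PE8, as in FIG. 1
N = 8     # assumed extent of a(i,j) in each dimension


def owner_after_first_nest(i, j):
    # Assumption: loop 301 is divided over i, so PE i holds row i.
    return i


def elements_needed(dest_pe):
    """List (source PE, i, j) for every a(i,j) that dest_pe requires.

    Assumption: loop 303 is divided over j, so PE j requires column j.
    """
    j = dest_pe
    return [(owner_after_first_nest(i, j), i, j) for i in range(1, N + 1)]


per_pe = {pe: elements_needed(pe) for pe in range(1, N_PE + 1)}

# Elements each PE gathers for its column, counting every source PE.
total = sum(len(needed) for needed in per_pe.values())

# Of these, the elements whose source is a different PE.
remote = sum(
    1 for pe, needed in per_pe.items() for (src, _, _) in needed if src != pe
)

print(len(per_pe[1]), total, remote)
```

Under these assumptions each PE gathers 8 elements, giving 64 in total, which matches the figures stated above. In this assumed layout one of the 8 elements for each PE, a(j,j), is already held locally, so 56 of the 64 would cross between different PE; whether FIG. 6 counts the local element is not determinable from the text.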