1) Field of the Invention
The present invention relates to a translator program, a program converting method, and a translator device for analyzing a parallel language program and converting the parallel language program to a sequential language program that is subjected to distributed processing by a plurality of processors. More particularly, the present invention relates to a translator program, a program converting method, and a translator device capable of preventing reduction in execution efficiency of a loop due to calculation of a loop index.
2) Description of the Related Art
Parallelization of scientific and engineering calculations is carried out based on distributed processing of data and distributed processing of a loop. In the former, the data is allocated to a plurality of processors, each of which processes the data allocated to it. In the latter, the loop is allocated to a plurality of processors, each of which processes the part of the loop allocated to it. FIG. 21A is a diagram for explaining distribution of data, and FIG. 21B is a diagram for explaining distribution of a loop.
As shown in FIG. 21A, a one-dimensional array A consisting of 30 elements of data is divided into four blocks that consist of eight, eight, eight, and six elements of data, respectively. Four processors perform distributed processing on the four blocks. As shown in FIG. 21B, a process of 30 repetitions for a loop variable I is divided into four blocks that consist of eight, eight, eight, and six repetitions, respectively. Four processors perform distributed processing on the four blocks.
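The 8/8/8/6 division described above can be reproduced with a short sketch. The function name `block_ranges` and the rule of using a block width equal to the ceiling of the length divided by the number of processors are assumptions made for illustration; the figures themselves do not state how the width is chosen.

```python
def block_ranges(n, nprocs):
    """Return one (lower, upper) pair of 1-based inclusive bounds per processor.

    Assumption: the block width is ceil(n / nprocs), so the last
    processor receives whatever is left over.
    """
    width = -(-n // nprocs)  # ceiling division
    ranges = []
    for p in range(nprocs):
        lower = p * width + 1
        upper = min((p + 1) * width, n)
        ranges.append((lower, upper))
    return ranges


ranges = block_ranges(30, 4)
sizes = [upper - lower + 1 for lower, upper in ranges]
print(ranges)  # [(1, 8), (9, 16), (17, 24), (25, 30)]
print(sizes)   # [8, 8, 8, 6]
```

The same bounds serve both for data division (which array elements a processor holds) and for loop division (which repetitions it executes), which is why the two are handled together as division of an index.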
As explained above, in the scientific and engineering calculation, the distributed processing performed on the data and the loop allows high speed of processing. It is noted that division of the data and division of the loop are integrally handled as division of an index.
In a parallel programming language such as High Performance Fortran (HPF), OpenMP, and VPP Fortran, division of an index can be specified. More specifically, VPP Fortran has a format for expressing data division and loop division. HPF has a format for expressing the data division and can indirectly express the loop division. OpenMP has a format for expressing the loop division but does not have a format for expressing the data division because this language is targeted for a shared memory computer.
These parallel programming languages provide a plurality of types of index divisions. FIG. 22 is a schematic diagram for explaining types of distribution based on the index divisions, in which 30 elements are divided into four blocks. As shown in this figure, the distribution based on the index divisions includes five types as follows: block distribution, cyclic distribution, block-cyclic distribution, uneven block distribution, and irregular distribution.
Details of the respective distributions are as follows. The block distribution is suitable for an application in which there is a close correspondence between adjacent pieces of data (array elements) because data continuity remains in each dimension distributed. If calculation load is uneven on calculation ranges, or if the capabilities of parallel computers are nonuniform (hetero environment), the uneven block distribution may be used instead of the block distribution, which allows load balance to be adjusted.
If the load balance or the calculation range is different in each execution loop or is undefined until execution, then the cyclic distribution may be used, which allows the load distribution to be made almost even. In the cyclic distribution, however, if there is a close correspondence between adjacent pieces of data, communications frequently occur. In this case, the block-cyclic distribution is used as an intermediate method between the block distribution and the cyclic distribution.
If a correspondence between data and processors is irregular like in a case where particles floating in space are calculated and the correspondence needs to be controlled by a table, then the irregular distribution can be used.
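As a rough illustration of how the regular distributions differ, the following sketch computes which processor owns a given global index under block, cyclic, and block-cyclic distribution. The function names and the zero-based processor numbering are assumptions for this example and are not taken from any of the languages discussed.

```python
def owner_block(i, n, nprocs):
    """Block: contiguous ranges of width ceil(n / nprocs)."""
    width = -(-n // nprocs)
    return (i - 1) // width


def owner_cyclic(i, nprocs):
    """Cyclic: consecutive indexes are dealt out one at a time."""
    return (i - 1) % nprocs


def owner_block_cyclic(i, nprocs, width):
    """Block-cyclic: blocks of `width` indexes are dealt out in turn."""
    return ((i - 1) // width) % nprocs


# 30 indexes over 4 processors, as in the 30-element example.
print([owner_block(i, 30, 4) for i in (1, 8, 9, 30)])     # [0, 0, 1, 3]
print([owner_cyclic(i, 4) for i in (1, 2, 5, 6)])         # [0, 1, 0, 1]
print([owner_block_cyclic(i, 4, 2) for i in (1, 2, 3, 9)])  # [0, 0, 1, 0]
```

Adjacent indexes always share an owner under block distribution except at block boundaries, never share one under cyclic distribution, and share one within each block under block-cyclic distribution, which matches the trade-off between data continuity and load balance described above.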
FIG. 23A to FIG. 23C are diagrams for explaining parallelism based on a parallel programming language. FIG. 23A is an example of a sequential program, FIG. 23B is an example of parallelism based on OpenMP, and FIG. 23C is an example of parallelism (input example 1) based on HPF.
As shown in FIG. 23B, a “parallel do” directive on line 4 directs that the loop “do I” be parallelized. Since OpenMP is based on sharing of data, there is no directive for division and arrangement of data.
As shown in FIG. 23C, line 2 and line 3 direct that a variable A is evenly divided into four blocks along its first dimension and that the four blocks are arranged in four processors. An “independent” directive on line 6 directs that the loop “do I” can be parallelized.
To parallelize the loop, it is necessary to decide a range of a loop index to be executed by each processor. In OpenMP, how to divide the loop is decided in language specification, and the range of the index is mechanically divided. In HPF, a compiler automatically decides how to divide the loop so as to match, as much as possible, with division and arrangement of a variable (A in this example) that is accessed in the loop.
As explained above, the compiler generates code with which a plurality of processors can actually execute the parallelized program in parallel. At this time, the compiler divides the index ranges of the data and the loop, and allocates the divided index ranges to the processors. By converting the indexing of the data, the declaration type is made identical among the processors, which allows static data allocation using a Single Program/Multiple Data (SPMD) system.
FIG. 24A is a diagram for explaining a correspondence between divided indexes of the data or the loop and each processor. In an array a (i, j) of 1000×1000, data from line 1 to line 250 is allocated to a zero-th processor P(0), data from line 251 to line 500 is allocated to a first processor P(1), data from line 501 to line 750 is allocated to a second processor P(2), and data from line 751 to line 1000 is allocated to a third processor P(3).
FIG. 24B is a diagram of code output by a conventional compiler for the parallel language program shown in FIG. 23C. As shown in FIG. 24B, myID is a processor number, and by using the myID, a lower dimension bound (hereinafter, “lower bound”) myLB and an upper dimension bound (hereinafter, “upper bound”) myUB of an index range which each processor takes charge of are calculated. The lower bound and the upper bound are used to specify a repetition range and make an access to data.
If an access is made to data a (i, j), i=gtol(I)=I−myLB+1=I−250*myID is required when the access is executed. Specifically, this program indicates conversion from a global index I that is an index of an input program before conversion to a local index i that is an index of a program after the conversion.
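The bound calculation and the conversion gtol for this block-distributed case can be sketched as follows. The sketch assumes the 1000-element extent and four processors of FIG. 24A; the Python names `bounds` and `gtol` mirror the myLB, myUB, and gtol of the text but are otherwise illustrative and do not reproduce the code of FIG. 24B.

```python
N = 1000            # extent of the distributed dimension
NPROCS = 4          # number of processors
WIDTH = N // NPROCS  # 250 elements per processor


def bounds(my_id):
    """Return (myLB, myUB), the global index range handled by processor my_id."""
    my_lb = WIDTH * my_id + 1
    my_ub = WIDTH * (my_id + 1)
    return my_lb, my_ub


def gtol(global_i, my_id):
    """Convert global index I to local index i on processor my_id."""
    my_lb, _ = bounds(my_id)
    return global_i - my_lb + 1  # equals I - 250 * myID


print(bounds(1))      # (251, 500)
print(gtol(251, 1))   # 1
print(gtol(1000, 3))  # 250
```

Every processor declares the same local extent of 250 elements, which is what allows the identical declaration type under SPMD; the price is one subtraction per access to convert the index.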
The conversion “gtol” becomes a different function depending on a distribution type or a parameter (the length of an index or the number of divisions) and becomes a complex computational expression depending on the distribution type. Therefore, the “gtol” is often realized by calling an execution-time library. In the case of irregular distribution, the “gtol” cannot be expressed only by one expression, and therefore, a table is used for reference.
Such a technology is disclosed in the following non-patent literature:
Non-patent literature 1: High Performance Fortran Forum: “High Performance Fortran Language Specification Version 2.0”, 1997;
Non-patent literature 2: “High Performance Fortran 2.0 Official Manual”, Springer-Verlag Tokyo, 1999 (ISBN 4-431-70822-7) (translation by Fujitsu Ltd., Hitachi Ltd., and NEC Corp.);
Non-patent literature 3: Japan Association for High Performance Fortran (JAHPF): “HPF/JA Language Specification Version 1.0”, November 1999, searched on Jun. 18, 2004, Internet <URL: http://www.hpfpc.org/jahpf/spec/hpfja-v10-eng.pdf>;
Non-patent literature 4: OpenMP Architecture Review Board, “OpenMP Fortran Application Program Interface Version 2.0”, November 2000, searched on Jun. 18, 2004, Internet <URL: http://www.openmp.org/specs/mp-documents/fspec20.pdf>; and
Non-patent literature 5: OpenMP Architecture Review Board, “OpenMP C/C++ Application Program Interface Version 1.0”, October 1998, searched on Jun. 18, 2004, Internet <URL: http://www.openmp.org/specs/mp-documents/cspec10.pdf>.
In the code shown in FIG. 24B, the conversion gtol from the global index I to the local index i is comparatively simple. Even in this case, however, the cost of the index conversion (execution of numerical operations) is quite high compared with the cost of a variable access without index conversion (about one “load” instruction). The execution time required for this part may therefore increase to several times that of the same access without index conversion.
A more general case includes distribution other than block distribution, a mismatch between the initial or final value of the loop and the declaration of an array, a loop increment other than 1, and an index size that is not evenly divisible by the number of processors. In such cases, the parallelized loop becomes more complex, which reduces execution efficiency.
For example, FIG. 25A is a diagram for explaining a parallel language program in which block-cyclic distribution is used. In this example, the upper and lower bounds of a loop are different from those of an array. In this case, a correspondence between global indexes and local indexes is shown in FIG. 25B.
As shown in FIG. 25B, in the processor P(0), global indexes “1-5” correspond to local indexes “1-5”, global indexes “21-25” correspond to local indexes “6-10”, and global indexes “981-985” correspond to local indexes “246-250”.
FIG. 25C is a table of conventional code generated for the parallel language program shown in FIG. 25A. As shown in FIG. 25C, the index conversion expression is (I−1)/20*5+MOD(I−1,5)+1 (the division is integer division in which the fractional part of the quotient is discarded). As for the parallel loop, the range to be executed by each processor is a discontinuous range made up of blocks of five elements. Therefore, the range cannot be expressed using a single DO loop, and a complex double loop is used. The complexity of the output code is likely to reduce execution efficiency.
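The conversion expression and the double loop can be illustrated with the sketch below. It assumes four processors and a block width of five, consistent with the period of 20 in the expression; the loop bounds 2 and 999 are hypothetical values chosen only to show a loop range that differs from the array bounds, since the actual bounds of FIG. 25A are not restated here. The function names are likewise illustrative.

```python
NPROCS = 4
WIDTH = 5
PERIOD = NPROCS * WIDTH  # 20 global indexes per full cycle


def gtol_block_cyclic(global_i):
    """Local index: (I-1)/20*5 + MOD(I-1,5) + 1 with integer division."""
    return (global_i - 1) // PERIOD * WIDTH + (global_i - 1) % WIDTH + 1


def owned_iterations(my_id, lo, hi):
    """Global indexes in [lo, hi] executed by my_id, found with a double loop."""
    result = []
    first_block = my_id * WIDTH + 1
    for block_start in range(first_block, hi + 1, PERIOD):  # outer: each block
        start = max(block_start, lo)
        stop = min(block_start + WIDTH - 1, hi)
        for global_i in range(start, stop + 1):             # inner: within block
            result.append(global_i)
    return result


print([gtol_block_cyclic(i) for i in (1, 5, 21, 25, 981, 985)])
# [1, 5, 6, 10, 246, 250]
print(owned_iterations(0, 2, 999)[:6])   # [2, 3, 4, 5, 21, 22]
print(owned_iterations(3, 2, 999)[-4:])  # [996, 997, 998, 999]
```

The outer loop steps from block to block and the inner loop clips each block against the loop bounds, which is the extra control structure that a single contiguous range would not need.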
FIG. 26A is a diagram for explaining a parallel language program in which irregular distribution is used as another example of distribution. In this case, a correspondence between global indexes and local indexes is as shown in FIG. 26B.
FIG. 26C is a diagram of a conventional code generated for the parallel language program shown in FIG. 26A. In this code, index conversion cannot be expressed by an expression, and therefore, a table GTOL generated by the compiler is used for reference.
Since the values of the loop index executed in each processor cannot be expressed by a DO statement of Fortran, an IF statement is used to select the repetitions to be executed. In other words, every processor wastefully iterates over the entire loop range. The relative amount of waste grows as the number of processors increases.
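The table lookup and the IF-guarded loop can be sketched as follows. The owner table `OWNER`, its twelve entries, and the three processors are hypothetical values invented for this example and do not correspond to FIG. 26A to FIG. 26C; only the name GTOL is taken from the text.

```python
# Hypothetical irregular mapping: OWNER[I-1] is the processor that holds
# global index I (12 indexes, 3 processors).
OWNER = [0, 2, 2, 1, 0, 0, 1, 2, 1, 0, 2, 1]


def build_gtol(owner):
    """Build the table GTOL: the local index of each global index on its owner."""
    next_local = {}
    table = []
    for proc in owner:
        next_local[proc] = next_local.get(proc, 0) + 1
        table.append(next_local[proc])
    return table


GTOL = build_gtol(OWNER)


def guarded_loop(my_id):
    """Scan the whole global range and keep only the repetitions owned by my_id."""
    visited = 0
    executed = []
    for global_i in range(1, len(OWNER) + 1):
        visited += 1
        if OWNER[global_i - 1] == my_id:  # the IF statement that selects work
            executed.append((global_i, GTOL[global_i - 1]))
    return visited, executed


print(GTOL)             # [1, 1, 2, 1, 2, 3, 2, 3, 3, 4, 4, 4]
print(guarded_loop(0))  # (12, [(1, 1), (5, 2), (6, 3), (10, 4)])
```

In this sketch each processor visits all 12 repetitions to execute 4 of them. With more processors the visited count stays the same while the executed count per processor shrinks, which is the growing waste described above.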