The present invention relates to a method of generating by a paralleling computer a parallel program from a source program, and in particular, to a parallel program generating method capable of optimizing data locality using data distribution and to a recording medium on which a program of the method is stored.
As a method of realizing a logically shared, physically distributed memory in a distributed shared memory parallel computer, there has been a method in which a virtual memory space logically shared among a plurality of processors (nodes) is subdivided into units called pages, and the pages are allocated to the physically distributed memories of the respective processors. Two methods are known for determining the allocation of pages to the processors, as follows.
A first data distribution method is called the first touch method, in which, when data is first referred to, the page including the data is allocated to the memory of the processor which refers to the data.
In a second data distribution method, a data distribution indicating statement or sentence is explicitly used to specify a data distribution format.
Assume, for example, that a sequential execution source program 11 shown in FIG. 9 is inputted. Assume also that the system is a distributed shared memory parallel computer including four processors and that the page size is five array elements. Array elements are allocated to processors pe0 to pe3 according to the first touch data distribution. Elements of array A are first referred to by a processor in an initialization loop (lines 23 to 25 of FIG. 9) of procedure init. Therefore, the elements of array A, i.e., A(1:25), A(26:50), A(51:75), and A(76:100), are allocated to pe0 to pe3, respectively. In this connection, pe0 to pe3 represent processors 0 to 3, respectively.
When the array elements are simply allocated according to an initialization loop first referred to by a processor as above, the data is distributed such that the elements 1:100 are equally distributed, i.e., 25 elements are distributed to each of processors pe0 to pe3.
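The first touch allocation described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the helper names `block_ranges` and `first_touch` are hypothetical, the loop iterations are assumed to be divided into contiguous equal blocks among the processors, and a page touched by more than one processor in the same loop is assumed to go to the lowest-numbered one (a real machine resolves that race at run time).

```python
def block_ranges(lo, hi, nprocs):
    """Split the inclusive range lo..hi into nprocs contiguous blocks."""
    size = -(-(hi - lo + 1) // nprocs)  # ceiling division
    return [(lo + p * size, min(lo + (p + 1) * size - 1, hi))
            for p in range(nprocs)]


def first_touch(loops, page_size, nprocs):
    """Return {page index: owning processor} under first touch.

    Page k holds array elements k*page_size+1 .. (k+1)*page_size.
    `loops` lists the inclusive element ranges of the parallel loops
    in the order they execute.
    """
    page_owner = {}
    for lo, hi in loops:
        for proc, (first, last) in enumerate(block_ranges(lo, hi, nprocs)):
            for elem in range(first, last + 1):
                page = (elem - 1) // page_size
                # Only the first reference to a page fixes its owner.
                page_owner.setdefault(page, proc)
    return page_owner


# Initialization loop over A(1:100), four processors, five elements per page.
owners = first_touch([(1, 100)], page_size=5, nprocs=4)
for proc in range(4):
    pages = [pg for pg, o in owners.items() if o == proc]
    first_elem = min(pages) * 5 + 1
    last_elem = (max(pages) + 1) * 5
    print(f"pe{proc}: A({first_elem}:{last_elem})")
```

With these assumptions, each processor receives five consecutive pages, i.e., 25 elements, in agreement with the allocation stated in the text.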
On the other hand, when a data distribution indicating statement “c$distribute A(block)” is inserted in a program declarative section of a sequential execution source program (e.g., 1:25, 26:50, 51:75, and 76:100 are specified in lines 4 to 7 of FIG. 11, which will be described later), the data are equally distributed to processors pe0 to pe3 in the same way as for FIG. 10A.
The data distribution method of the first touch scheme and that using the data distribution indicating statement have been described, for example, in pages 334 to 345 of “Data Distribution Support on Distributed Shared Memory Multiprocessors” written by Rohit Chandra, Ding-Kai Chen, Robert Cox, Dror E. Maydan, Nedeljkovic, and Jennifer M. Anderson (SIGPLAN '97 Conference on Programming Language Design and Implementation (PLDI), Las Vegas, Nev., Jun. 15-18, 1997).
In the simple first touch data distribution method described above, if the data access pattern in the initialization loop does not match that in a kernel loop (the loop requiring the longest execution time among the loops in the entire program), data locality in the kernel loop deteriorates when the parallel program obtained by converting the sequential execution source program is executed. In the simple first touch scheme, this is one of the causes which hinder improvement of the parallel program processing speed. For example, suppose that the data is equally distributed to the four processors pe0 to pe3 as shown in FIG. 10A and that the subroutine containing the kernel loop in lines 33 to 35 of FIG. 9, in which variable i ranges from 1 to 60, is executed 10000 times. If the elements of array A referred to by a processor are not all allocated to the memory of that processor, the processor must access a remote memory to acquire them. This lowers the processing speed.
Moreover, in the data distribution method using the simple data distribution indicating statement, there may be a data distribution which cannot be easily expressed by an indicating statement, so that data cannot be optimally distributed. In such a situation, when the simple data distribution indicating statement is used, data locality may deteriorate. This is one of the causes which prevent improvement of the processing speed of the generated parallel program.
For example, when sequential execution source program 11 shown in FIG. 9 is inputted to a compiler and is converted into a parallel program, if there are four processors and the first touch data distribution is adopted, the elements of array A are allocated as shown in FIG. 10A by the initialization loop (lines 23 to 25 of FIG. 9) of procedure init, which first refers to array A. Namely, A(1:25), A(26:50), A(51:75), and A(76:100) are allocated to pe0 to pe3, respectively. However, the kernel loop (lines 33 to 35 of FIG. 9) of procedure kernel refers to array A in the following ranges: A(41:55), A(56:70), A(71:85), and A(86:100) for pe0 to pe3, respectively. As can be seen from FIG. 10C, A(41:70) and A(76:85) are referred to by processors other than those to which they are allocated, namely, by remote reference (R). As a result, 66.7% of all data references (40 of the 60 elements referred to) are remote references (R). In the situation of FIG. 10B, local reference (L), i.e., access to data allocated to the referring processor itself, rarely takes place; only the entire data of processor pe3 and part of the data of processor pe2 are accessed by local reference (L). In the data allocation employing a simple data distribution indicating statement, it is difficult to specify the data distribution shown in FIG. 10B.
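The 66.7% figure can be checked by counting, for each processor, how many of the elements it refers to in the kernel loop are owned by another processor. The following is a minimal sketch; the names `init_dist`, `kernel_refs`, `owner_of`, and `remote_fraction` are hypothetical, and the ranges are copied from the paragraph above.

```python
# Ownership after the initialization loop (first touch), per the text.
init_dist = {0: [(1, 25)], 1: [(26, 50)], 2: [(51, 75)], 3: [(76, 100)]}

# Elements each processor refers to in the kernel loop, per the text.
kernel_refs = {0: (41, 55), 1: (56, 70), 2: (71, 85), 3: (86, 100)}


def owner_of(elem, distribution):
    """Return the processor whose memory holds array element elem."""
    for proc, ranges in distribution.items():
        if any(lo <= elem <= hi for lo, hi in ranges):
            return proc
    raise ValueError(f"element {elem} is not allocated")


def remote_fraction(distribution, refs):
    """Fraction of kernel-loop references that go to another processor."""
    remote = total = 0
    for proc, (lo, hi) in refs.items():
        for elem in range(lo, hi + 1):
            total += 1
            if owner_of(elem, distribution) != proc:
                remote += 1
    return remote / total


print(f"{remote_fraction(init_dist, kernel_refs):.1%}")
```

Here pe0 and pe1 make 15 remote references each, pe2 makes 10 (A(76:85)), and pe3 makes none, giving 40 remote references out of 60.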
It is therefore an object of the present invention to provide a parallel program generating method in which data is distributed optimally for the kernel loop to thereby improve data locality and increase the processing speed of the parallel program.
To achieve the object in accordance with the present invention, there is provided a parallel program generating method in which loops to be paralleled are detected and then a kernel loop is detected among the loops. Next, a first touch control code is generated, and the code is placed before a first execution loop of a main program, for example, at a first position of execution statements of the main program, or the code is placed immediately before the kernel loop, to thereby produce a parallel program. By this operation, when sequential execution source program 11 of FIG. 9 is inputted to a compiler, A(1:25) and A(41:55) are allocated to pe0, A(26:40) and A(56:70) are allocated to pe1, and A(71:85) and A(86:100) are allocated to pe2 and pe3, respectively, as shown in FIG. 10D. This improves data locality in the kernel loop and can consequently increase the parallel program processing speed.
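The effect of placing a first touch control code ahead of the initialization loop can be sketched by simulating first touch with the kernel-loop access pattern executed first. This is an illustrative model only, not the generated code itself: the helper names are hypothetical, loop iterations are assumed to be split into contiguous equal blocks, the page size of five elements is taken from the earlier example, and ties within one loop are assumed to go to the lowest-numbered processor.

```python
def block_ranges(lo, hi, nprocs):
    """Split the inclusive range lo..hi into nprocs contiguous blocks."""
    size = -(-(hi - lo + 1) // nprocs)  # ceiling division
    return [(lo + p * size, min(lo + (p + 1) * size - 1, hi))
            for p in range(nprocs)]


def first_touch_owners(loops, page_size, nprocs, n_elems):
    """Return a list whose entry e-1 is the processor owning element e."""
    page_owner = {}
    for lo, hi in loops:
        for proc, (first, last) in enumerate(block_ranges(lo, hi, nprocs)):
            for elem in range(first, last + 1):
                page_owner.setdefault((elem - 1) // page_size, proc)
    return [page_owner[(e - 1) // page_size] for e in range(1, n_elems + 1)]


def as_ranges(owners, proc):
    """Compress the elements owned by proc into inclusive (lo, hi) ranges."""
    ranges = []
    for elem, owner in enumerate(owners, start=1):
        if owner != proc:
            continue
        if ranges and ranges[-1][1] == elem - 1:
            ranges[-1] = (ranges[-1][0], elem)
        else:
            ranges.append((elem, elem))
    return ranges


# Dummy loop with the kernel access pattern A(41:100) runs first,
# followed by the original initialization loop over A(1:100).
owners = first_touch_owners([(41, 100), (1, 100)],
                            page_size=5, nprocs=4, n_elems=100)
for proc in range(4):
    print(f"pe{proc}:", as_ranges(owners, proc))
```

Under these assumptions the simulation reproduces the allocation stated for FIG. 10D, and every element a processor refers to in the kernel loop resides in its own memory.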
Additionally, in the parallel program generating method of the present invention, it is also possible that profile information, compiler static analysis information, or user indication information is obtained to generate a first touch control code such that a parallel program is generated by placing the code, for example, at a first position of execution statements.
Moreover, in the parallel program generating method of the present invention, it is also possible that profile information, compiler static analysis information, or user indication information is obtained to produce page allocation information, and a parallel program is generated in which the page allocation information is inserted.
First, description will be given of terms used in the following embodiments and a correspondence thereof to drawings.
① A paralleling compiler (2 of FIG. 1) is a compiler which receives as an input thereto a sequential execution source program (1 of FIG. 1) described in a high level language and produces as an output therefrom a parallel program (3 of FIG. 3) for parallel execution.
② A program top version first touch control method is a method in which a dummy loop to reproduce a data access pattern of the kernel loop is placed, for example, at a first position of execution statements of the main program to control first touch data distribution (reference is to be made to FIG. 2; first embodiment).
③ A loop front version first touch control method is a method in which a dummy loop which copies, while producing a data access pattern of the kernel loop, data of a data distribution objective array onto a clone array having an array form of the data distribution objective array is placed immediately before the kernel loop to thereby control first touch data distribution (reference is to be made to FIG. 3; second embodiment).
④ A profile information version first touch control method is a method wherein a dummy loop which causes a processor, according to profile information, to refer to a page most frequently referred to by the processor is placed at a first position of execution statements of the main program to thereby control first touch data distribution (reference is to be made to FIG. 3; third embodiment). In this regard, profile information includes various information obtained by once executing, for example, a parallel program generated in a method of the background art and indicates the number of accesses of each processor to each page for reference.
⑤ A static analysis information version first touch control method is a method wherein a compiler generates a dummy loop which causes, according to static analysis information of the compiler, a processor to refer to a page including array elements to be allocated to the processor. The dummy loop is placed at a first position of execution statements of, for example, the main program to thereby control first touch data distribution (reference is to be made to FIG. 5; fourth embodiment). In this connection, static analysis information is analysis information which the compiler can automatically analyze.
⑥ A user indication information version first touch control method is a method wherein a dummy loop which causes, according to user indication information, a processor to refer to array elements of a page to be allocated to the processor is placed at a first position of, for example, execution statements to thereby control first touch data distribution (reference is to be made to FIG. 5; fifth embodiment). Incidentally, user indication information is information which is indicated by a user, e.g., a programmer having generated a sequential source program, the information not being analyzed by the compiler itself. This method may have a flow substantially equal to the flow of the static analysis information version first touch control method, and the processing is conducted by referring to an array reference range table or the like indicated by the user in place of the static analysis information.
⑦ A profile information version data distribution control method is a method in which for each page, information of a processor which most frequently refers to the page is obtained from profile information and is then inserted into an object code to thereby cause an operating system to optimally distribute data (reference is to be made to FIG. 6; sixth embodiment). In this method, the object code is inserted in a lower section of the program code such that the operating system (OS) allocates the data according to the object code.
⑧ A static analysis information version data distribution control method is a method in which information of pages to be allocated to each processor is obtained from the static analysis information of the compiler and is inserted into an object code to thereby cause an operating system to optimally distribute data (reference is to be made to FIG. 7; seventh embodiment). This method is different only in that the information is inserted into the object code using the static analysis information in place of the profile information.
⑨ A user indication information version data distribution control method is a method in which information of pages to be allocated to each processor is obtained from information indicated by a user and is then inserted into an object code to thereby cause an operating system to optimally distribute data (reference is to be made to FIG. 7; eighth embodiment). This method is different only in that the information is inserted into the object code using the user indication information in place of the static analysis information.
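The profile-based selection used by the profile information version methods, i.e., choosing for each page the processor which most frequently refers to it, might be sketched as follows. The table layout, the function name `choose_page_owners`, and the access counts are all assumptions made for illustration; the counts are modeled on the earlier example (one initialization pass, 10000 kernel passes, five elements per page) and are not taken from any actual profile.

```python
def choose_page_owners(access_counts):
    """Map each page to the processor that refers to it most often.

    access_counts[page][proc] is the number of references processor
    proc made to the page during the profiling run. On a tie, the
    lowest-numbered processor is chosen (an arbitrary assumption).
    """
    owners = {}
    for page, counts in access_counts.items():
        owners[page] = max(range(len(counts)), key=lambda proc: counts[proc])
    return owners


# Hypothetical profile for three pages of array A on four processors.
profile = {
    0: [5, 0, 0, 0],        # A(1:5): referred to only by pe0 in init
    8: [50000, 5, 0, 0],    # A(41:45): init by pe1, kernel by pe0
    15: [0, 0, 50000, 5],   # A(76:80): init by pe3, kernel by pe2
}

page_allocation = choose_page_owners(profile)
print(page_allocation)
```

A table of this kind corresponds to the page allocation information that, according to the text, is either reproduced by a dummy loop or inserted into the object code for the operating system to act on.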