1. Field of the Invention
The present invention relates to a method of executing a program at high speed on a distributed-memory parallel processor, and more specifically to a data updating method using an overlap area and to a program converting device for converting a data update program.
2. Description of the Related Art
Recently, parallel processors have drawn attention as a means of realizing a high-speed processor such as a supercomputer in the form of a plurality of processing elements (hereinafter referred to as PEs or processors) connected through a network. In realizing high processing performance with such a parallel processor, it is important to reduce the overhead of data communications to the lowest possible level. One effective method of reducing the overhead of data communications between processors is to use an overlap area for specific applications.
The time required for data communications depends on the number of packet communications rather than on the total volume of data. Therefore, integrating the communications and vectorizing the messages (S. Hiranandani, K. Kennedy, and C. Tseng, “Compiler Optimizations for Fortran D on MIMD Distributed-Memory Machines,” in Proc. Supercomputing '91, pp. 86-100, Nov. 1991) are important in reducing the communications overhead. An overlap area is a special type of buffer area for receiving vector data, and is assigned such that it surrounds the local data area (local area) used for internal computation. The data values in the overlap area are determined by the adjacent processors.
FIG. 1 shows the program code (Jacobi code) of the Jacobi relaxation written in High Performance Fortran (HPF). In the Jacobi code shown in FIG. 1, the value of each element a(i, j) of the two-dimensional array a is updated using the values of the four adjacent elements a(i, j+1), a(i, j-1), a(i+1, j), and a(i-1, j). The size of the array a is specified by a(256, 256). The elements where i=1, 256 or j=1, 256 are not updated. For example, a(2:255) in the DO loop that updates the data is a Fortran 90 array description, and the DO loop is repeated t times. This code is a typical example of updating data using an overlap area.
FIG. 2 shows an example of the overlap area used when the Jacobi code shown in FIG. 1 is executed. According to the data distribution specified in the program shown in FIG. 1, the elements of the array a(256, 256) are distributed into the local areas of 16 processors p(x, y) (x=1, 2, 3, 4, y=1, 2, 3, 4) and stored therein. For example, the processor p(2, 2) manages the range a(65:128, 65:128) of the array a. In FIG. 2, the shaded portion around the local area of the processor p(2, 2) indicates the overlap area of p(2, 2).
The processor p(2, 2) holds an enlarged area a(64:129, 64:129), which includes the overlap area, so that when a(i, j) is calculated, the adjacent elements a(i, j+1), a(i, j-1), a(i+1, j), and a(i-1, j) can be accessed locally.
Without an overlap area, data must be read from adjacent processors inside the DO loop, and small volumes of data are communicated frequently, resulting in a large communications overhead. With an overlap area, however, the latest data can be copied to the overlap area by transferring it collectively before the updating process. Therefore, data can be updated locally and the communications overhead can be considerably reduced.
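The index ranges quoted above can be reproduced with a short sketch. It assumes an even block distribution along each dimension and an overlap width of one element; the function names `block_range` and `with_overlap` are hypothetical and do not appear in the source.

```python
def block_range(extent, nprocs, p):
    """Return the 1-based inclusive (lo, hi) range owned by processor p (1-based).

    Assumes an even block distribution, i.e. extent is divisible by nprocs.
    """
    size = extent // nprocs
    lo = (p - 1) * size + 1
    return lo, lo + size - 1


def with_overlap(lo, hi, extent, width=1):
    """Extend a local range by `width` elements on each side, clipped to the array."""
    return max(1, lo - width), min(extent, hi + width)


# Processor p(2, 2) of a 4 x 4 processor grid holding a(256, 256).
local = [block_range(256, 4, 2), block_range(256, 4, 2)]
extended = [with_overlap(lo, hi, 256) for lo, hi in local]
print("local area:   ", local)
print("with overlap: ", extended)
```

For processors on the edge of the grid, the clipping in `with_overlap` means that no overlap area is allocated beyond the bounds of the array.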
The overlap area can be explicitly specified in VPP Fortran (“Realization and Evaluation of VPP Fortran Process System for AP1000,” Vol. 93-HPC-48-2, pp. 9-16, Aug. 1993, published at SWOPP Tomonoura '93 HPC Conference by Tatsuya Sindoh, Hidetoshi Iwashita, Doi, and Jun-ichi Ogiwara). A certain compiler automatically generates an overlap area as a form of optimization.
The data transmission patterns for performing parallel processes can be classified into two types. One is single direction data transfer (SDDT), and the other is bi-directional data transfer (BDDT). FIG. 3 shows an example of the SDDT, and FIG. 4 shows an example of the BDDT.
In FIGS. 3 and 4, processors i-1, i, i+1, and i+2 are arranged in a specified dimension and form a processor array. The SDDT is a transfer method in which all transfer data is transferred in a single direction, from the processor i toward the processor i+1, in the specified dimension. The BDDT is a transfer method in which data is transferred between adjacent processors in both directions. Thus, some pieces of data are transmitted from the processor i to the processor i+1 while other pieces of data are transmitted from the processor i+1 to the processor i.
FIG. 5 shows the program code of the Jacobi relaxation for a one-dimensional array. In the Jacobi code shown in FIG. 5, the value of the element a(i) of the one-dimensional array a is updated with the output of a function f whose inputs are the two adjacent elements a(i-1) and a(i+1). The size of the array a is specified by a(28), and a(1) and a(28) are not updated. The data is updated repeatedly for the number of times specified by time.
FIG. 6 shows an example in which data is updated using the conventional overlap area when the program shown in FIG. 5 is executed. In FIG. 6, PE0, PE1, PE2, and PE3 are four PEs that divide and manage the array a. Each PE has an area for storing 9 array elements. A dirty overlap area stores old data, and a clean overlap area stores the same latest data as the adjacent PE holds. A local area stores the data to be processed by each PE.
The word “INIT” indicates the initial state, and “Update” indicates the data communications between adjacent PEs that update the overlap area. Iter 1, 2, 3, and 4 indicate the parallel processes that update the data at each iteration of the DO loop. In FIG. 6, the overlap area is updated by the BDDT at each iteration.
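The conventional scheme described for FIG. 6 can be simulated in a few lines of Python. This is only a sketch under stated assumptions: the function f is not defined in the text, so a simple average is used as a stand-in; the names `serial_jacobi` and `overlap_jacobi` are hypothetical; and indices are 0-based, so a(1) and a(28) correspond to positions 0 and 27.

```python
def f(left, right):
    # Hypothetical update function; the source only calls it f.
    return 0.5 * (left + right)


def serial_jacobi(a, steps):
    """Reference result on a single processor; the two end elements stay fixed."""
    a = list(a)
    for _ in range(steps):
        old = list(a)
        for i in range(1, len(a) - 1):
            a[i] = f(old[i - 1], old[i + 1])
    return a


def overlap_jacobi(a, npes, steps):
    """Each PE holds [left overlap, m local elements, right overlap]."""
    n = len(a)
    m = n // npes  # local elements per PE (assumes n is divisible by npes)
    bufs = [[0.0] + list(a[p * m:(p + 1) * m]) + [0.0] for p in range(npes)]
    transfers = 0
    for _ in range(steps):
        # "Update": bi-directional exchange between adjacent PEs (BDDT).
        for p in range(npes - 1):
            bufs[p][m + 1] = bufs[p + 1][1]  # PE p+1 -> PE p
            bufs[p + 1][0] = bufs[p][m]      # PE p -> PE p+1
            transfers += 2
        # "Iter": purely local computation using the clean overlap cells.
        for p in range(npes):
            old = list(bufs[p])
            lo = 2 if p == 0 else 1              # first global element is fixed
            hi = m - 1 if p == npes - 1 else m   # last global element is fixed
            for k in range(lo, hi + 1):
                bufs[p][k] = f(old[k - 1], old[k + 1])
    result = [x for b in bufs for x in b[1:m + 1]]
    return result, transfers


a0 = [float(i * i) for i in range(28)]
result, transfers = overlap_jacobi(a0, npes=4, steps=4)
print(result == serial_jacobi(a0, 4), transfers)
```

With 28 elements and four PEs, each buffer holds 7 local elements plus 2 overlap cells, which matches the 9-element area mentioned above. The transfer counter makes visible that one pair of messages per PE boundary is sent at every iteration.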
However, the data update method using the conventional overlap area has the following problems.
Each processor forming part of the parallel processor must update the data in the overlap area to the latest values before making a calculation that uses the data in the overlap area. The update is performed by reading the latest values from the adjacent processor through inter-processor communications. In parallel processors, the startup (rise) time of each communication carries a heavy overhead. Therefore, the time required for the communications process depends on the number of data transfers rather than on the amount of transferred data. If an overlap area is updated each time a calculation is made using the overlap area, the startup overhead is incurred for every such communication.
In a parallel processor connected through a torus network such as the AP1000 (“An Architecture of Highly Parallel Computer AP1000,” by H. Ishihata, T. Horie, T. Shimizu, and S. Kato, in Proc. IEEE Pacific Rim Conf. on Communications, Computers, and Signal Processing, pp. 13-16, May 1991), the SDDT is superior to the BDDT because it reduces both the data transfer time and the overhead required for the synchronization process between adjacent processors. However, in the conventional data update process shown in FIG. 6, the data in the overlap areas must be exchanged between adjacent processors, so the data transfer pattern is based on the BDDT. In the BDDT, each processor must perform communications in synchronism with its adjacent processors. As a result, the data transfer time increases and the synchronization overhead becomes heavier than with the SDDT.
3. Summary of the Invention
The present invention aims at updating data with reduced overhead for the communications between PEs in a distributed-memory parallel processor, and at providing a program converting device for generating a data updating program.
The program converting device according to the present invention is provided in an information processing device, and converts an input program into a program for a parallel processor. The program converting device is provided with a detecting unit, a setting unit, a size determining unit, and a communications change unit.
The detecting unit detects a portion of the input program that includes the description of a loop which can be optimized using an overlap area. The setting unit assigns an overlap area to the memory of each PE that processes the description of the loop, generates program code for calculating the data in that area, and adds the code to the initial program. Thus, when the converted program is run on the parallel processor, each PE updates the data in the local area it manages and also updates, in its overlap area, the data managed by other PEs. Because the overlap area is updated by a calculation closed within each PE, no data transfer is required for the update, which improves the efficiency of parallel processing.
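The idea of updating the overlap area by local calculation can be illustrated with the following Python sketch. It rests on several assumptions that are not stated in the text: the overlap area is w elements wide on each side, it is refreshed from the neighbours only once every w iterations, the stand-in function f is a simple average, and the names `serial_jacobi` and `wide_overlap_jacobi` are hypothetical. Indices are 0-based.

```python
def f(left, right):
    # Hypothetical update function; the source only calls it f.
    return 0.5 * (left + right)


def serial_jacobi(a, steps):
    """Reference result on a single processor; the two end elements stay fixed."""
    a = list(a)
    for _ in range(steps):
        old = list(a)
        for i in range(1, len(a) - 1):
            a[i] = f(old[i - 1], old[i + 1])
    return a


def wide_overlap_jacobi(a, npes, steps, w):
    """Each PE holds [w overlap cells, m local elements, w overlap cells]."""
    n = len(a)
    m = n // npes
    assert n % npes == 0 and 1 <= w <= m
    size = m + 2 * w
    bufs = [[0.0] * w + list(a[p * m:(p + 1) * m]) + [0.0] * w
            for p in range(npes)]
    transfers = 0
    done = 0
    while done < steps:
        # Refresh both overlap areas with w elements per message.
        for p in range(npes - 1):
            bufs[p][m + w:] = bufs[p + 1][w:2 * w]  # first w local elements of PE p+1
            bufs[p + 1][:w] = bufs[p][m:m + w]      # last w local elements of PE p
            transfers += 2
        # Up to w iterations without communication: each PE also recomputes
        # overlap cells, and the valid region shrinks by one cell per side.
        for s in range(min(w, steps - done)):
            for p in range(npes):
                old = list(bufs[p])
                for k in range(s + 1, size - 1 - s):
                    g = p * m + k - w  # global index of buffer cell k
                    if 1 <= g <= n - 2:
                        bufs[p][k] = f(old[k - 1], old[k + 1])
            done += 1
    result = [x for b in bufs for x in b[w:w + m]]
    return result, transfers


a0 = [float(i * i) for i in range(28)]
for width in (1, 2, 4):
    result, transfers = wide_overlap_jacobi(a0, npes=4, steps=4, w=width)
    print(width, result == serial_jacobi(a0, 4), transfers)
```

In this sketch a wider overlap area trades redundant calculation in the overlap cells for fewer messages, while the values in each local area remain identical to the single-processor result.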
The size determining unit estimates the runtime for the description of the loop and determines the optimum size of the overlap area. Normally, the larger the overlap area, the fewer times data is transferred, but the longer it takes to update the data in the area. If the size of the overlap area is set so that the estimated runtime is the shortest possible, the data update process can be performed efficiently.
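The trade-off described above can be made concrete with a simple runtime estimate. The cost model below is an assumption for illustration only and is not taken from the source: each message costs a fixed startup time `alpha` plus `beta` per element, each element update costs `gamma`, an overlap area of width w is refreshed once every w iterations, and each PE sends one message in each direction per refresh. All names are hypothetical.

```python
def estimated_runtime(w, m, steps, alpha, beta, gamma):
    """Estimated time for one PE with m local elements and overlap width w."""
    rounds = -(-steps // w)              # ceil(steps / w) refreshes
    comm = 2 * (alpha + beta * w)        # two messages of w elements per refresh
    # Cells computed in one cycle of w iterations: the local area every time,
    # plus an overlap region that shrinks by one cell per side per iteration.
    cells = w * m + w * (w - 1)
    return rounds * (comm + gamma * cells)


def best_width(m, steps, alpha, beta, gamma):
    """Pick the overlap width in 1..m with the smallest estimated runtime."""
    return min(range(1, m + 1),
               key=lambda w: estimated_runtime(w, m, steps, alpha, beta, gamma))


# Expensive message startup favours a wide overlap area ...
print(best_width(m=7, steps=12, alpha=100, beta=1, gamma=1))
# ... while free startup favours the narrowest one.
print(best_width(m=7, steps=12, alpha=0, beta=1, gamma=1))
```

Under this model the optimum width grows with the ratio of startup time to calculation time, which is the behaviour the paragraph above describes qualitatively.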
The communications change unit checks the data dependency in the detected portion of the description of the loop. If the data is dependent bi-directionally, the description is rewritten so that the data is dependent in a single direction, and subscripts are generated in the arrangement optimum for data transfer. Thus, each PE only has to communicate with the adjacent PE corresponding to either the upper limit or the lower limit of the subscripts in the array, thereby reducing the communications overhead.
Conventionally, the overlap area has been updated using data transferred externally. According to the present invention, however, it is updated by a calculation process within each PE, thereby reducing the communications overhead and performing the parallel process at high speed.
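One way to turn a bi-directional dependency into a single-direction one is to shift the position at which each result is stored by one element per iteration. The Python sketch below shows this for the one-dimensional Jacobi update. It is an illustrative assumption, not necessarily the rewrite performed by the communications change unit; the stand-in function f and the names `serial_jacobi` and `skewed_jacobi` are hypothetical, and indices are 0-based.

```python
def f(left, right):
    # Hypothetical update function; the source only calls it f.
    return 0.5 * (left + right)


def serial_jacobi(a, steps):
    """Original form: a[i] depends on both a[i-1] and a[i+1]."""
    a = list(a)
    for _ in range(steps):
        old = list(a)
        for i in range(1, len(a) - 1):
            a[i] = f(old[i - 1], old[i + 1])
    return a


def skewed_jacobi(a, steps):
    """Skewed form: every new value reads only positions at or below its own."""
    n = len(a)
    s = list(a) + [0.0] * steps  # room for the data to drift upward
    for t in range(steps):
        old = list(s)
        # Before this step, original element g is stored at index g + t;
        # after it, the updated element g is stored at index g + t + 1.
        for g in range(n):
            j = g + t + 1
            if g == 0 or g == n - 1:
                s[j] = old[j - 1]             # fixed end element just moves up
            else:
                s[j] = f(old[j - 2], old[j])  # reads only indices <= j
    return s[steps:steps + n]


a0 = [float(i * i) for i in range(28)]
print(skewed_jacobi(a0, 4) == serial_jacobi(a0, 4))
```

Because each stored value now depends only on positions j-2, j-1, and j, a PE holding a block of such an array would need data only from the neighbour on the lower-subscript side.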