1. Field of the Invention
The present invention relates to a method of executing a program at high speed on a distributed-memory parallel processor, and more specifically to a data updating method using an overlap area and to a program converting device for converting a data update program.
2. Description of the Related Art
Recently, parallel processors have attracted attention as a way of realizing a high-speed computer, such as a supercomputer, in the form of a plurality of processing elements (hereinafter referred to as PEs or processors) connected through a network. In realizing high processing performance with such a parallel processor, it is important to reduce the overhead of data communications to the lowest possible level. One effective method of reducing the overhead of data communications between processors is to use an overlap area for specific applications.
The time required for data communications depends on the number of packet transmissions rather than on the total volume of data. Therefore, combining communications and vectorizing messages (S. Hiranandani, K. Kennedy, and C. Tseng, "Compiler Optimizations for Fortran D on MIMD Distributed-Memory Machines," in Proc. Supercomputing '91, pp. 86-100, November 1991) are important in reducing the communications overhead. An overlap area is a special type of buffer area for receiving vector data, and is assigned such that it surrounds the local data area (local area) that a processor uses in its own computation. The data values of the overlap area are determined by the adjacent processors.
FIG. 1 shows the program code (Jacobi code) of the Jacobi relaxation written in High Performance Fortran (HPF). In the Jacobi code shown in FIG. 1, the value of element a(i, j) of the two-dimensional array a is updated using the values of the four adjacent elements a(i, j+1), a(i, j-1), a(i+1, j), and a(i-1, j). The size of the array a is specified by a(256, 256). The elements where i=1 or 256, or j=1 or 256, are not updated. The notation a(2:255) in the DO loop that updates the data is an array description of Fortran 90, and the DO loop is repeated t times. This code is a typical example of updating data using an overlap area.
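As an illustration only, the update pattern described above can be sketched in Python. FIG. 1 is not reproduced here, so the averaging rule, the function names jacobi_step and jacobi, and the 0-based indexing are assumptions of this sketch rather than the HPF code of the figure.

```python
def jacobi_step(a):
    """Return a new grid after one relaxation sweep.

    Assumption: each interior element becomes the average of its four
    neighbours. Boundary rows and columns are copied unchanged, matching
    the statement that elements with i or j equal to 1 or 256 are not updated.
    """
    rows, cols = len(a), len(a[0])
    new = [row[:] for row in a]  # copy so all reads see the old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = (a[i][j + 1] + a[i][j - 1]
                         + a[i + 1][j] + a[i - 1][j]) / 4.0
    return new


def jacobi(a, t):
    """Apply t relaxation sweeps (the outer DO loop)."""
    for _ in range(t):
        a = jacobi_step(a)
    return a
```

Because every interior element reads its four neighbours, a processor that owns only a block of the array needs one extra row or column of data from each adjacent processor, which is the role of the overlap area.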
FIG. 2 shows an example of the overlap area used when the Jacobi code shown in FIG. 1 is executed. According to the data distribution specified in the program shown in FIG. 1, the elements of the array a(256, 256) are distributed into the local areas of 16 processors p(x, y) (x=1, 2, 3, 4, y=1, 2, 3, 4) and stored therein. For example, the processor p(2, 2) controls the range a(65:128, 65:128) of the array a. In FIG. 2, the shaded portion around the local area of the processor p(2, 2) indicates the overlap area at p(2, 2).
The processor p(2, 2) holds the enlarged area a(64:129, 64:129), which includes the overlap area, so that, when a(i, j) is calculated, the adjacent elements a(i, j+1), a(i, j-1), a(i+1, j), and a(i-1, j) can be accessed locally.
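The index ranges quoted above follow from a block distribution of 256 elements over 4 processors per dimension. A minimal sketch, assuming 1-based Fortran-style indices, an overlap width of 1, and clamping at the global array boundary (the function names are hypothetical):

```python
def local_range(p, n=256, procs=4):
    """Inclusive 1-based range owned by processor p (1..procs) in one dimension."""
    block = n // procs
    return (p - 1) * block + 1, p * block


def range_with_overlap(p, n=256, procs=4, width=1):
    """Owned range widened by the overlap area on both sides.

    Assumption: the widened range is clamped to 1..n, since edge
    processors have no neighbour beyond the array boundary.
    """
    lo, hi = local_range(p, n, procs)
    return max(1, lo - width), min(n, hi + width)


print(local_range(2))         # (65, 128)
print(range_with_overlap(2))  # (64, 129)
```

Applying the same computation in both dimensions gives a(65:128, 65:128) as the local area of p(2, 2) and a(64:129, 64:129) as the area including the overlap.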
Without an overlap area, data must be read from adjacent processors inside the DO loop, so that small volumes of data are communicated frequently, resulting in a large communications overhead. With an overlap area, however, the latest data can be copied to the overlap area by a collective transfer before the updating process. Therefore, the data can be updated locally and the communications overhead can be considerably reduced.
The overlap area can be explicitly specified in VPP Fortran ("Realization and Evaluation of VPP Fortran Process System for AP1000" Vol. 93-HPC-48-2, pp. 9-16, August 1993 published at SWOPP Tomonoura '93 HPC Conference by Tatsuya Sindoh, Hidetoshi Iwashita, Doi, and Jun-ichi Ogiwara). Some compilers also generate an overlap area automatically as a form of optimization.
The data transmission patterns for performing parallel processes can be classified into two types. One is a single direction data transfer (SDDT), and the other is a bi-directional data transfer (BDDT). FIG. 3 shows an example of the SDDT, and FIG. 4 shows an example of the BDDT.
In FIGS. 3 and 4, processors i-1, i, i+1, and i+2 are arranged in a specified dimension and form a processor array. The SDDT is a transfer method in which all transfer data are transferred in a single direction, from the processor i toward the processor i+1, in the specified dimension. The BDDT is a transfer method in which data are transferred between adjacent processors in both directions. Thus, some pieces of data are transmitted from the processor i to the processor i+1 while other pieces of data are transmitted from the processor i+1 to the processor i.
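The difference between the two patterns can be illustrated by listing the (sender, receiver) pairs for one transfer step. This is a sketch under assumptions: the processors are numbered from 0, the array is treated as linear without wraparound, and the function names are hypothetical.

```python
def sddt_messages(num_pe):
    """Single direction: every processor i sends only toward processor i+1."""
    return [(i, i + 1) for i in range(num_pe - 1)]


def bddt_messages(num_pe):
    """Bi-directional: each adjacent pair exchanges data in both directions."""
    msgs = []
    for i in range(num_pe - 1):
        msgs.append((i, i + 1))
        msgs.append((i + 1, i))
    return msgs


print(sddt_messages(4))  # [(0, 1), (1, 2), (2, 3)]
print(bddt_messages(4))  # [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
```

In this simplified count, the BDDT issues twice as many messages per step, and each processor must both send to and receive from the same neighbour.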
FIG. 5 shows the program code of the Jacobi relaxation for a one-dimensional array. In the Jacobi code shown in FIG. 5, the value of the element a(i) of the one-dimensional array a is updated with the output of a function f whose inputs are the two adjacent elements a(i-1) and a(i+1). The size of the array a is specified by a(28), and a(1) and a(28) are not updated. The update is repeated the number of times specified by time.
FIG. 6 shows an example in which data are updated using the conventional overlap area when the program shown in FIG. 5 is executed. In FIG. 6, PE0, PE1, PE2, and PE3 are the four PEs that divide and manage the array a. Each PE has an area for storing 9 array elements. A dirty overlap area stores old data, and a clean overlap area stores the latest data, identical to the data held by the adjacent PE. A local area stores the data to be processed by each PE.
The word "INIT" indicates an initial state and "Update" indicates the data communications between adjacent PEs to update the overlap area. Iter 1, 2, 3, and 4 indicate parallel processes for the update of data at each iteration of the DO loop. In FIG. 6, the overlap area is updated by the BDDT for each iteration.
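The conventional scheme of FIG. 6 can be sketched as a sequential simulation of the four PEs. The following is an illustration under assumptions: f is taken to be the average of its two inputs (the source leaves f unspecified), every update reads the values of the previous iteration, each PE buffer holds 7 local elements plus one overlap element on each side (9 in total), and all names are hypothetical.

```python
def f(left, right):
    # Assumed update function; the source does not define f.
    return (left + right) / 2.0


def sequential(a, time):
    """Reference result computed on the undivided array."""
    a = list(a)
    for _ in range(time):
        old = list(a)
        for i in range(1, len(a) - 1):  # first and last elements stay fixed
            a[i] = f(old[i - 1], old[i + 1])
    return a


def parallel_bddt(a, time, num_pe=4):
    """Simulate the PEs, refreshing overlap areas by BDDT every iteration."""
    n = len(a)
    block = n // num_pe
    # Buffer layout per PE: [left overlap] + local area + [right overlap].
    bufs = [[0.0] + list(a[p * block:(p + 1) * block]) + [0.0]
            for p in range(num_pe)]
    for _ in range(time):
        # "Update": copy neighbours' boundary elements into the overlap areas.
        for p in range(num_pe):
            if p > 0:
                bufs[p][0] = bufs[p - 1][block]      # from the left neighbour
            if p < num_pe - 1:
                bufs[p][block + 1] = bufs[p + 1][1]  # from the right neighbour
        # "Iter": each PE updates its local area using local accesses only.
        for p in range(num_pe):
            old = list(bufs[p])
            for k in range(1, block + 1):
                g = p * block + (k - 1)  # global 0-based index
                if g == 0 or g == n - 1:
                    continue             # global boundary elements are fixed
                bufs[p][k] = f(old[k - 1], old[k + 1])
    return [x for p in range(num_pe) for x in bufs[p][1:block + 1]]
```

For a 28-element array, parallel_bddt(a, 4) returns the same values as sequential(a, 4), at the cost of one bi-directional exchange between every pair of adjacent PEs in each iteration.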
However, the data update method using the conventional overlap area has the following problems.
Each processor forming part of the parallel processor must bring the data in its overlap area up to date before making a calculation that uses those data. The update is performed by reading the latest values from the adjacent processor through communications between processors. In parallel processors, the startup (rise) time of each communication carries a heavy overhead. Therefore, the time required for the communications process depends on the number of data transfers rather than on the amount of transferred data. If an overlap area is updated each time a calculation is made using the overlap area, this startup overhead is incurred on every update.
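The dependence on the number of transfers can be made concrete with a simple linear cost model. This is only an illustrative sketch: the model itself, the function name comm_time, and the constants (a startup cost of 100 time units per message and 1 time unit per element) are assumptions, not measurements of any machine.

```python
def comm_time(num_messages, elems_per_message, startup=100.0, per_elem=1.0):
    """Total time for num_messages transfers of elems_per_message elements each.

    Each message pays a fixed startup (rise time) cost plus a cost
    proportional to its length. The constants are hypothetical.
    """
    return num_messages * (startup + per_elem * elems_per_message)


# Sending 64 boundary elements one at a time versus as a single vector:
print(comm_time(64, 1))  # 6464.0
print(comm_time(1, 64))  # 164.0
```

Under these assumed constants, the same 64 elements cost about 40 times more when sent individually, because the startup term is paid 64 times instead of once.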
In a parallel processor connected through a torus network, such as the AP1000 ("An Architecture of Highly Parallel Computer AP1000," by H. Ishihata, T. Horie, T. Shimizu, and S. Kato, in Proc. IEEE Pacific Rim Conf. on Communications, Computers, and Signal Processing, pp. 13-16, May 1991), the SDDT is superior to the BDDT because it requires less data transfer time and less overhead for the synchronization process between adjacent processors. However, in the conventional data update process shown in FIG. 6, the data in the overlap areas must be exchanged between adjacent processors, so the data transfer pattern is based on the BDDT. In the BDDT, each processor must communicate in synchronism with its adjacent processors. As a result, the data transfer time increases and the overhead of the synchronization processes becomes heavier than in the SDDT.