The present invention generally relates to software adapted to run on a distributed-memory type parallel computer system. More particularly, this invention is concerned with a parallelization translator tool (such as a parallelizing compiler, parallelizing translator, parallelization supporting tool and the like) for translating a program written for execution by a serial processing computer system into a program suitable for execution by a distributed-memory parallel computer.
There is known a computer system constituted of a plurality of processing elements in which each of the processing elements (hereinafter referred to as PE for short) is provided with a memory. This type of computer system is known as a distributed-memory type parallel computer. The memory provided in association with each of the PEs is termed the local memory.
For executing processing in the distributed-memory type parallel computer system, the data used is distributed among the local memories of the individual PEs, and processing such as arithmetic operations is executed by the PEs in parallel with one another. In that case, when there arises a need to refer to data allocated to the local memory of another PE, the data has to be transferred to the requesting PE through interprocessor (inter-PE) communication.
As a processing scheme for a large-scale array appearing, for example, in scientific data processing or the like, there is proposed a system in which the individual elements of the array are distributed among the PEs and the processing is performed in parallel on an element-by-element basis. In the following, the processing procedure for the array elements will be described on the presumption that an array element allocation method has already been given.
By way of example, consider a processing described in a serial processing FORTRAN program as follows: ##EQU1## It is assumed that this processing is to be executed by a parallel computer. Further, it is assumed that the elements of the arrays A, B and C are serially allocated one by one to the local memories of the individual PEs, respectively, as illustrated in FIG. 5 of the accompanying drawings.
In the parallel computer, this loop is executed in a manner described below. For each value of "I", the value of C(I+2) is transferred from the PE having C(I+2) (hereinafter referred to as PE(2)) to the PE having A(I) and B(I) (referred to as PE(0)), whereupon the PE(0) executes an assignment statement by using the value of C(I+2) transferred thereto. These processings are executed in parallel by the PEs for the respective values of "I".
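The execution just described can be illustrated with a small serial simulation. The following is only a sketch under stated assumptions: the loop body is taken to have the form A(I) = B(I) + C(I+2), which the description implies but the listing itself does not confirm; the mapping of element k to PE (k - 1) mod P is an assumed reading of the one-by-one allocation of FIG. 5; and the names `owner` and `run_loop` are hypothetical.

```python
def owner(k, num_pes):
    # Assumed mapping: element k (1-based) lives on PE (k - 1) mod num_pes.
    return (k - 1) % num_pes


def run_loop(b, c, n, num_pes):
    """Simulate A(I) = B(I) + C(I+2) for I = 1..n (assumed loop body).

    Returns the list A(1..n) and the number of inter-PE transfers.
    """
    # One dictionary per PE stands in for that PE's local memory.
    local = [dict() for _ in range(num_pes)]
    for k, value in enumerate(b, start=1):
        local[owner(k, num_pes)][("B", k)] = value
    for k, value in enumerate(c, start=1):
        local[owner(k, num_pes)][("C", k)] = value

    transfers = 0
    for i in range(1, n + 1):
        dst = owner(i, num_pes)       # PE holding A(I) and B(I)
        src = owner(i + 2, num_pes)   # PE holding C(I+2)
        value = local[src][("C", i + 2)]
        if src != dst:
            transfers += 1            # C(I+2) crosses from src to dst
        local[dst][("A", i)] = local[dst][("B", i)] + value

    a = [local[owner(i, num_pes)][("A", i)] for i in range(1, n + 1)]
    return a, transfers


a, transfers = run_loop([1, 2, 3, 4], [10, 20, 30, 40, 50, 60], n=4, num_pes=8)
```

In a real parallel machine the iterations would run concurrently on the PEs; the serial loop here only makes the ownership of each operand and the resulting transfers visible.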
The processing described above is executed in accordance with a program oriented for the parallel computer. For creating an equivalent program for the parallel computer based on a program described for serial processing, a parallelizing compiler, a parallelization supporting tool or the like is employed. The parallelizing compiler identifies those portions of a serial processing program which can be parallelized, allocates the portions to the individual PEs, inserts inter-PE communication and so forth, and thereby outputs a parallel processing program. On the other hand, the parallelization supporting tool is designed to aid the operator in manually creating the parallel program by analyzing the program and presenting various information in case the optimum parallelization cannot be realized with the parallelizing compiler alone.
When two or more data allocated to other PEs are to be referred to during execution of an expression, a part of the operations appearing in the expression may have to be executed by another PE. By way of example, consider a method of executing by a parallel computer a processing mentioned below: ##EQU2## It is assumed that the elements of the arrays A, B, C and D are allocated serially one by one to the local memories of the individual PEs, as in the case of the example shown in FIG. 5. In that case, A(I) and B(I) are resident on the same PE, while C(I+2) and D(I+2) are also on the same PE, regardless of the value which "I" assumes.
There are conceivably two methods for executing the loop by a parallel computer system. According to a first method, the values of C(I+2) and D(I+2) are transferred from the PE(2) to the PE(0), whereupon the assignment statement is executed by the PE(0) by using the values of C(I+2) and D(I+2) transferred to the PE(0). On the other hand, according to a second method, the value of the partial expression C(I+2)×D(I+2) is determined by the PE(2) for each "I" and then transferred to the PE(0), which then executes the assignment statement by using the value as transferred thereto. In other words, the first method features the allocation of all the operations appearing in the expression to the PE(0), while the second method features the allocation of the multiplication appearing in the expression to the PE(2), with the addition being allocated to the PE(0).
In the case of the first method, the number of data to be transferred is two for each "I". In contrast, in the second method, the number of data to be transferred is one. In this connection, it is noted that in the case of the distributed-memory type parallel computer, the number of inter-PE data transfers should be as small as possible because an inter-PE data transfer takes much more time than a reference to data within the PE. Accordingly, it can be said that the second method is superior to the first in this respect.
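The two transfer counts can be checked mechanically. The sketch below assumes the expression B(I)+C(I+2)×D(I+2) with B(I) on PE(0) and C(I+2), D(I+2) on PE(2), and counts one transfer for every operand that resides on a PE other than the one executing the operation. The tuple encoding of the tree and the name `count_transfers` are hypothetical, and the addition is assumed to run on PE(0), where A(I) resides, so that storing the result costs nothing.

```python
def count_transfers(node, assigned):
    """Return (pe, transfers) for the subtree rooted at `node`.

    A leaf is ("leaf", name, pe); an operation is ("op", symbol, left, right).
    `assigned` maps each operator symbol to the PE that executes it
    (symbols are assumed to be unique within the expression).
    """
    if node[0] == "leaf":
        return node[2], 0
    _, symbol, *children = node
    here = assigned[symbol]
    total = 0
    for child in children:
        child_pe, child_cost = count_transfers(child, assigned)
        # One transfer whenever an operand is produced on a different PE.
        total += child_cost + (1 if child_pe != here else 0)
    return here, total


EXPR = ("op", "+",
        ("leaf", "B(I)", 0),
        ("op", "*", ("leaf", "C(I+2)", 2), ("leaf", "D(I+2)", 2)))

# First method: every operation on PE(0).
first_method = count_transfers(EXPR, {"+": 0, "*": 0})[1]
# Second method: multiplication on PE(2), addition on PE(0).
second_method = count_transfers(EXPR, {"+": 0, "*": 2})[1]
```

Under these assumptions `first_method` evaluates to 2 and `second_method` to 1, matching the counts given above.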
As will be understood from the abovementioned example, the number of times the data is transferred can be reduced by suitably assigning the PEs to the operations appearing in an expression. A method for optimal PE assignment is discussed in an article by John R. Gilbert and Robert Schreiber, "Optimal Expression Evaluation for Data Parallel Architectures", Journal of Parallel and Distributed Computing, Vol. 13, No. 1, pp. 58-64, September, 1991. According to this known method, an expression is represented in the form of a tree having leaves (leaf nodes) representing data and interior nodes representing operators. At first, the tree is traversed in a bottom-up fashion to attach to each of the interior nodes a set of candidate PEs for the operation represented by that interior node. Subsequently, the tree is traversed in a top-down fashion to determine the PE which is to be assigned to the operation represented by each interior node.
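The two-pass structure can be sketched as follows. This is a simplified illustration for binary operators with a uniform transfer cost, not a transcription of the algorithm in the cited article: the candidate rule used here (intersection of the children's candidate sets if non-empty, otherwise their union) and the names `Node`, `bottom_up`, `top_down` and `assign` are assumptions made for the example.

```python
class Node:
    def __init__(self, label, children=(), pe=None):
        self.label = label
        self.children = list(children)
        self.pe = pe              # fixed for leaves, chosen for operators
        self.candidates = set()


def bottom_up(node):
    """First pass: attach a set of candidate PEs to every node."""
    if not node.children:
        node.candidates = {node.pe}
        return node.candidates
    left, right = (bottom_up(child) for child in node.children)
    common = left & right
    # Assumed rule: prefer PEs acceptable to both operands, else either.
    node.candidates = common if common else left | right
    return node.candidates


def top_down(node, parent_pe):
    """Second pass: fix one PE per operator, following the parent if possible."""
    if node.children:
        if parent_pe in node.candidates:
            node.pe = parent_pe
        else:
            node.pe = min(node.candidates)   # arbitrary deterministic choice
        for child in node.children:
            top_down(child, node.pe)


def assign(root, result_pe):
    """Run both passes; `result_pe` is the PE holding the assignment target."""
    bottom_up(root)
    top_down(root, result_pe)
    return root


mul = Node("*", [Node("C(I+2)", pe=2), Node("D(I+2)", pe=2)])
root = assign(Node("+", [Node("B(I)", pe=0), mul]), result_pe=0)
```

For the expression B(I)+C(I+2)×D(I+2) with the result stored on PE(0), this sketch places the addition on PE(0) and the multiplication on PE(2), which corresponds to the second method described above.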
In the case of the system disclosed in the abovementioned article, however, operations in the expression are limited to those having two operands, such as the four arithmetic operations. Thus, although the method disclosed in the above article can be applied to such an expression as given by
B(I)+C(I+2)×D(I+2)
it cannot be applied to an expression which contains a function FUN having three parameters, such as the following:
B(I)+FUN(C(I+2), D(I+2), E(I+1))
As an example of the function having three or more parameters in an expression, there may be mentioned a function for determining solutions of a quadratic equation by using three coefficients as parameters.
Further, in the abovementioned article, discussion is also made of a method of determining the PE assignment in the case where the order of the operations can be altered by making use of the commutative law of the four operations. Unfortunately, this method cannot always ensure the optimum assignment.
It is further noted that in the abovementioned article, discussion is also directed to such assignment of the PEs to the operation that the cost involved in the inter-PE data transfers, which is not constant but variably dependent on the transfer source PE and the transfer destination PE, can be reduced to a minimum. In that case, however, the cost is required to meet the condition of "robustness".