1. Field of the Invention
The present invention relates to high speed processing of indefinite loops and to a compilation scheme for realizing efficient program execution in a parallel processing system.
2. Description of the Background Art
First, the basic technique of prefetching will be described, followed by a description of the dual buffer system, and finally the asynchronous transmission in the parallel computer will be mentioned.
Conventionally, in order to speed up the successive execution of usual instructions, many machines employ the so-called instruction prefetch buffer. A general instruction prefetch buffer is designed to hold two to eight continuous instructions. Every time the instructions are executed by the CPU one by one, the subsequent instruction words are prefetched. This is the widely used technique of prefetching in which those instructions which have a high probability of being executed in the near future are loaded in advance, in parallel with the execution of the foregoing instructions, so as to conceal the time required for loading the instructions.
It has also been done conventionally to expand the target of this prefetching from the instructions to the data. In this case, those data which have a high probability of being used in the near future are loaded into the cache memory while the CPU is processing other data, before these data actually become necessary. Here, however, when a write to the data occurs between the loading of the data as the prefetch data and the actual access to the data, the data loaded by the prefetch are invalidated.
In general, it is more difficult to predict which data are going to be necessary in the near future than it is in the case of the instructions. For this reason, there is research directed toward the advance loading of those data which have a high probability of being used. (For example, A. Rogers and K. Li, "Software Support for Speculative Loads", SIGPLAN NOTICES, Vol. 27, No. 9, Sep. 1992.)
As in the case of the successive execution type computer described above, it is also possible to carry out the instruction prefetch in the MIMD type parallel computer. However, in the MIMD type computer, the data prefetch cannot be carried out in a manner similar to that in the successive execution type computer, because in the parallel computer a plurality of processors can change the data in parallel, so that there is a need to know when the data are valid and when the data are accessible. In this regard, Rettberg et al. have expressed their desire to carry out the data prefetch in the parallel computer in R. D. Rettberg, W. R. Crowther, P. P. Carvey, and R. S. Tomlinson, "The Monarch Parallel Processor Hardware Design", IEEE Computer, Vol. 23, No. 4, pp. 18-30, 1990, but they failed to provide any concrete scheme for realizing it. After this paper, their project was abandoned in midstream, so that their research on the data prefetch in the parallel computer was also interrupted before its completion.
In a case of carrying out the instruction prefetch in the MIMD type parallel computer, there is hardly any problem concerning the protocol and the time required for the prefetching, because the instruction is not rewritable and is usually stored in the processor's own local memory. However, in a case of dealing with the data, the values of the data can be changed, and the data can be stored in the other processors, so that the protocol for the prefetching can be quite complicated, and the required time can often be very long. On the other hand, when the prefetching is carried out speculatively, i.e., according to a strategy to prefetch those data which are considered as having a high probability of being used even though they may not be used, the load at a time of the data transmission between the processors becomes high, even to such an extent as to cancel out the effect of the prefetching in some cases.
In this regard, in a limited area of a specialized part such as an I/O driver, there is a technique called the dual buffer system which has a configuration as shown in FIG. 1. In this dual buffer system, two buffers 103A and 103B are provided as a buffer 103 between one input device 101 and one CPU 105. Then, in order to read the data from the input device 101 to the CPU 105, while the input device 101 writes the data into one buffer 103A, the CPU 105 reads the data from the other buffer 103B. In this manner, the conflict of the accesses with respect to the buffer 103 between the input device 101 and the CPU 105 is avoided to realize the efficient data input. This technique is employed in specialized programming such as that for the device driver, but it is not utilized in the general user application program because of the lack of a compiler for generating the codes and carrying out the high level transformation required in this technique.
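As an illustration only, the alternation between the two buffers described above can be sketched in Python as follows. The function name `dual_buffer_read` and its arguments are assumptions made for this sketch and do not appear in FIG. 1. The sketch runs sequentially, so the overlap between the writing by the input device and the reading by the CPU is only indicated by the order of the steps, not actually realized.

```python
def dual_buffer_read(device_blocks, process):
    """Sequential sketch of the dual buffer system (hypothetical helper).

    device_blocks: iterable of data blocks produced by the input device.
    process: function applied by the CPU to each block it reads.
    """
    buffers = [None, None]      # stand-ins for the buffers 103A and 103B
    blocks = iter(device_blocks)
    results = []

    fill = 0
    buffers[fill] = next(blocks, None)      # device fills the first buffer
    while buffers[fill] is not None:
        ready = fill                        # buffer the CPU may now read
        fill = 1 - fill                     # device switches to the other buffer
        # In a real system the next two steps would proceed in parallel:
        buffers[fill] = next(blocks, None)  # device writes into one buffer
        results.append(process(buffers[ready]))  # CPU reads from the other
    return results
```

For example, `dual_buffer_read([[1, 2], [3, 4], [5]], sum)` processes the three blocks in order while the device and the CPU never touch the same buffer in the same step.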
As for the asynchronous transmission in the parallel computer, there is a known scheme for indicating whether the buffer is full or empty by means of a single flag. For example, this scheme is applicable to a configuration shown in FIG. 2 in which two processors 107 and 111 are sharing a shared memory 109 containing the data 109A and the flag 109B. In this case, in order to carry out the data write, it is necessary for each processor to keep polling until this flag 109B indicates "empty", and in order to carry out the data read, it is necessary for each processor to keep polling until this flag 109B indicates "full". Here, in a case of the PAX computer in which this flag can be provided on the shared memory between adjacent processors, the time cost for making an access to the flag poses hardly any problem. However, in a general parallel computer in which the processors are connected through a network such that an access to the flag requires an access through the network, the time cost for making an access to the flag becomes significant, and the polling required in this scheme keeps the network in use continuously, so that the load on the network becomes high and the polling can even interfere with the use of the network by the other processors.
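The single flag scheme described above can be sketched, under stated assumptions, with two Python threads standing in for the two processors. The class `FlaggedCell`, the function `transfer`, and the poll counters are hypothetical names introduced for this sketch only. Each loop iteration corresponds to one access to the flag, which in a network connected machine would be one network access.

```python
import threading
import time


class FlaggedCell:
    """Shared data with a single full/empty flag (illustrative sketch)."""

    def __init__(self):
        self.data = None
        self.full = False       # the single flag: False = "empty"
        self.write_polls = 0    # flag accesses spent waiting to write
        self.read_polls = 0     # flag accesses spent waiting to read

    def write(self, value):
        while self.full:        # poll until the flag indicates "empty"
            self.write_polls += 1
            time.sleep(0)       # yield so the other thread can run
        self.data = value
        self.full = True        # mark "full" only after the data is stored

    def read(self):
        while not self.full:    # poll until the flag indicates "full"
            self.read_polls += 1
            time.sleep(0)
        value = self.data
        self.full = False       # mark "empty" only after the data is taken
        return value


def transfer(values):
    """Send values from a writer thread to the calling thread, one at a time."""
    cell = FlaggedCell()
    writer = threading.Thread(target=lambda: [cell.write(v) for v in values])
    writer.start()
    received = [cell.read() for _ in values]
    writer.join()
    return received
```

The sketch assumes exactly one writer and one reader. The poll counters make visible how many flag accesses are spent merely waiting, which is the cost the text attributes to this scheme.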
Thus, in the conventional parallel computer, the compiler technique and the hardware support for realizing the data prefetch in the application program have been unavailable.
Next, the conventional compiler technique will be described in further detail.
In the usual compiler, it has conventionally been done to carry out the calculation involving only constants at a time of compilation rather than at a time of execution. For example, the sentence:
x=1+2
can be transformed into the sentence:
x=3
by the compiler, before being further transformed into the execution code.
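As a minimal sketch of this compile time calculation, the following Python code folds additions, subtractions, and multiplications whose operands are both numeric constants, using the standard `ast` module. The names `ConstantFolder` and `fold` are assumptions introduced for illustration, and Python syntax is used here in place of the source language of the example.

```python
import ast
import operator

# Operators folded by this sketch; anything else is left untouched.
_FOLDABLE = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


class ConstantFolder(ast.NodeTransformer):
    """Replace 'constant op constant' by its value, bottom-up."""

    def visit_BinOp(self, node):
        self.generic_visit(node)            # fold the operands first
        op = _FOLDABLE.get(type(node.op))
        left, right = node.left, node.right
        if (op is not None
                and isinstance(left, ast.Constant)
                and isinstance(right, ast.Constant)
                and isinstance(left.value, (int, float))
                and isinstance(right.value, (int, float))):
            folded = ast.Constant(value=op(left.value, right.value))
            return ast.copy_location(folded, node)
        return node


def fold(source):
    """Return the source text with constant subexpressions evaluated."""
    tree = ConstantFolder().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)                # requires Python 3.9 or later
```

With this sketch, `fold("x = 1 + 2")` yields the text `x = 3`, while an expression containing a variable, such as `a + 2 * 3`, is folded only in its constant part.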
As a technique to utilize this function more actively, there is a partial calculation scheme. Here, when a part of the inputs to the program is known, the program is partially calculated within the known range, and only the remaining inputs are left as variables, so as to transform the program into a more efficient one.
For example, in a case of the program shown in FIG. 3A, the value of x is known to be 10, so that this program of FIG. 3A can be transformed into that shown in FIG. 3B.
Also, in Uwe Meyer, "Techniques for Partial Evaluation of Imperative Languages", SIGPLAN NOTICES, Vol. 26, No. 9, Sep. 1991, pp. 94-105, an example described in a PASCAL-like language has been disclosed. This example uses a program as shown in FIG. 4A in which N and X are read, the N-th power of X is calculated, and then the result is substituted into Y and written out at a prescribed position. Here, a symbol φ attached to X in this example indicates that the attached variable X is unknown. When the input N is known to be 3, the program of FIG. 4A can be transformed into a program with a higher execution efficiency as shown in FIG. 4B.
In such a manner, before the execution, when the value of the variable is known by some means, it is possible to output the codes with a higher efficiency using the partial calculation scheme. However, this scheme cannot handle the variables which can only be predicted at a time of the execution.
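The effect of the partial calculation on a power computation of this kind can be sketched as follows. This is not a reproduction of FIGS. 4A and 4B; the functions `power` and `specialize_power` and the form of the generated code are assumptions made for illustration. The general routine loops N times, whereas the residual routine generated for a known N contains no loop and no reference to N.

```python
def power(x, n):
    """General program: both x and n are unknown until execution."""
    y = 1
    while n > 0:
        y = y * x
        n = n - 1
    return y


def specialize_power(n):
    """Partially calculate power() for a known n, leaving x as a variable.

    Returns the residual source text and the compiled residual function.
    """
    body = " * ".join(["x"] * n) if n > 0 else "1"   # loop unrolled n times
    name = f"power_{n}"
    source = f"def {name}(x):\n    return {body}\n"
    namespace = {}
    exec(source, namespace)      # stands in for generating execution codes
    return source, namespace[name]
```

For example, `specialize_power(3)` produces a residual function whose body is `x * x * x`, which gives the same result as `power(x, 3)` for any `x`. As the text notes, this is only possible because the value 3 is known before the execution.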
In contrast, even for a case in which the values of the variables are totally unknown until the time of the execution, there is a scheme for carrying out the partial execution by setting the variables to be constants, by carrying out a part of the execution at a time of the compilation, as disclosed in Japanese Patent Application Laid Open No. 4-44181 (1992). In this scheme, the execution codes are generated in the following procedure, and then the generated execution codes are loaded as the execution codes into the computer later on and executed.
1. The pre-execution portions to be executed in advance during the compilation are specified in the program. Here, it is guaranteed that the variables specified at this stage are not going to be re-defined subsequently. (Note, however, that a manner of determining the pre-execution range is not described in the specification of the above identified Japanese Patent Application.)
2. The pre-execution portions are executed during the compilation by the compiler or the interpreter, to set the variables to be constants.
3. The partial calculation is carried out according to the variables set to be constants at the above stage, to generate the execution codes with high efficiency.
As an example of this scheme, for a program shown in FIG. 5A, the pre-execution portion can be specified as that within a range from PREEXEC START to PREEXEC END, such that this pre-execution portion can be executed at a time of the compilation by using the data file, and the values of the inputs M1 and M2 are set to be constants. In this manner, the loop repetition number for the subsequent DO loop can be set to be constant, so that the optimization of the execution codes becomes easier.
Now, consider a case of generating the execution codes which are to be operated on four processors, from the source program shown in FIG. 5A. In this case, the array A is divided into four cyclically, and given a name PA. Namely, the array A(i) corresponds to the PA(i/4) of the processor MOD(i-1, 4)+1. When this program is compiled in the usual manner, it becomes necessary to generate the complicated codes as shown in FIG. 6A. That is, the own processor number MYPE is taken out by the function PYPENUM(), and then the starting point N1 and the ending point N2 of the DO loop are obtained while paying attention to the fractions.
Here, if it can be determined that M1=1 and M2=16 by the execution of the pre-execution portion in the program of FIG. 5B, it is possible to transform the program of FIG. 5A into the program with a higher efficiency as shown in FIG. 6B.
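The calculation of the loop bounds for the cyclic division can be sketched as follows. This is not the code of FIG. 6A or FIG. 6B. The function names are hypothetical, and the local index mapping (i-1)/4+1 with integer division is an assumption adopted for this sketch as one possible reading of the correspondence between A(i) and PA stated above.

```python
def owner(i, nproc=4):
    """Processor holding A(i) under the cyclic division: MOD(i-1, nproc)+1."""
    return (i - 1) % nproc + 1


def local_index(i, nproc=4):
    """Assumed position of A(i) within the local array PA (1-based)."""
    return (i - 1) // nproc + 1


def local_bounds(m1, m2, mype, nproc=4):
    """Local DO loop bounds (N1, N2) on processor mype for i = m1..m2.

    Returns None when processor mype owns no element in the range.
    """
    # First global index >= m1, and last global index <= m2, owned by mype.
    first = m1 + (mype - owner(m1, nproc)) % nproc
    last = m2 - (owner(m2, nproc) - mype) % nproc
    if first > last:
        return None
    return local_index(first, nproc), local_index(last, nproc)
```

When M1 and M2 are unknown, every processor must evaluate these modulo operations at a time of the execution, which corresponds to the attention paid to the fractions. When M1=1 and M2=16 are known in advance, `local_bounds(1, 16, mype)` is (1, 4) for every processor, so the bounds can be emitted as constants.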
However, there are two major problems in this scheme, as follows:
(1) It is difficult to determine the pre-execution range.
(2) When the pre-execution range is widened, the amount of intermediate results becomes enormous.
As for the problem (1), in the example of FIGS. 5A, 5B, 6A, and 6B, the values of M1 and M2 can be determined by one data read, so that the pre-execution range can be determined easily. However, in general, the determination of the pre-execution range is essentially not an easy matter. The general reason why it is difficult to determine the pre-execution range is that the values of the variables can be determined not by just one data read, but by various substitutions of the values. Namely, in order to know a certain value, there is a need to know the values quoted in the equation for defining that certain value. However, when this logic is applied in a chain, there is a possibility for the pre-execution range to be almost as wide as the actual execution range.
As for the problem (2), when the pre-execution range becomes wider for the reason described above, all the variables obtained within the pre-execution range must be embedded into the execution codes. Without the pre-execution, it would only have been necessary to secure a region for such a variable as an unknown region at a time of the execution, so that the object would not be made larger. However, once the value is obtained by the pre-execution, the obtained value must be contained in the object as an ascertained value. Consequently, in a case of making a large constant table by a simple algorithm, the object is going to be made larger.
Now, the program execution in the parallel computer will be considered. Namely, a case of executing the SPMD (Single Program Multiple Data) type program will be considered for the parallel computer connected with a host computer in a configuration shown in FIG. 7. Here, the host computer 213 has a compiler 214 for compiling a source program 215 to obtain execution codes 217, and the parallel computer formed by a plurality of processors 221 (221a to 221n) is connected with the host computer 213 through a host-processor network 219, while the processors 221 are connected with each other through an interconnection network 229 and equipped with respective memory regions 222 (222a to 222n).
Conventionally, the execution of such a program is carried out by the following procedure.
1. At the host computer 213, using the compiler 214 provided therein, the source program 215 is transformed into node programs which are executable at individual processors of the parallel computer.
2. At the host computer 213, using the compiler 214 provided therein, the node programs are compiled to produce the execution codes 217.
3. The execution codes 217 produced at the above stage are loaded into the parallel processors through the host-processor network 219, and stored in the memory regions 222 of the processors 221.
4. The execution codes loaded into the memory region 222 of each processor are then executed at each individual processor.
As a concrete example, there is a case of executing the program for the Jacobi method described in Fortran D by transforming it into the node programs, as can be found in Seema Hiranandani, Ken Kennedy, and Chau-Wen Tseng, "Compiling Fortran D for MIMD distributed-memory machines", CACM, Aug. 1992, Vol. 35, No. 8, pp. 66-80, which will now be described.
First, the original program for the general Jacobi method described in Fortran D is shown in FIG. 8. This program of FIG. 8 is that in which, for each element in the 100×100 two dimensional array A, a substitution of an average of its upper, lower, right, and left neighbor elements is repeated "time" times. Here, the array B is used as a back up of the array A. Then, consider a case of executing this program by the four parallel processors. In this case, for the sake of the parallelism, the array A is to be divided into four as follows:
processor 1: A(*, 1) to A(*, 25)
processor 2: A(*, 26) to A(*, 50)
processor 3: A(*, 51) to A(*, 75)
processor 4: A(*, 76) to A(*, 100)
where * denotes an arbitrary number.
Here, however, for the handling of the array element data at boundary portions, the data regions which are larger by one at each cutting plane of the array are actually secured. For this reason, in the program shown in FIG. 9 which is the node program obtained from the original program of FIG. 8, the array B is expressed as B(100, 0:26) rather than B(100, 25). (Note that, in a case the size of the array is specified by one integer, the lower bound is 1, so that 25 in the latter notation implies the suffix ranging from 1 to 25. On the other hand, in a case of setting the lower bound different from 1, the lower bound is explicitly declared next to the upper bound with a colon inserted therebetween, so that 0:26 in the former notation implies the suffix ranging from 0 to 26.)
In the program of FIG. 9, "Plocal" indicates a processor number ranging from 1 to 4, and "Pleft" and "Pright" indicate the processor numbers (1 to 4) of the processors which have the left and right neighbor elements at the boundaries of the array, respectively. Also, lb1 and ub1 indicate the lower and upper bounds of the operation targets of the array within each processor, respectively. These values are subtly different depending on the position of the processor. Also, the data transmission and reception patterns are going to be different depending on the position of the processor, so that there is a need to set different cases for the data transmission and reception by the "if" sentence. As such, this node program of FIG. 9 involves the case setting according to "Plocal", which did not exist in the original program of FIG. 8. Consequently, there is going to be an execution overhead due to this case setting.
There is also a proposition for a scheme in which the efficient execution can be realized even when a total number of processors acquired at a time of the execution varies, while simply compiling at the host side, as disclosed in Japanese Patent Application Laid Open No. 62-274451 (1987). This is a scheme in which, as shown in FIG. 10, a data ID 300 is regarded as a connected series of a processor number 300A and a processor internal number 300B. In this case, even when a total number of processors changes, the processor number and the processor internal number for data can be extracted from the data ID at a time of the execution.
However, this scheme is associated with the following two serious problems:
(1) A special hardware is required in this scheme. Namely, in order to extract the processor number and the processor internal number from the data ID without significantly lowering the execution speed, a special hardware for this purpose must be provided.
(2) The total number of processors that can be handled in this scheme is limited to the powers of 2. Namely, in the usual binary computer, when the total number of processors is a power of 2, the processor number and the processor internal number can be expressed as a connected series of fields, so that the necessary field can be extracted by using the bit mask. However, in a case where the total number of processors for executing the program is a number other than a power of 2, a divider is going to be required, which in turn makes the scheme impractical.
Thus, there has been a need for a scheme which can be processed at a high speed by software alone, without any constraint on the total number of processors.
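The difference between the power of 2 case and the general case in the data ID scheme can be sketched as follows. The field layout used here, with the processor number in the high-order bits and the processor internal number in the low-order bits, as well as the function names and parameters, are assumptions made for this sketch and are not taken from FIG. 10.

```python
def split_by_mask(data_id, total_bits, proc_bits):
    """Extract both fields when the processor count is 2**proc_bits.

    Only a shift and a bit mask are needed.
    """
    internal_bits = total_bits - proc_bits
    processor_number = data_id >> internal_bits
    internal_number = data_id & ((1 << internal_bits) - 1)
    return processor_number, internal_number


def split_by_division(data_id, ids_per_processor):
    """Extract both fields for an arbitrary processor count.

    A division and a remainder are needed for every access.
    """
    return divmod(data_id, ids_per_processor)
```

For a 4-bit data ID shared among four processors, `split_by_mask(0b1011, 4, 2)` and `split_by_division(11, 4)` give the same pair of numbers. For three processors holding five data each, only the division form applies, for example `split_by_division(11, 5)`, which illustrates why a processor count other than a power of 2 calls for a divider.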
As described, the execution codes which retain, as variables, those variables that could be set to be constants are lower in execution efficiency. As an improvement in this regard, there has been an attempt at generating the execution codes with a higher efficiency by statically making the partial calculations based on the information available at a time of compilation, but such an attempt can achieve only insufficient improvement. Especially in a case of the parallel execution of the program, a large number of control codes are going to be inserted among the source codes due to the parallelism, and these control codes can cause a large overhead. Also, the conventionally available scheme can only utilize the information that can be determined during the compilation, so that the information which is essentially available only at a time of the execution, such as the total number and the configuration of the processors acquired at a time of the execution, could not have been utilized in the compilation conventionally.
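The control codes introduced by the parallelism can be illustrated with the Jacobi example described above, in which each of the four processors holds 25 columns of the array A. The following Python sketch is not the node program of FIG. 9. The function name `node_parameters`, the concrete values chosen for lb1 and ub1, and the use of None for a missing neighbor are assumptions made for illustration, on the premise that the outermost columns 1 and 100 are not updated.

```python
def node_parameters(plocal, nproc=4, n=100):
    """Per-processor values that the node program must decide by case setting."""
    cols = n // nproc                       # 25 columns per processor
    first_col = (plocal - 1) * cols + 1     # global range held by this processor
    last_col = plocal * cols
    # Case setting by processor position (assumed values, see lead-in):
    lb1 = 2 if plocal == 1 else 1           # leftmost processor skips column 1
    ub1 = cols - 1 if plocal == nproc else cols   # rightmost skips column n
    pleft = plocal - 1 if plocal > 1 else None    # no left neighbor at the edge
    pright = plocal + 1 if plocal < nproc else None
    return {
        "columns": (first_col, last_col),
        "lb1": lb1,
        "ub1": ub1,
        "pleft": pleft,
        "pright": pright,
    }
```

Every conditional in this sketch is evaluated identically on each repetition unless the value of Plocal and the total number of processors are fixed beforehand, which is the kind of overhead the text refers to.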