When a parallel application program is executed in a parallel computer system, processing proceeds in parallel while each of a plurality of processes repeats arithmetic processing and interprocess communication processing, mainly using a Message Passing Interface (MPI). This interprocess communication processing is executed between processes within a node of the parallel computer system, and is also executed between processes of different nodes. Because arithmetic processing times differ between processes, the start times of communication processing may also differ between the processes.
For example, as illustrated in FIG. 1, when process P0, process P1, and process P2 execute a parallel application program, process P0 and process P1, which have completed arithmetic processing earlier, each attempt communication processing with process P2. However, since process P2 is still performing arithmetic processing and cannot start communication processing, process P0 and process P1 each wait for the arithmetic processing of process P2 to complete. During this period, process P0 and process P1 execute neither arithmetic processing nor communication processing, so the usage efficiency of the parallel computer system is reduced and parallel performance deteriorates.
As a solution to this problem, the developer of the parallel application program may rewrite code, tune parameters, or the like so as to equalize the arithmetic processing time of each process, so that the start times of communication processing are matched between processes.
As a method for confirming whether or not the start times of communication processing are matched between processes, a confirmation method based on the magnitude of a value called a synchronization waiting time is known. The synchronization waiting time is obtained, for example, as follows:
1. A start time is acquired for each communication processing of each process. This start time can be acquired, for example, as an elapsed time from the execution start of each process;
2. The maximum value of the start times of the plurality of processes is determined for each communication processing; and
3. A difference between the maximum value and the start time of communication processing of each process is determined, differences related to a plurality of pieces of communication processing are accumulated for each process, and the accumulated difference is recorded as the synchronization waiting time.
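Steps 1 to 3 can be summarized in a single expression. The notation here is an assumption introduced only for illustration and does not appear in the original description: $t_{p,c}$ denotes the start time of the $c$-th communication processing in process $p$, $C$ is the number of pieces of communication processing, and $W_p$ is the synchronization waiting time of process $p$.

```latex
W_p = \sum_{c=1}^{C} \left( \max_{q} t_{q,c} - t_{p,c} \right)
```

Each term of the sum is non-negative, and it is zero for the process that starts the $c$-th communication processing last.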
The synchronization waiting time of each process becomes zero when the start times of all communication processing agree among the processes, and becomes close to the elapsed time taken for execution of the parallel application program as the differences in the start times among the processes become larger. Therefore, as a synchronization waiting time becomes closer to zero, the state can be determined to be more desirable.
In first communication processing in FIG. 1, start times of process P0, process P1, and process P2 are 20, 10, and 30 respectively, and the maximum value of the start times is 30, as indicated by an arrow 101. Differences between the maximum value 30 and the start times of process P0, process P1, and process P2 are 10, 20, and 0, respectively.
In second communication processing, start times of process P0, process P1, and process P2 are 60, 70, and 50 respectively, and the maximum value of the start times is 70, as indicated by an arrow 102. Differences between the maximum value 70 and the start times of process P0, process P1, and process P2 are 10, 0, and 20, respectively.
As a result, when the differences related to the first and second communication processing are accumulated, the synchronization waiting times of process P0, process P1, and process P2 are 20, 20, and 20, respectively. In this case, the elapsed time taken for execution of the parallel application program is 80, and for each process 20 out of 80 can be interpreted as time wasted waiting for other processes to complete arithmetic processing.
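The accumulation for FIG. 1 can be checked with a short script. This is a minimal sketch rather than a definitive implementation: the function name `synchronization_waiting_times` and the list layout (one inner list of start times per communication processing, indexed by process) are assumptions made for illustration.

```python
def synchronization_waiting_times(start_times):
    """Accumulate, per process, the gap to the latest start time.

    start_times[c][p] is the start time of the c-th communication
    processing in process p (hypothetical layout for this sketch).
    """
    num_procs = len(start_times[0])
    waits = [0] * num_procs
    for per_comm in start_times:
        latest = max(per_comm)           # maximum start time for this communication
        for p, t in enumerate(per_comm):
            waits[p] += latest - t       # accumulate the difference per process
    return waits


# Start times of P0, P1, P2 for the first and second communication in FIG. 1.
fig1 = [[20, 10, 30], [60, 70, 50]]
waits = synchronization_waiting_times(fig1)
print(waits)  # [20, 20, 20]
```

If every process started each communication processing at the same time, every difference would be zero and the function would return all zeros.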
In the following description, a parallel application program may be referred to as a parallel application.
To determine the synchronization waiting time, the parallel application performance profiling tool of Cray Inc., U.S.A., hooks the collective communication functions of MPI and automatically calls an interprocess synchronization interface (the MPI_Barrier function) before communication processing starts. The tool then determines, for each process, the sum of the elapsed times spent in the MPI_Barrier function.
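The idea of timing a barrier placed before each communication can be illustrated without MPI. The following is only an analogy under stated assumptions: threads stand in for processes, `threading.Barrier` stands in for MPI_Barrier, `time.sleep` stands in for arithmetic processing, and the function name `run_workers` is hypothetical. It does not reproduce the hooking mechanism of the profiling tool itself.

```python
import threading
import time


def run_workers(compute_times):
    """Return the accumulated barrier wait (seconds) for each worker.

    compute_times[rank][r] is the simulated arithmetic time of worker
    `rank` before its r-th communication (hypothetical layout).
    """
    n = len(compute_times)
    rounds = len(compute_times[0])
    barrier = threading.Barrier(n)
    waits = [0.0] * n

    def worker(rank):
        for r in range(rounds):
            time.sleep(compute_times[rank][r])  # stand-in for arithmetic processing
            t0 = time.perf_counter()
            barrier.wait()                      # stand-in for the inserted barrier call
            waits[rank] += time.perf_counter() - t0  # elapsed time inside the barrier

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return waits


if __name__ == "__main__":
    # Worker 2 computes longest, so workers 0 and 1 spend the most time in the barrier.
    print(run_workers([[0.02], [0.01], [0.2]]))
```

In this sketch the worker that arrives last at the barrier records a wait close to zero, which mirrors the zero difference of the latest-starting process in the definition of the synchronization waiting time.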
A reduction operation apparatus configured to carry out a reduction operation that determines the total sum, the maximum value, the minimum value, etc., of data possessed by a plurality of processes is also known (see Patent Document 1).
Patent Document 1: Japanese Laid-open Patent Publication No. 2010-122848