A parallel computer including a plurality of computers performs parallel processing of a program by executing a plurality of processes in parallel. Here, a process is the unit of parallel processing. Each computer includes a plurality of central processing unit (CPU) cores, and one or more processes are executed in parallel in each computer. Communication between processes is performed using the Message Passing Interface (MPI).
Part of each process is executed by a plurality of threads. For example, when a program contains a loop that is repeatedly executed with a loop variable I varied from 1 to 1000 and the number of threads is four, the iterations are divided among the four threads, which process I=1 to 250, I=251 to 500, I=501 to 750, and I=751 to 1000 in parallel. One CPU core is assigned to each thread.
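The division of loop iterations among threads described above can be sketched as follows. This is a minimal illustration only, not an implementation taken from the related art: the names `split_range`, `loop_body_sum`, and `run_parallel_loop` are hypothetical, and the summation stands in for whatever work the loop body actually performs. In standard CPython, threads do not execute CPU-bound work on separate cores, so the sketch shows the partitioning rather than an actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(first, last, num_threads):
    """Split the inclusive range [first, last] into contiguous chunks, one per thread."""
    total = last - first + 1
    base, extra = divmod(total, num_threads)
    chunks = []
    start = first
    for t in range(num_threads):
        # The first `extra` threads take one additional iteration each.
        size = base + (1 if t < extra else 0)
        chunks.append((start, start + size - 1))
        start += size
    return chunks


def loop_body_sum(chunk):
    """Hypothetical loop body: sum the loop variable I over one thread's sub-range."""
    lo, hi = chunk
    return sum(range(lo, hi + 1))


def run_parallel_loop(first=1, last=1000, num_threads=4):
    """Run the loop with one worker thread per chunk and combine the partial results."""
    chunks = split_range(first, last, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(loop_body_sum, chunks))
    return chunks, sum(partials)


if __name__ == "__main__":
    chunks, total = run_parallel_loop()
    print(chunks)
    print(total)
```

With the default arguments, the four chunks correspond to the sub-ranges I=1 to 250, I=251 to 500, I=501 to 750, and I=751 to 1000 given in the example.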
The number of processes and the number of threads may be specified by the user of the parallel computer. However, the number of threads does not exceed the number of CPU cores. In addition, parallel processing performed by a plurality of processes, to each of which one or more threads are assigned, is referred to as hybrid parallel processing.
In order to perform parallel processing efficiently, it is important to equalize the processing times of all the processes. The processing time of a process depends on the number of threads assigned to it. There is known a technique in which the number of threads of each process is dynamically reset based on the processing time of that process, so that the processing times of all the processes are equalized.
There is also known a technique that supports the design of a computer system satisfying a desired performance. In this technique, the performance values of the computer system are estimated by a simulation based on the sequence of component programs executed, the timings at which they are executed, and the performance values of the component programs.
There is also known a technique in which each thread stores a measured performance metric in a memory region corresponding to that thread and in a memory region corresponding to its parent thread. When the process is complete, a profiler scans through the memory and sums the performance metrics, enabling the performance metrics to be analyzed at the thread level or the process level.
Japanese Laid-open Patent Publication No. 2011-180725, Japanese Laid-open Patent Publication No. 2004-272582, and Japanese Laid-open Patent Publication No. 9-237203 are known as related art examples.