Programs used in a parallel computer system including a plurality of computing nodes include a master-worker type parallel program. Master-worker type parallel programs include a single-layer master-worker type parallel program and a multi-layer master-worker type parallel program.
In a single-layer master-worker type parallel program, a computing task to be executed by the parallel program is divided into a plurality of subtasks. Each subtask may be executed on an arbitrary computing node belonging to a computer cluster. A subtask whose execution has been started terminates within a finite time period in the absence of an abnormality in the computing node (e.g., a hardware fault or a software fault in the computing node).
There are times when execution of a given subtask depends on a result of execution of a different subtask. For example, when subtask A is dependent on a result of execution of subtask B, execution of subtask A is started after termination of execution of subtask B. Subtasks having no interdependence may be executed concurrently (in parallel).
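The dependency rule above can be illustrated with a minimal sketch. The function name `ready_subtasks` and the dictionary layout of `deps` are assumptions introduced for illustration and do not appear in the cited publications.

```python
def ready_subtasks(deps, done, running=frozenset()):
    """Return the subtasks whose execution may be started.

    deps    : dict mapping a subtask name to the names it depends on
              (hypothetical layout, chosen for this sketch)
    done    : set of subtasks whose execution has terminated
    running : set of subtasks currently being executed
    """
    return sorted(
        name
        for name, prereqs in deps.items()
        if name not in done
        and name not in running
        and set(prereqs) <= set(done)  # every prerequisite has terminated
    )


# Subtask A depends on subtask B; subtasks B and C have no dependence.
deps = {"A": ["B"], "B": [], "C": []}
print(ready_subtasks(deps, done=set()))   # ['B', 'C']
print(ready_subtasks(deps, done={"B"}))   # ['A', 'C']
```

In this example, subtasks B and C may be executed concurrently, while subtask A becomes startable only after subtask B has terminated.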
In the single-layer master-worker type parallel program, one master process and a plurality of worker processes are generated. These processes are generated on computing nodes available for the parallel program (all computing nodes of a computer cluster may be available, or only those computing nodes allocated in units called jobs may be available).
The master process performs management and execution control of subtasks. Each worker process executes a subtask at the request of the master process. The master process searches for a subtask whose execution may be started and requests a worker process to execute the subtask. The requested worker process starts executing the subtask. When the execution of the subtask terminates, the worker process notifies the master process of the termination of the execution and of the results of the execution. The master process compiles the execution results. When execution of all the subtasks has terminated, the task of the parallel program terminates.
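The control flow of the master process described above may be sketched as follows. This is an illustration only: it assumes that worker processes are simulated by threads of a single Python process rather than by processes on separate computing nodes, and the names `run_master`, `subtasks`, and `deps` are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_master(subtasks, deps, num_workers=4):
    """Execute all subtasks in dependency order and compile their results.

    subtasks : dict mapping a subtask name to a callable; the callable
               receives the results of its prerequisites as arguments
    deps     : dict mapping a subtask name to a list of prerequisite names
    """
    done = {}      # subtask name -> execution result
    pending = {}   # future -> name of the subtask being executed
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while len(done) < len(subtasks):
            # Search for subtasks whose execution may be started.
            for name, func in subtasks.items():
                prereqs = deps.get(name, [])
                if (name not in done
                        and name not in pending.values()
                        and all(p in done for p in prereqs)):
                    # Request a worker to execute the subtask.
                    future = pool.submit(func, *[done[p] for p in prereqs])
                    pending[future] = name
            if not pending:
                raise ValueError("remaining subtasks can never be started")
            # Wait for a termination notification from some worker.
            finished, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in finished:
                done[pending.pop(future)] = future.result()
    return done


subtasks = {"B": lambda: 2, "C": lambda: 3, "A": lambda b: b * 10}
deps = {"A": ["B"]}
print(run_master(subtasks, deps))
```

The sketch omits monitoring of the worker processes; a subtask that raises an exception simply propagates the exception to the caller.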
The master process also monitors the status (operation) of each worker process. The master process may sense whether a subtask is being executed normally by a method such as periodic communication with each worker process. When a subtask terminates abnormally due to an abnormality in a computing node, the master process determines that the worker process that was executing the abnormally terminated subtask is in an abnormal state. The master process excludes the worker process determined to be in an abnormal state from the group of available worker processes. At this time, the master process requests a different worker process to execute the subtask that was being executed by the worker process determined to be in an abnormal state.
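The monitoring and re-request steps above may be sketched as follows, assuming that periodic communication takes the form of heartbeat timestamps and that a worker process is regarded as abnormal when its last heartbeat is older than a timeout. The names `find_failed` and `reassign`, the timeout criterion, and the round-robin choice of the replacement worker are assumptions made for this sketch.

```python
def find_failed(last_heartbeat, now, timeout):
    """Return the workers whose last heartbeat is older than `timeout`."""
    return {w for w, t in last_heartbeat.items() if now - t > timeout}


def reassign(assignments, available, failed):
    """Exclude failed workers and hand their subtasks to other workers.

    assignments : dict mapping a subtask name to the worker executing it
    available   : list of workers in the group of available workers
    failed      : set of workers determined to be in an abnormal state
    """
    remaining = [w for w in available if w not in failed]
    if not remaining:
        raise RuntimeError("no available worker process remains")
    new_assignments = {}
    next_index = 0
    for subtask, worker in assignments.items():
        if worker in failed:
            # Request a different worker (round-robin, an assumption).
            new_assignments[subtask] = remaining[next_index % len(remaining)]
            next_index += 1
        else:
            new_assignments[subtask] = worker
    return new_assignments, remaining


last_heartbeat = {"w0": 10.0, "w1": 4.0, "w2": 9.5}
failed = find_failed(last_heartbeat, now=10.0, timeout=3.0)
assignments = {"s0": "w0", "s1": "w1", "s2": "w2"}
print(reassign(assignments, ["w0", "w1", "w2"], failed))
# ({'s0': 'w0', 's1': 'w0', 's2': 'w2'}, ['w0', 'w2'])
```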
In a multi-layer master-worker type parallel program, one master process, a plurality of submaster processes having a fixed hierarchical structure, and a plurality of worker processes are generated. The master process does not directly monitor a worker process but monitors a submaster process. Each submaster process monitors a worker process subordinate to itself. In this manner, overall monitoring of the worker processes is performed.
For further information, see Japanese Laid-Open Patent Publication No. 2011-118899, Japanese Laid-Open Patent Publication No. 2005-242986, Japanese Laid-Open Patent Publication No. 2003-280933, and Japanese Laid-Open Patent Publication No. 2008-243216.
In a single-layer master-worker type parallel program, a single master process monitors all worker processes. This causes the monitoring load on the master process to increase in proportion to the number of worker processes. Thus, execution of a large-scale master-worker type parallel program may impose an enormous monitoring cost on the master process. In contrast, in a multi-layer master-worker type parallel program, the number of processes directly monitored by the master process and the number of processes directly monitored by each submaster process are smaller. This makes the monitoring load borne by any one process lower than the monitoring load in the single-layer master-worker type parallel program.
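The difference in monitoring load may be illustrated with a simple count. The sketch below assumes a two-layer structure in which the worker processes are divided as evenly as possible among the submaster processes, and takes the number of directly monitored processes as the measure of monitoring load; the function name `max_monitoring_load` is hypothetical.

```python
import math


def max_monitoring_load(num_workers, num_submasters=0):
    """Largest number of processes directly monitored by any one process.

    num_submasters == 0 models the single-layer program, in which the
    master process monitors every worker process directly.
    """
    if num_submasters == 0:
        return num_workers
    # The master monitors the submasters; each submaster monitors its
    # share of the workers (even split, an assumption of this sketch).
    per_submaster = math.ceil(num_workers / num_submasters)
    return max(num_submasters, per_submaster)


print(max_monitoring_load(1000))       # 1000
print(max_monitoring_load(1000, 32))   # 32
```

With 1000 worker processes, the single master process monitors 1000 processes, whereas with 32 submaster processes no process monitors more than 32.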
The multi-layer master-worker type parallel program, however, suffers from the problem below. In the single-layer master-worker type parallel program, the master process is the only process that does not execute subtasks. For this reason, the one computing node which executes the master process is the only computing node which does not perform actual computation processing. In contrast, in the multi-layer master-worker type parallel program, neither the master process nor any submaster process executes a subtask. Since the multi-layer master-worker type parallel program thus has a larger number of computing nodes which do not perform actual computation processing than the single-layer master-worker type parallel program, the multi-layer master-worker type parallel program suffers from the problem of lower efficiency of computing node utilization (node utilization efficiency).
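The loss of node utilization efficiency may be quantified with a simple ratio. The sketch below assumes that each process occupies exactly one computing node and defines the efficiency as the fraction of nodes that execute worker processes; this definition and the function name `node_utilization` are assumptions made for illustration.

```python
def node_utilization(total_nodes, num_submasters=0):
    """Fraction of computing nodes that perform actual computation.

    One node is reserved for the master process and one node for each
    submaster process; all remaining nodes execute worker processes.
    """
    workers = total_nodes - 1 - num_submasters
    if workers <= 0:
        raise ValueError("no computing node is left for worker processes")
    return workers / total_nodes


print(node_utilization(1024))       # 0.9990234375 (single-layer)
print(node_utilization(1024, 31))   # 0.96875 (multi-layer, 31 submasters)
```

Under these assumptions, introducing 31 submaster processes on 1024 computing nodes removes 31 additional nodes from actual computation processing.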