1. Field of the Invention
The present invention relates to a process stop method and apparatus. In particular, the present invention relates to a method and apparatus for stopping processes during checkpoint processing executed in a distributed memory system that includes a plurality of nodes interconnected by a network, each of which has at least one thread for parallel processing.
2. Description of the Related Art
Japanese Unexamined Patent Publication No. 8-263317 discloses a checkpoint/restart processing system that controls the order in which a plurality of processes involved in synchronous (or exclusive) control are frozen during checkpoint processing.
However, this checkpoint/restart processing system is designed for a shared memory multi-processor system, not for a distributed memory multi-processor system. In a distributed memory multi-processor system, each processor has its own (local) memory, which is not accessible to processes running on the other processors. If the checkpoint/restart processing system were applied to a distributed memory multi-processor system, processes on different processors might be unable to synchronize for the checkpoint processing, because a process that has already been frozen, and that is the counterpart of the synchronization, cannot respond to a synchronization request from a process on another processor. In such a situation, the processes on the other processors continue to wait for a response from the frozen process, which they will never receive.
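The situation described above can be illustrated with a minimal sketch. It is not taken from either cited publication. It assumes two nodes simulated as threads that communicate only through message queues, standing in for nodes without shared memory. All names (node_b, request_sync, run) are hypothetical, and the finite timeout is an assumption added so that the example terminates; in the situation described above the requesting process would wait indefinitely.

```python
import queue
import threading


def node_b(inbox, outbox, frozen, stop):
    """Counterpart node: acknowledges sync requests unless it is frozen."""
    while not stop.is_set():
        try:
            msg = inbox.get(timeout=0.05)
        except queue.Empty:
            continue
        if frozen.is_set():
            # A frozen process cannot run, so the request goes unanswered.
            continue
        outbox.put(("sync_ack", msg))


def request_sync(to_b, from_b, timeout):
    """Requesting node: send a sync request and report whether it was acknowledged."""
    to_b.put("sync_request")
    try:
        from_b.get(timeout=timeout)
        return True
    except queue.Empty:
        # Stand-in for the indefinite wait described in the text.
        return False


def run(freeze_b_first):
    """Run one synchronization attempt, optionally freezing node B beforehand."""
    to_b, from_b = queue.Queue(), queue.Queue()
    frozen, stop = threading.Event(), threading.Event()
    if freeze_b_first:
        frozen.set()
    worker = threading.Thread(target=node_b, args=(to_b, from_b, frozen, stop))
    worker.start()
    acknowledged = request_sync(to_b, from_b, timeout=0.5)
    stop.set()
    worker.join()
    return acknowledged
```

Calling run(False) models the case in which the counterpart is still running and the synchronization completes, whereas run(True) models the case in which the counterpart was frozen first and the request is never acknowledged.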
Japanese Unexamined Patent Publication No. 2-287858 discloses a restart system for a distributed processing system. In this restart system, whenever the communication control part in a processor requests to receive data from or send data to another processor, a program which causes the communication control part to execute such processing is saved as checkpoint data.
However, the restart system in the latter example cannot save checkpoint data at an arbitrary time, because checkpoint data is saved only when a communication request occurs. Further, the frequent saving of checkpoint data has the adverse effect of lowering the performance of parallel processing.