1. Field of the Invention
The present invention relates to a process state management method and a process state management system, and more particularly to a method and a system for managing process states using checkpoints in cases where one process is generated from another process.
2. Description of the Background Art
Conventionally, program execution based on checkpoints has been known as a method for improving the reliability of program execution in a computer. In this method, the states of processes, which are the executing entities of a program, are acquired either regularly or irregularly according to prescribed checkpoint timings during execution of the program, and the program is re-executed from the process states acquired at the nearest checkpoint when a trouble occurs during the program execution. Here, the checkpoint is defined as a time for carrying out the processing to acquire the process states when the program execution is viewed in a time sequence, and the checkpoint timing is defined as a time range from one checkpoint to a next checkpoint.
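The checkpoint-based re-execution described above can be illustrated for a single process by the following minimal sketch. It is an illustration under stated assumptions, not the method of any particular system: the names `run_with_checkpoints`, `Fault`, `interval` and `fail_at` are hypothetical, the process state is modeled as a dictionary, and a trouble is modeled as an exception injected once at a chosen step.

```python
import copy


class Fault(Exception):
    """Stands in for a trouble occurring during program execution."""


def run_with_checkpoints(steps, state, interval, fail_at=None):
    """Apply each function in `steps` to `state`, checkpointing every `interval` steps.

    If `fail_at` is given, a fault is injected once at that step index; the
    state is then restored from the nearest checkpoint and execution resumes
    from the step at which that checkpoint was taken.
    Returns (final_state, number_of_rollbacks).
    """
    checkpoint = (0, copy.deepcopy(state))  # (step index, saved state)
    rollbacks = 0
    i = 0
    while i < len(steps):
        if i % interval == 0:
            # Checkpoint: acquire the process state before executing step i.
            checkpoint = (i, copy.deepcopy(state))
        try:
            if i == fail_at:
                fail_at = None  # hypothetical transient fault, occurs only once
                raise Fault(f"fault at step {i}")
            steps[i](state)
            i += 1
        except Fault:
            # Roll back to the nearest checkpoint and re-execute from there.
            i, saved = checkpoint
            state = copy.deepcopy(saved)
            rollbacks += 1
    return state, rollbacks


if __name__ == "__main__":
    # Six steps that add 1..6 to a running total.
    steps = [lambda s, k=k: s.update(total=s["total"] + k) for k in range(1, 7)]
    final, rollbacks = run_with_checkpoints(steps, {"total": 0}, interval=3, fail_at=4)
    print(final, rollbacks)  # {'total': 21} 1
```

Because the state is restored from the checkpoint taken at step 3, steps 3 and 4 are re-executed after the fault and the final result is the same as in a fault-free run.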
Now, in a system in which one process operates independently, it is sufficient to acquire the process states only at the checkpoints for intermediate states of that process, but in a case where a plurality of processes operate in relation to one another, such as through inter-process communications, it is insufficient to acquire the process states for a single process alone according to the checkpoints. Namely, in order to prevent an occurrence of contradiction at a time of re-execution, there is a need to acquire process states for a plurality of mutually related processes at each checkpoint. In the following, for the sake of convenience, a checkpoint for each process is referred to as a local checkpoint, and a set of local checkpoints for mutually related processes is referred to as a distributed checkpoint.
As described above, in a case where a plurality of processes operate in relation to one another, such as through inter-process communications, it is necessary to acquire the process states of this plurality of mutually related processes consistently (without contradiction). This point will now be illustrated in further detail by referring to FIGS. 1A, 1B and 1C.
Namely, FIGS. 1A, 1B and 1C show examples of a distributed checkpoint. More specifically, FIGS. 1A, 1B and 1C show three types of distributed checkpoints CH1, CH2, and CH3 in a case where processing is carried out while each one of three processes p1, p2 and p3 carries out message passing. In FIGS. 1A, 1B, and 1C, a symbol m indicates a message, and the two numerals suffixed to this symbol m indicate a message transmission side process number and a message reception side process number, respectively.
In FIG. 1A, at the distributed checkpoint CH1, there are no contradicting states for any message when the process states are acquired according to local checkpoints ch11, ch12 and ch13, so that the message passing can be carried out correctly even when the processing is restarted by rolling back to the nearest checkpoint. However, in FIG. 1B, for a message m32 at the distributed checkpoint CH2, despite the fact that the process p3 is still in a state of not transmitting this message at the local checkpoint ch13, the process p2 is in a state of already receiving this message at the local checkpoint ch12. For this reason, when a trouble occurs in any one process and the processing is to be restarted by rolling back to the distributed checkpoint CH2, contradicting states regarding the message m32 arise. Similarly, for the distributed checkpoint CH3 of FIG. 1C, contradicting states regarding a message m23 arise.
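The two kinds of problematic message discussed here can be checked mechanically, as in the following minimal sketch. It assumes, purely for illustration, that all events can be placed on one common timeline (which a real distributed system does not provide), and the names `Message` and `classify` as well as all time values are hypothetical and are not taken from the figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    name: str
    sender: str
    receiver: str
    sent_at: float      # time of transmission on the illustrative timeline
    received_at: float  # time of reception on the illustrative timeline


def classify(messages, local_checkpoints):
    """Classify messages against a distributed checkpoint.

    `local_checkpoints` maps each process name to the time of its local
    checkpoint. Returns (orphans, in_transit), where
      - orphans are recorded as received but not as transmitted
        (contradicting states on restart), and
      - in_transit are recorded as transmitted but not as received
        (lost on restart unless stored separately).
    """
    orphans, in_transit = [], []
    for m in messages:
        sent = m.sent_at < local_checkpoints[m.sender]
        received = m.received_at < local_checkpoints[m.receiver]
        if received and not sent:
            orphans.append(m.name)
        elif sent and not received:
            in_transit.append(m.name)
    return orphans, in_transit


if __name__ == "__main__":
    # Hypothetical timings: p3 checkpoints before sending m32, p2 after receiving it.
    checkpoint = {"p1": 5.0, "p2": 7.0, "p3": 4.0}
    msgs = [
        Message("m12", "p1", "p2", 1.0, 2.0),  # sent and received before the checkpoint
        Message("m32", "p3", "p2", 5.0, 6.0),  # received but not yet sent: orphan
    ]
    print(classify(msgs, checkpoint))  # (['m32'], [])
```

A distributed checkpoint for which `classify` returns no orphans corresponds to the situation of FIG. 1A, in which restart from the checkpoint causes no contradiction.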
The conventionally proposed methods for guaranteeing the consistency of distributed checkpoints deal with the message passing, and include a synchronous checkpointing method and an asynchronous checkpointing method.
As a scheme for acquiring process states according to synchronous checkpointing, there is a scheme disclosed in K. Mani Chandy and L. Lamport: "Distributed Snapshots: Determining Global States of Distributed Systems", ACM Transactions on Computer Systems, Vol. 3, No. 1, pp. 63-75 (February 1985). This scheme deals with message passing as the inter-process communication, similarly to the examples described above, and defines the consistent distributed checkpoint as "a state without a message which is not yet transmitted and already received". Here, a message which is not yet transmitted and already received corresponds to the message m32 in the case of FIG. 1B described above.
Also, at CH3, the message m23 would be lost, so that such a message, which is already transmitted and not yet received, is stored as part of the acquired information. As a specific algorithm for this, messages called markers are exchanged at a time of storing process states according to distributed checkpoints in order to detect the messages that cause contradictions, and these messages are stored so that consistent states can be constructed as a whole.
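The marker exchange can be sketched as a small single-threaded simulation in the style of the cited Chandy and Lamport scheme. This is a simplified illustration rather than a reproduction of the published algorithm or of any implementation: it assumes reliable FIFO channels between every pair of processes, models each process state as an integer balance, and uses hypothetical names (`Network`, `Process`, `start_snapshot`, `snapshot_total`).

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, balance):
        self.balance = balance
        self.saved_state = None  # local checkpoint, None until taken
        self.recording = set()   # senders whose channels are still being recorded
        self.channel_log = {}    # sender -> messages recorded as in transit


class Network:
    def __init__(self, balances):
        self.procs = {n: Process(b) for n, b in balances.items()}
        # One FIFO channel per ordered pair (sender, receiver).
        self.chan = {(s, r): deque() for s in balances for r in balances if s != r}

    def send(self, src, dst, amount):
        self.procs[src].balance -= amount
        self.chan[(src, dst)].append(amount)

    def start_snapshot(self, name):
        """Save the local state, start recording all incoming channels, send markers."""
        p = self.procs[name]
        p.saved_state = p.balance
        p.recording = {s for s in self.procs if s != name}
        p.channel_log = {s: [] for s in p.recording}
        for dst in self.procs:
            if dst != name:
                self.chan[(name, dst)].append(MARKER)

    def deliver(self, src, dst):
        msg = self.chan[(src, dst)].popleft()
        p = self.procs[dst]
        if msg == MARKER:
            if p.saved_state is None:
                self.start_snapshot(dst)  # first marker: take the local checkpoint
            p.recording.discard(src)      # this channel's recorded state is complete
        else:
            p.balance += msg
            if src in p.recording:
                # Transmitted before the sender's checkpoint, received after ours.
                p.channel_log[src].append(msg)

    def drain(self):
        """Deliver queued messages until every channel is empty."""
        progressed = True
        while progressed:
            progressed = False
            for (s, r), q in self.chan.items():
                if q:
                    self.deliver(s, r)
                    progressed = True

    def snapshot_total(self):
        """Sum of saved process states plus messages stored as in transit."""
        saved = sum(p.saved_state for p in self.procs.values())
        logged = sum(sum(v) for p in self.procs.values() for v in p.channel_log.values())
        return saved + logged
```

In this model the total of all balances is conserved by message passing, so a consistent distributed checkpoint must account for the same total: messages already transmitted but not yet received at the checkpoint appear in `channel_log` rather than being lost.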
Also, in a general operating system, there are cases where a currently operating process generates a copy of itself as a new process. For example, in UNIX, the fork system call provides this function, by which a process with the same content as the process that called the fork system call is generated. Here, the process that called the fork system call is called a parent process, and the process newly generated from the parent process is called a child process.
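The parent and child relationship created by the fork system call can be observed with the following minimal sketch, which uses Python's `os.fork` wrapper and therefore runs only on UNIX-like systems. The function name `fork_demo` and the variable `value` are hypothetical and serve only as an illustration.

```python
import os


def fork_demo():
    """Fork a child process and report what the parent and the child observe.

    Returns (parent_pid, parent_pid_seen_by_child, child_value, parent_value).
    """
    value = 41
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: starts as a copy of the parent, so value is 41 here.
        try:
            os.close(read_fd)
            value += 1  # changes only the child's copy of the variable
            os.write(write_fd, f"{os.getppid()} {value}".encode())
        finally:
            os._exit(0)  # leave the child without running the parent's code
    # Parent process: fork returned the process ID of the child.
    os.close(write_fd)
    data = os.read(read_fd, 64).decode()
    os.close(read_fd)
    os.waitpid(pid, 0)
    seen_parent_pid, child_value = data.split()
    return os.getpid(), int(seen_parent_pid), int(child_value), value


if __name__ == "__main__":
    print(fork_demo())
```

The child reports the parent's process ID and its own modified copy of `value`, while the parent's copy is unchanged, which shows that the two processes have the same content at the time of the fork but separate states afterwards.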
FIG. 2 shows an exemplary checkpoint in a case of generating a new process in the synchronous checkpointing. In FIG. 2, a process A generates distributed checkpoints CP(n) and CP(n+1), and between these, the process A also generates a process B by using the fork system call. At this point, at CP(n+1), the process A is unrelated to the process B, so that no checkpoint is generated for the process B. However, afterwards, the processes A and B come to have a relationship through messages m1 and m2. Then, when a trouble (fault) F1 occurs later on, the process A is going to be rolled back to CP(n+1) and restarted from there, but the process B has no corresponding checkpoint, so that the process state has not been acquired for the process B and it is therefore impossible to restart the process B correctly.
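The situation of FIG. 2 can be modeled by the following minimal sketch. It is only an illustration of the problem, with hypothetical names (`CheckpointGroup`, `take`, `message`, `unrecoverable`) and placeholder state values; it assumes that a distributed checkpoint covers only the processes registered as mutually related at the time the checkpoint is taken.

```python
class CheckpointGroup:
    """Tracks which processes each distributed checkpoint covers."""

    def __init__(self, members):
        self.members = set(members)  # processes covered by distributed checkpoints
        self.related = set(members)  # processes related through messages so far
        self.checkpoints = {}        # checkpoint label -> {process: saved state}

    def take(self, label, states):
        """Acquire the states of the current members only."""
        self.checkpoints[label] = {p: states[p] for p in self.members}

    def message(self, src, dst):
        """A message makes the sender and the receiver mutually related."""
        self.related |= {src, dst}

    def unrecoverable(self, label):
        """Related processes with no state saved at the given checkpoint."""
        return sorted(self.related - set(self.checkpoints[label]))


if __name__ == "__main__":
    states = {"A": "state of A at CP(n)"}
    group = CheckpointGroup(["A"])
    group.take("CP(n)", states)

    # Process A generates process B by fork; B is not registered with the group.
    states["B"] = states["A"]
    states["A"] = "state of A at CP(n+1)"
    group.take("CP(n+1)", states)  # covers A only

    group.message("A", "B")  # m1
    group.message("B", "A")  # m2

    # Fault F1: roll back to CP(n+1) and list processes that cannot be restarted.
    print(group.unrecoverable("CP(n+1)"))  # ['B']
```

The process B appears as unrecoverable because it became related to the process A only after CP(n+1) was taken, which is the case in which the restart cannot be carried out correctly.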
Thus, in the synchronous checkpointing method for distributed checkpoints that deal with a plurality of processes, it is impossible to acquire process states consistently in a case where a new process is generated from some process, and for this reason, it is impossible to restart a newly generated process correctly in a case where a trouble occurs during the program execution and the restart is required.