1. Field of the Invention
The present invention relates to a scheme for restarting computers, and more particularly, to a scheme for restarting computers in cases where process states are to be acquired according to checkpoints in a client-server computer system. In the following description, a process state is to be generally construed as information related to process execution.
2. Description of the Background Art
Conventionally, program execution according to checkpoints has been known as a method for improving the reliability of program execution in a computer. In this method, the states of the processes that are the executing entities of a program are acquired, either regularly or irregularly, according to prescribed checkpoint timings during execution of the program, and when a fault occurs during the program execution, the program is re-executed from the process states acquired at the nearest checkpoint. Here, the checkpoint is defined as a time for carrying out the processing to acquire the process states when the program execution is viewed in time sequence, and the checkpoint timing is defined as a time range from one checkpoint to the next checkpoint.
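The checkpoint and re-execution cycle described above can be illustrated for a single process by the following minimal sketch. The class `CheckpointedCounter`, the function `run_with_fault`, and the parameters `checkpoint_every` and `fail_at` are hypothetical names introduced only for this illustration; the sketch assumes the process state is a small in-memory record and that a single fault is simulated.

```python
import copy


class CheckpointedCounter:
    """Toy process whose state is a step count and a running total (illustrative only)."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        # State acquired at the most recent checkpoint (initially the start state).
        self._saved = copy.deepcopy(self.state)

    def work(self, value):
        self.state["step"] += 1
        self.state["total"] += value

    def checkpoint(self):
        # Acquire the process state.
        self._saved = copy.deepcopy(self.state)

    def restart(self):
        # Roll back to the state acquired at the nearest checkpoint.
        self.state = copy.deepcopy(self._saved)


def run_with_fault(values, checkpoint_every, fail_at):
    """Process all values, simulating one fault just before index fail_at."""
    proc = CheckpointedCounter()
    failed = False
    i = 0
    while i < len(values):
        if i == fail_at and not failed:
            failed = True
            proc.restart()
            i = proc.state["step"]  # resume from the step recorded in the checkpoint
            continue
        proc.work(values[i])
        i += 1
        if i % checkpoint_every == 0:
            proc.checkpoint()
    return proc.state["total"]
```

In this sketch, the work done between the nearest checkpoint and the fault is simply repeated, so the final total is the same as in a fault-free run.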
Now, in a system in which one process operates independently, it is sufficient to acquire the states of that process alone at its checkpoints. However, in a case where a plurality of processes operate in relation to one another, for example through inter-process communications, it is insufficient to acquire the process states for a single process alone according to the checkpoints. Namely, in order to prevent an occurrence of contradiction at a time of re-execution, there is a need to acquire the process states for a plurality of mutually related processes at each checkpoint. In the following, for the sake of convenience, a checkpoint for each process is referred to as a local checkpoint, and a set of local checkpoints for mutually related processes is referred to as a distributed checkpoint.
Also, when a fault occurs in some process or in a computer on which that process is operating, it is necessary to re-execute (restart) a plurality of processes by going back to the nearest checkpoint. This is usually referred to as a roll back. A case of applying such a checkpoint/restart mechanism to a distributed system will be referred to as a distributed checkpoint/restart scheme.
Conventionally known methods for acquiring process states according to distributed checkpoints can be largely classified into the following two types.
(1) A process state acquisition based on synchronous distributed checkpoint
(2) A process state acquisition based on asynchronous distributed checkpoint
FIG. 1A shows an exemplary synchronous distributed checkpointing scheme, where a distributed checkpoint CH1 is indicated for a case in which three processes A, B and C execute processing while carrying out message passing.
As a scheme for acquiring process states according to synchronous checkpointing (a synchronous distributed checkpointing scheme), there is a scheme disclosed in K. Mani Chandy and L. Lamport: "Distributed Snapshots: Determining Global States of Distributed Systems", ACM Trans. on Computer Systems, Vol. 3, No. 1, pp. 63-75, February 1985. This scheme deals with message passing as the inter-process communication, and defines the consistent distributed checkpoint as "a state without a message which is not yet transmitted and already received". More specifically, in this scheme, messages called markers are exchanged at a time of storing process states according to distributed checkpoints so as to detect messages that would cause contradictions, and these messages are stored together with the process states so that a state that is consistent as a whole can be constructed. Consequently, at the distributed checkpoint CH1 shown in FIG. 1A, each local checkpoint is set in a consistent state with respect to each message.
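The consistency condition quoted above can be checked mechanically, as in the following minimal sketch. It does not implement the marker exchange of the cited scheme; it only tests whether a given set of local checkpoints contains a message that is recorded as received but not yet as transmitted. The names `Message`, `orphan_messages` and `is_consistent`, and the use of per-process event counts as positions, are assumptions made for this illustration.

```python
from typing import NamedTuple


class Message(NamedTuple):
    sender: str
    send_seq: int   # sender's local event count at which the message is sent
    receiver: str
    recv_seq: int   # receiver's local event count at which it is received


def orphan_messages(cut, messages):
    """Return messages that are received but not yet transmitted under the cut.

    cut maps each process name to the local event count covered by its
    local checkpoint.
    """
    return [
        m for m in messages
        if m.recv_seq <= cut[m.receiver] and m.send_seq > cut[m.sender]
    ]


def is_consistent(cut, messages):
    """A distributed checkpoint is consistent when it has no such message."""
    return not orphan_messages(cut, messages)
```

For example, with a message sent by A at its first event and received by B at its second event, a set of local checkpoints that covers B's second event but none of A's events is reported as inconsistent.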
On the other hand, FIG. 1B shows an exemplary asynchronous distributed checkpointing scheme. As indicated in FIG. 1B, in the asynchronous distributed checkpointing scheme, a process state is acquired according to a checkpoint which is located at an arbitrary timing in each process. As a scheme for realizing the asynchronous checkpointing scheme, there is a scheme disclosed in R. E. Strom and S. Yemini: "Optimistic Recovery in Distributed Systems", ACM Trans. on Computer Systems, Vol. 3, No. 3, pp. 204-226, August 1985. In this scheme, when a fault occurs in a process B, the process B is rolled back to a checkpoint CHb, but then this process B requires reproduction of messages m5 and m6, so that the processes A and C are also rolled back to checkpoints CHa and CHc respectively. Then, the process C requires reproduction of a message m4, so that there is a need to further roll back the process B to a checkpoint earlier than the checkpoint CHb. This state of chained roll back of processes is called a cascade roll back.
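The chained roll back described above can be computed as a fixed point, as in the following minimal sketch. It models only the dependency named in the text, namely that a restarted process needs messages it had received to be reproduced by their senders, and it assumes that no messages are stored. The function name `cascade_rollback`, the event-count positions, and the numeric values in the usage example are hypothetical and are not taken from FIG. 1B.

```python
def cascade_rollback(checkpoints, progress, messages, failed):
    """Return {process: position to restart from} after a fault in `failed`.

    checkpoints: {process: list of checkpoint positions, including 0}
    progress:    {process: number of local events executed so far}
    messages:    list of (sender, send_seq, receiver, recv_seq) tuples,
                 with send_seq >= 1
    """
    pos = dict(progress)
    # The failed process goes back to its nearest checkpoint.
    pos[failed] = max(c for c in checkpoints[failed] if c <= progress[failed])
    changed = True
    while changed:
        changed = False
        for sender, send_seq, receiver, recv_seq in messages:
            # The receive was executed but is undone by the receiver's roll back.
            undone_receive = pos[receiver] < recv_seq <= progress[receiver]
            # The sender is past the send, so it will not transmit again.
            already_sent = send_seq <= pos[sender]
            if undone_receive and already_sent:
                # Roll the sender back to a checkpoint taken before the send.
                pos[sender] = max(c for c in checkpoints[sender] if c < send_seq)
                changed = True
    return pos


# Hypothetical positions loosely modeled on the scenario in the text.
checkpoints = {"A": [0, 2], "B": [0, 3], "C": [0, 2]}
progress = {"A": 4, "B": 6, "C": 5}
messages = [
    ("A", 3, "B", 4),  # m5
    ("C", 4, "B", 5),  # m6
    ("B", 2, "C", 3),  # m4
]
restart = cascade_rollback(checkpoints, progress, messages, failed="B")
```

With these assumed values, B first returns to its checkpoint at position 3, A and C return to their checkpoints at position 2, and the need to reproduce m4 then forces B back further to position 0.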
In the asynchronous checkpointing scheme, a method called message logging is adopted, in which each process stores the messages it has received, so as to prevent the cascade roll back. In FIG. 1B, those messages for which storing has been completed are indicated by black triangles while those messages for which storing has not been completed are indicated by blank triangles. In FIG. 1B, when a fault occurs in the process B, the process B is restarted from the checkpoint CHb, and its execution can be reproduced up to a state immediately before receiving the message m6 because the message m5 is stored. However, the message m6 is lost, so that the process C is also re-executed from the checkpoint CHc so as to re-execute receiving of the stored message m4 and transmitting of the message m6. As for the process A, its execution is continued without any roll back.
Here, each process carries out the receiving processing after the restart according to the stored messages, so that an operation of each process must be deterministic (that is, it must produce the same result whenever the same processing is re-executed). This is because, if an operation of the process is nondeterministic, there would be a possibility for a transmitting side process to generate a message different from the received message that is already stored.
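The effect of message logging on the extent of the roll back can be illustrated by the following minimal sketch, in which a stored message is replayed from the receiver's log and therefore does not force its sender to roll back. The function name `restart_points`, the event-count positions, and the numeric values are hypothetical; only the message names m4, m5 and m6 and the stored/lost status of each follow the description above.

```python
def restart_points(checkpoints, progress, messages, logged, failed):
    """Return {process: position to restart from} when some messages are stored.

    checkpoints: {process: list of checkpoint positions, including 0}
    progress:    {process: number of local events executed so far}
    messages:    {name: (sender, send_seq, receiver, recv_seq)}
    logged:      set of message names whose storing has been completed
    """
    pos = dict(progress)
    pos[failed] = max(c for c in checkpoints[failed] if c <= progress[failed])
    changed = True
    while changed:
        changed = False
        for name, (sender, send_seq, receiver, recv_seq) in messages.items():
            if name in logged:
                continue  # replayed from the receiver's log; sender unaffected
            undone_receive = pos[receiver] < recv_seq <= progress[receiver]
            if undone_receive and send_seq <= pos[sender]:
                # The lost message must be transmitted again by its sender.
                pos[sender] = max(c for c in checkpoints[sender] if c < send_seq)
                changed = True
    return pos


# Hypothetical positions; m5 and m4 are stored, m6 is lost.
checkpoints = {"A": [0, 2], "B": [0, 3], "C": [0, 2]}
progress = {"A": 4, "B": 6, "C": 5}
messages = {
    "m5": ("A", 3, "B", 4),
    "m6": ("C", 4, "B", 5),
    "m4": ("B", 2, "C", 3),
}
points = restart_points(checkpoints, progress, messages, {"m5", "m4"}, "B")
```

With these assumed values, B restarts from its checkpoint at position 3, C restarts from its checkpoint at position 2 in order to transmit m6 again, and A keeps its current position, which matches the outcome described for FIG. 1B.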
As described, according to the distributed checkpointing scheme, when a fault occurs in one process or computer, the roll back/restart is caused not just for that one process but also for the other processes which are mutually related with that one process.
FIG. 2 shows a conceptual configuration of a system in the client-server model, which is a general model for distributed systems. FIG. 2 shows an exemplary case where three processes A, B and C of FIGS. 1A and 1B are operating on a client computer C1, a client computer C2 and a server computer S, respectively. Usually, in the client-server system, client computers C1 and C2 are terminals to be directly used by users, and client processes on a plurality of client computers request processing from a server process on the server computer S. The server process then carries out the requested processing and returns a processing result to the client process, and then the client process displays the result received from the server on a screen so as to notify the user.
FIGS. 3A and 3B conceptually show the distributed checkpoint/restart scheme, where FIG. 3A shows a case of the synchronous distributed checkpointing scheme while FIG. 3B shows a case of the asynchronous distributed checkpointing scheme, for an exemplary state in which a fault occurred in the client computer C1 at a timing F1. In either scheme, as the fault occurred in the process A, the processes B and C are also to be restarted from the nearest checkpoints.
In general, the client computer has a lower reliability than the server computer, so that machine malfunctions occur more frequently in the client computer. As described above, the conventional distributed checkpoint/restart schemes are associated with the problem that, when a fault occurs in one client computer, the entire system including the other client computers and the server computer is to be rolled back. This problem is a very serious one in a case of the client-server system comprising one server computer and hundreds of client computers, because there is a possibility for all the processes on the client computers of all users to be rolled back when a client computer used by just one user malfunctions.