The present invention relates to an independent checkpointing method that recovers from faults occurring while a distributed system is running and reduces job completion time. In the method in accordance with an embodiment of the present invention, transmitting processes attach their own checkpoint number to each message they send, and, before processing a received message, a receiving process determines whether a memory checkpoint is to be performed by referring to the transmitting process's checkpoint number, the current process's checkpoint number, a memory checkpoint flag, and a message transmission flag. When performing periodical checkpointing, the method refers to the result of the memory checkpoint.
Checkpointing technology stores the status information of each process and, when an error occurs in a running distributed system environment, recovers from the error by using the stored information.
Several studies have been done in the area of checkpointing technology. An adaptive checkpointing algorithm was proposed by Jian Xu et al. in the Proceedings of the Fifth IEEE Symposium on Parallel and Distributed Processing in 1993. The adaptive checkpointing algorithm checks whether an input message would form a zigzag cycle. It is based on the fact that input messages forming zigzag cycles may cause a domino effect. Therefore, if an input message would form a zigzag cycle, the adaptive checkpointing algorithm performs checkpointing to remove the zigzag cycle before the message is processed.
A lazy checkpointing algorithm was proposed by Wang et al. in Technical Report CRHC-92-27. In the lazy checkpointing algorithm, a message transmission process attaches its own checkpoint number to each message to be transmitted, and the message receiving process compares the checkpoint number carried by the message with its own checkpoint number before processing the message. If the checkpoint number of the transmission process is larger, checkpointing is performed before the message is processed. At that point, the checkpoint numbers of the two processes become identical.
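The lazy checkpointing rule described above can be illustrated with a minimal sketch. The class name LazyProcess, the dictionary-based message layout, and the in-memory list standing in for checkpoint storage are all hypothetical and chosen only for illustration; they are not taken from the cited report.

```python
from dataclasses import dataclass, field


@dataclass
class LazyProcess:
    """Hypothetical process model for the lazy checkpointing rule."""
    checkpoint_number: int = 0
    state: list = field(default_factory=list)        # stand-in for application state
    checkpoints: list = field(default_factory=list)  # (number, state snapshot) pairs

    def send(self, payload):
        # The sender attaches its own checkpoint number to the message.
        return {"payload": payload, "sender_ckpt": self.checkpoint_number}

    def receive(self, message):
        # If the sender's number is larger, checkpoint BEFORE processing;
        # afterwards both processes hold the same checkpoint number.
        if message["sender_ckpt"] > self.checkpoint_number:
            self.checkpoint_number = message["sender_ckpt"]
            self.checkpoints.append((self.checkpoint_number, list(self.state)))
        # Processing is modeled as appending the payload to the state.
        self.state.append(message["payload"])
```

In this sketch every message from a process that is ahead forces a checkpoint at the receiver, which is the behavior that can lead to a large number of checkpoints when message traffic is heavy.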
However, when the volume of message transmission and the difference in execution speed between processes increase, these checkpointing methods may produce a large number of checkpoints and increase job completion time.
The number of checkpoints is directly related to job completion time in an error-free environment, and the rollback distance is directly related to job completion time in an environment with errors. Therefore, a large number of checkpoints and an increase in rollback distance delay job completion.
An independent checkpointing method using memory checkpoint on a distributed system is provided.
The independent checkpointing method in accordance with an embodiment of the present invention includes a message transmission routine, a message processing routine, and a periodical checkpoint routine. The message transmission routine adds the checkpoint number of the current process to a message to be transmitted when the current process tries to send a message to another process. When a message is received from a process, the message processing routine performs a memory checkpoint if required and processes the message, in reference to the checkpoint number of the transmission process, the checkpoint number of the current process, a memory checkpoint flag, and a message transmission flag. The periodical checkpoint routine periodically performs a checkpoint that records the state information necessary for recovery against faults, in reference to the memory checkpoint flag.
Preferably, the message transmission routine includes the following steps. A step is to generate a message to be transmitted. Another step is to add the checkpoint number of the current process to the message to be transmitted. A further step is to set the message transmission flag true to prepare for the case in which an orphan message occurs. An additional step is to transmit the message.
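The four steps of the message transmission routine can be sketched as follows. This is a minimal illustration, assuming a dictionary-based message and a caller-supplied transport callable; the names Process, send_message, and transport are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """Hypothetical holder of the per-process checkpointing variables."""
    checkpoint_number: int = 0
    message_transmission_flag: bool = False

    def send_message(self, payload, transport):
        # Step 1: generate the message to be transmitted.
        message = {"payload": payload}
        # Step 2: add the checkpoint number of the current process.
        message["checkpoint_number"] = self.checkpoint_number
        # Step 3: set the message transmission flag true, so that a later
        # memory checkpoint can guard against orphan messages.
        self.message_transmission_flag = True
        # Step 4: transmit (transport is an assumed callable, e.g. list.append).
        transport(message)
        return message
```

For example, passing outbox.append as the transport collects the tagged messages in a list named outbox.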
Preferably, the message processing routine includes the following steps. A first step is to receive the message from the process and compare the checkpoint number of the transmission process with the checkpoint number of the current process. Another step is to process the received message if the checkpoint number of the transmission process is smaller than or equal to the checkpoint number of the current process. A further step is to check the memory checkpoint flag if the checkpoint number of the transmission process is larger than the checkpoint number of the current process. Another step is to replace the checkpoint number of the current process with the checkpoint number of the transmission process and process the received message if the checked memory checkpoint flag is true. A further step is to check the message transmission flag if the checked memory checkpoint flag is false. An additional step is to record the state information of the current process into a memory, set the checkpoint number of the current process to the checkpoint number of the transmission process, set the memory checkpoint flag true, set the message transmission flag false, and process the received message if the checked message transmission flag is true. Another additional step is to replace the checkpoint number of the current process with the checkpoint number of the transmission process and process the received message if the checked message transmission flag is false.
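The decision logic of the message processing routine can be sketched as a nested conditional. This is an illustrative sketch only: the Process class, the dictionary-based state, the memory_checkpoint attribute holding the in-memory snapshot, and the trivial process method are assumptions made for the example.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Process:
    """Hypothetical holder of the per-process checkpointing variables."""
    checkpoint_number: int = 0
    memory_checkpoint_flag: bool = False
    message_transmission_flag: bool = False
    state: dict = field(default_factory=dict)   # stand-in for application state
    memory_checkpoint: Optional[dict] = None    # snapshot kept in memory

    def process(self, message):
        # Placeholder for application-level processing of the message.
        self.state.setdefault("received", []).append(message["payload"])

    def on_receive(self, message):
        sender_number = message["checkpoint_number"]
        if sender_number > self.checkpoint_number:
            if self.memory_checkpoint_flag:
                # A memory checkpoint already exists: only adopt the number.
                self.checkpoint_number = sender_number
            elif self.message_transmission_flag:
                # Messages were sent since the last checkpoint: record the
                # state in memory, adopt the number, and update both flags.
                self.memory_checkpoint = copy.deepcopy(self.state)
                self.checkpoint_number = sender_number
                self.memory_checkpoint_flag = True
                self.message_transmission_flag = False
            else:
                # Nothing was sent: adopting the number is sufficient.
                self.checkpoint_number = sender_number
        # In every branch the received message is processed last.
        self.process(message)
```

Note that the state is copied to memory in only one of the four branches, which is how the routine avoids taking a checkpoint for every message from a process that is ahead.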
Preferably, the periodical checkpoint routine includes the following steps. An initial step is to check the memory checkpoint flag at a periodical checkpoint time. Another step is to record the state information stored in the memory to a disk if the checked memory checkpoint flag is true, or to record the current state information to a disk and increase the checkpoint number by one if the checked memory checkpoint flag is false. A further step is to calculate a next periodical checkpoint time. An additional step is to set the memory checkpoint flag and the message transmission flag false.
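The periodical checkpoint routine can be sketched in the same style. Several details here are assumptions not fixed by the description above: a Python list stands in for the disk, each disk record is tagged with the checkpoint number in force after the step, and the next checkpoint time is computed with a fixed interval. The names Process, periodic_checkpoint, now, and interval are hypothetical.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Process:
    """Hypothetical holder of the per-process checkpointing variables."""
    checkpoint_number: int = 0
    memory_checkpoint_flag: bool = False
    message_transmission_flag: bool = False
    state: dict = field(default_factory=dict)   # stand-in for application state
    memory_checkpoint: Optional[dict] = None    # snapshot kept in memory
    disk: list = field(default_factory=list)    # stand-in for stable storage
    next_checkpoint_time: float = 0.0

    def periodic_checkpoint(self, now, interval):
        if self.memory_checkpoint_flag:
            # Flush the state already saved in memory; the number is unchanged.
            snapshot = copy.deepcopy(self.memory_checkpoint)
        else:
            # No memory checkpoint: save the current state and advance the number.
            snapshot = copy.deepcopy(self.state)
            self.checkpoint_number += 1
        # Assumption: the record is tagged with the resulting checkpoint number.
        self.disk.append((self.checkpoint_number, snapshot))
        # Assumption: a fixed interval determines the next checkpoint time.
        self.next_checkpoint_time = now + interval
        # Reset both flags for the next checkpoint interval.
        self.memory_checkpoint_flag = False
        self.message_transmission_flag = False
```

When the memory checkpoint flag is true, the state written to disk is the earlier in-memory snapshot rather than the current state, so the checkpoint number does not advance.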