1. Field of the Invention
The present invention relates to a method and apparatus for recovering from faults in a distributed memory multiprocessor computing system, and more particularly to fault recovery in a checkpointing and rollback type fault tolerant computing system of the distributed memory multinode type.
In particular, this invention relates to a method and apparatus for communicating a message to another node while avoiding delay among the distributed nodes in a large scale multinode computing system that achieves fault tolerance by checkpointing and rollback recovery.
2. Discussion of the Background
Distributed multinode computing systems are used in large scale computing fields, such as large scale scientific and technical computing or data processing. A distributed multinode computer is required to provide high reliability for the total system.
In a case where each of the computing nodes is used as a server computer in a large scale distributed computing network, it is extremely important to maintain high reliability of the total system.
Checkpointing and rollback recovery is a technique for achieving higher reliability in computing systems. The basic function of checkpointing and rollback recovery is shown in FIG. 26. A processor in the system executes normal data processing while periodically acquiring checkpoints CKP0, CKP1, . . . . When a fault is detected during the data processing DTP1, the processor rolls back the data processing DTP1 to the previous checkpoint CKP1, which was acquired just before the occurrence of the fault. After the cause of the fault is eliminated, the processor restarts the data processing from that checkpoint CKP1.
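The basic function described above can be illustrated by the following minimal sketch. It is given only for explanation and is not the implementation of any particular system; the class `CheckpointedProcessor`, its methods, and the single-counter state are hypothetical names assumed for this example.

```python
import copy


class CheckpointedProcessor:
    """Toy processor whose entire state is one dict (hypothetical model)."""

    def __init__(self):
        self.state = {"counter": 0}
        # The initial snapshot plays the role of CKP0.
        self._checkpoint = copy.deepcopy(self.state)

    def acquire_checkpoint(self):
        # Save a snapshot of the current state (CKP1, CKP2, ...).
        self._checkpoint = copy.deepcopy(self.state)

    def process(self, increment):
        # Normal data processing: mutate the live state.
        self.state["counter"] += increment

    def rollback(self):
        # Cancel everything executed since the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)


def run_demo():
    p = CheckpointedProcessor()
    p.process(5)
    p.acquire_checkpoint()   # CKP1: counter is 5
    p.process(7)             # data processing DTP1: counter is 12
    p.rollback()             # fault detected: return to CKP1
    after_rollback = p.state["counter"]
    p.process(7)             # restart the processing from CKP1
    return after_rollback, p.state["counter"]
```

In this sketch, `run_demo()` returns the counter value immediately after the rollback and the value after the restarted processing, so that the cancelled work and the re-executed work can be compared.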
When checkpointing and rollback recovery computers are used in a large scale data processing system, higher reliability of the total system can basically be achieved by the checkpointing and rollback recovery function in each of the distributed nodes.
Usually, such a large scale distributed multinode computing system includes several hundred to several thousand nodes. The total reliability of the multinode system is obtained by multiplying together the reliabilities of the individual nodes. When the system includes 1024 nodes and each node has a reliability of about 99.99%, the total reliability of the system is 90.27%. As is apparent, the more nodes the system includes, the lower the reliability of the total system becomes. Increasing the number of nodes thus degrades the reliability of the total system.
To remedy this defect, it has been considered to increase the reliability/availability factor of each node in the system. For example, if the availability factor of each node becomes 99.999%, the total availability factor of the system can be improved up to 98.98%.
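The two figures above can be reproduced by the following short calculation. The function name `system_reliability` is a hypothetical name chosen for this example, and the calculation assumes that the nodes fail independently and that the total system requires every node to operate, which is what multiplying the per-node reliabilities implies.

```python
def system_reliability(node_reliability, node_count):
    # Assumption: independent nodes, all of which must operate,
    # so the total is the product of identical per-node reliabilities.
    return node_reliability ** node_count


# 1024 nodes at 99.99% each, then at 99.999% each, in percent.
total_four_nines = system_reliability(0.9999, 1024) * 100
total_five_nines = system_reliability(0.99999, 1024) * 100

print(f"{total_four_nines:.2f}%")  # about 90.27%
print(f"{total_five_nines:.2f}%")  # about 98.98%
```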
However, when checkpointing and rollback recovery type computers are used in a distributed multinode computing system, there is another serious problem to be solved, namely the latency of message communication among a plurality of nodes which are commonly coupled through a communication path. The cause of this latency will be explained with reference to FIGS. 27-29.
FIG. 27 shows a message communication between two nodes. While the checkpointing and rollback recovery computer A is executing normal processing after acquiring a checkpoint CKP1, another computer B sends a request message (a) to the computer A through a communication line. The computer A immediately executes the requested process and immediately sends back a reply (b) to the computer B. After that, at the time T2, a fault FLT1 is detected in the computer A. The computer A rolls back its processing to the previous checkpoint CKP1 by cancelling all of the data processing which has been executed since that checkpoint. In this case, the computer B must resend the request (a) to the computer A during the restarted execution in order to maintain consistency of the state. However, since the computer B has already received the reply from the computer A, it cannot recognize the rollback operation by the computer A and does not resend the request message during the restarted execution. Consequently, an inconsistent state occurs between the computers A and B.
To avoid this inconsistency between the computers, the computer A must inevitably delay sending the reply message. To do so, the computer A holds the executed result of the request in a holding block (c) as shown in FIG. 28. When the computer A acquires the next checkpoint CKP2 at a time T3 in FIG. 29, the message in the block (c) is communicated to the computer B as a reply (b) in response to the request (a). Even if a fault occurs after the time T3, since the computer A rolls back to and restarts from the checkpoint CKP2, the reply (b) can be communicated again during the restarted processing. Accordingly, the consistency between the computers A and B can be maintained.
If a fault FLT1 occurs before the next checkpoint CKP2 is acquired, as shown in FIG. 30, the computer B can recognize the abnormal state of the system by detecting that no reply has arrived from the computer A within a given time interval. Since the computer A cancels all of the data processing related to the request (a) when it rolls back to the checkpoint CKP1, the computer B can send the request (a) again to the computer A during the recovery processing.
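The delayed reply and the case of a fault before the next checkpoint can be illustrated by the following minimal sketch. It is a simplified model under stated assumptions and not the implementation of any particular system: the class `NodeA`, its methods, and the string form of the reply are hypothetical, the time-out in the computer B is represented only by the second call to `handle_request`, and the held replies are simply released when the checkpoint is acquired.

```python
class NodeA:
    """Toy model of the computer A (all names are hypothetical)."""

    def __init__(self):
        self.processed = []        # requests executed so far
        self._ckp_processed = []   # state saved at the last checkpoint
        self.held_replies = []     # the holding block (c)

    def handle_request(self, request):
        # Execute the request, but hold the reply instead of sending it.
        self.processed.append(request)
        self.held_replies.append("reply to " + request)

    def acquire_checkpoint(self):
        # Save the state, then release every held reply for sending.
        self._ckp_processed = list(self.processed)
        released, self.held_replies = self.held_replies, []
        return released

    def rollback(self):
        # Cancel all processing since the last checkpoint,
        # including replies that were held but never sent.
        self.processed = list(self._ckp_processed)
        self.held_replies = []


def demo_fault_before_checkpoint():
    a = NodeA()
    a.acquire_checkpoint()     # CKP1
    a.handle_request("a")      # request (a) arrives; reply is held
    a.rollback()               # fault FLT1 before CKP2
    # The computer B received no reply, so it sends request (a) again.
    a.handle_request("a")
    released = a.acquire_checkpoint()   # CKP2: reply (b) is sent now
    return released, a.processed
```

Because no reply leaves the holding block before a checkpoint is acquired, the computer B in this sketch never observes a result that the computer A later cancels.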
On average, a message communicated between nodes in a distributed multinode computing system is delayed by about half of one checkpointing interval. In practice, since one checkpointing interval takes several milliseconds, each message communication between two computers incurs a delay of more than one millisecond. This delay of message communication between nodes degrades the total performance of the multiprocessor system. In particular, when message communication among the nodes occurs frequently in the system, the total performance of the multiprocessor system is severely degraded by the overhead of checkpointing.
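The half-interval figure can be checked by the following short calculation. It is a sketch under assumptions that are not stated in the description above: replies are assumed to become ready at times spread uniformly over the checkpointing interval, each reply is held until the end of that interval, and the 4 millisecond interval is a hypothetical value chosen only as an example of "several milliseconds". The function name `mean_hold_delay` is likewise hypothetical.

```python
def mean_hold_delay(interval_ms, samples=1000):
    # Average the holding time over evenly spaced ready times
    # (midpoint sampling of a uniform distribution over the interval).
    total = 0.0
    for i in range(samples):
        ready_at = (i + 0.5) * interval_ms / samples
        total += interval_ms - ready_at   # held until the next checkpoint
    return total / samples


# Hypothetical 4 ms checkpointing interval: the mean delay is half of it.
print(mean_hold_delay(4.0))
```

Under these assumptions the mean holding time equals half of the interval, so any interval longer than two milliseconds yields an average delay of more than one millisecond per message.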