A common technique for tolerating faults in distributed systems is to move a failed process or job to another healthy machine or node. (Hereinafter, "process" denotes a program, task, job, or any other running unit that can perform checkpointing, and "node" denotes a machine or node.) Process migration can move a running process from one node to another (Non-Patent Literature NPL 1). However, migration alone cannot tolerate hardware failures, because the processes on the failed hardware cannot be retrieved. Process migration therefore has only a limited effect on fault tolerance, and a similar technique, checkpointing/recovery, has emerged (Non-Patent Literature NPL 2).
Existing checkpointing/recovery techniques in distributed systems can be roughly classified into two classes: disk-based and diskless. The mechanism of disk-based checkpointing/recovery is surveyed in NPL 2. In distributed and parallel computer systems such as clusters, disk-based checkpointing/recovery is implemented in a distributed manner, as in Non-Patent Literatures NPL 3 and NPL 4. To eliminate the delay caused by disk access, diskless checkpointing was developed; it was first presented in Non-Patent Literature NPL 5 and matured in Non-Patent Literature NPL 6. An example of using diskless checkpointing/recovery in cluster systems is described in Non-Patent Literature NPL 7.
Incremental checkpointing (described in Non-Patent Literature NPL 8) and probabilistic checkpointing (described in Non-Patent Literature NPL 9) can effectively reduce the checkpointing time and the data transmission cost. Finally, multicast is also used in this invention, but the multicast here is much simpler than that in group communication, because the source and the destinations of the communication are in a typical master/slave relation.
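As an illustration of why incremental checkpointing reduces the amount of data to be saved and transmitted, the following is a minimal sketch in Python. It assumes that changed blocks are detected by comparing per-block SHA-256 digests against those of the previous checkpoint; this detection scheme, the block size, and all function names are assumptions of the sketch and are not taken from NPL 8 or NPL 9.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical block size in bytes, kept tiny for illustration


def split_blocks(state: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut the process state into fixed-size blocks (the last may be shorter)."""
    return [state[i:i + block_size] for i in range(0, len(state), block_size)]


def incremental_checkpoint(
    state: bytes, previous_digests: List[bytes]
) -> Tuple[Dict[int, bytes], List[bytes]]:
    """Return only the blocks that changed since the previous checkpoint,
    together with the new digest list to keep for the next round."""
    blocks = split_blocks(state)
    digests = [hashlib.sha256(block).digest() for block in blocks]
    changed = {
        index: block
        for index, (block, digest) in enumerate(zip(blocks, digests))
        # A block is saved if it is new or its digest differs from last time.
        if index >= len(previous_digests) or digest != previous_digests[index]
    }
    return changed, digests


# First checkpoint: no previous digests, so every block is saved.
full, digests = incremental_checkpoint(b"aaaabbbbcccc", [])
# Second checkpoint: only the middle block was modified.
delta, digests = incremental_checkpoint(b"aaaaXXXXcccc", digests)
assert delta == {1: b"XXXX"}
```

In this sketch only one of three blocks is saved at the second checkpoint, so both the checkpointing time and the transmitted data shrink in proportion to the unchanged part of the state.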
Regardless of which checkpointing/recovery technique is currently used in a distributed system, the checkpoint data of a process must be transferred to a newly selected healthy node so that the process can continue running. Two examples are given here to explain the data transfer. In Non-Patent Literatures NPL 3 and NPL 4, the whole checkpoint data is stored in separate fractions on distributed nodes, with the parity computed from the data fractions in the neighboring nodes. Thus, each node stores fractions of the data together with the parity of the data fractions. When a permanent failure happens in a node, the data fraction is transferred to its neighbor node recursively, and in the neighbor node the data fraction and the parity are used to calculate the original data fraction.
Although the nodes perform checkpointing/recovery in parallel, the parity calculation for each data fraction must be performed sequentially. Therefore, the recovery time is equal to the transmission time of the whole checkpoint data of a process plus the calculation time of the parities. Here, the time for rebuilding the running context of the program in memory is ignored, since no disk operation is involved and this rebuilding time is usually very short.
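The parity-based reconstruction described above can be illustrated by the following minimal Python sketch. It assumes a single XOR parity over equal-length data fractions, which is one common parity scheme and not necessarily the exact scheme of NPL 3 or NPL 4; the function names and sample data are hypothetical.

```python
from functools import reduce
from typing import List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def compute_parity(fractions: List[bytes]) -> bytes:
    """Parity of all data fractions (assumed to have equal length)."""
    return reduce(xor_bytes, fractions)


def recover_fraction(surviving: List[bytes], parity: bytes) -> bytes:
    """Rebuild the single lost fraction from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)


# Checkpoint data cut into three fractions, one per node (sample data).
fractions = [b"ckpt", b"data", b"node"]
parity = compute_parity(fractions)

# The node holding fraction 1 fails permanently.
lost_index = 1
surviving = fractions[:lost_index] + fractions[lost_index + 1:]

recovered = recover_fraction(surviving, parity)
assert recovered == fractions[lost_index]
```

Note that recovery in this sketch needs every surviving fraction and the parity to reach the recovering node before the XOR can complete, which is why the transmission time and the parity calculation time both appear in the recovery time.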
In Non-Patent Literature NPL 6, the time delay from the data transmission and the parity computation is the same as that of Non-Patent Literatures NPL 3 and NPL 4; the only difference is that the system of Non-Patent Literature NPL 6 is diskless. In Non-Patent Literature NPL 7, the data of each process is stored in local memory and the parity is calculated from all the data. If a failure happens in a node, the remaining data and the parity are transferred to a newly selected node.
In Patent Literature PTL 1 (Japanese Laid-open patent publication No. 2009-129409), it is assumed that all computers are classified into business computers and idle computers. The business computers run the programs, and the idle computers only store the data for recovery. Patent Literature PTL 1 mentions that the checkpoint file of a process can be stored in one computer by unicast or in all the computers by multicast, but it gives no information on when to use multicast or how many other computers are needed. In Patent Literature PTL 1, the checkpoint can only be stored in specific computers, and the checkpoint data cannot be cut into pieces and re-integrated in a newly selected computer. The recovery cost is therefore not significantly reduced.
Patent Literature PTL 2 (Japanese Laid-open patent publication No. H10-275133) describes how checkpoint and recovery information is collected, stored, handled and reported. There is only one copy of the checkpoint data, and the recovery method is deterministic. The recovery cost is also not significantly reduced.
Patent Literature PTL 3 (Japanese Laid-open patent publication No. H08-314875) describes a method for shared-memory distributed systems. The method proposed in this patent is different from the above methods because this method only guarantees that the average checkpoint/recovery cost is significantly reduced, and the checkpoint data in this invention can be cut into pieces and re-integrated. In other words, this is a randomized method.
The recovery time is at least the transmission time of the data of a process plus the computation time of the parity. Although the computation time of the parity is argued to be short, it is in fact too long in wireless and mobile distributed computing, where the data transfer speed is slow and the computing power is weaker than that of a typical desktop.
The recovery time is much longer when network bandwidth and latency deteriorate, as in large-scale distributed systems, Grids, Clouds, P2P (Peer-to-Peer) systems, and wireless and mobile distributed systems. This problem cannot be solved by the existing techniques, since logically the failed process cannot be recovered without transferring all the checkpoint data to the new node. The additional parity computation delays the recovery further. Let tD denote the disk access time, tN the data transmission time on the network, and tP the parity computation time; the recovery time tR can then be represented as tR = tD + tN + tP. For diskless checkpointing/recovery, tR = tN + tP.
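The recovery-time relation above can be made concrete with the following minimal Python sketch. The model of tN as latency plus checkpoint size divided by bandwidth, and all numeric values, are hypothetical assumptions chosen only for illustration; they are not measurements from any of the cited literature.

```python
def transmission_time(size_bytes: float, bandwidth_bytes_per_s: float,
                      latency_s: float = 0.0) -> float:
    """Assumed simple model of tN: fixed latency plus size over bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def recovery_time(t_n: float, t_p: float, t_d: float = 0.0) -> float:
    """tR = tD + tN + tP; pass t_d = 0 for diskless checkpointing/recovery."""
    return t_d + t_n + t_p


# Hypothetical figures: a 100 MB checkpoint over a 10 MB/s link.
t_n = transmission_time(size_bytes=100e6, bandwidth_bytes_per_s=10e6)
t_p = 0.5  # assumed parity computation time in seconds
t_d = 2.0  # assumed disk access time in seconds

disk_based = recovery_time(t_n, t_p, t_d)  # tR = tD + tN + tP
diskless = recovery_time(t_n, t_p)         # tR = tN + tP

print(f"disk-based tR = {disk_based:.1f} s, diskless tR = {diskless:.1f} s")
```

With these assumed numbers the transmission term dominates in both cases, which reflects the point made above: removing the disk access does not help much when the whole checkpoint data must still cross a slow network.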