1. Field of the Invention
The present invention relates to a method and apparatus for recovering from faults in a checkpointing and roll back type fault tolerant computing system, which can dynamically avoid faults by rolling back the processing to the checkpoint that was acquired just before the software fault occurred.
More particularly, the present invention relates to a method and apparatus for recovering from software faults in a checkpointing and roll back recovery type fault tolerant computing system.
In particular, the present invention relates to a method and apparatus for dynamically avoiding a system down condition due to software faults which occur in a kernel portion of the operating system of the checkpointing and roll back recovery type fault tolerant computing system.
2. Discussion of the Background
Various kinds of fault recovery techniques have been developed for achieving higher reliability in computing systems. Much higher reliability is now required, with quick recovery from faults and a minimum amount of disruption caused by the recovery process.
Checkpointing and roll back recovery is one technique for achieving such high reliability through quick recovery from faults in fault tolerant type computing systems.
In the checkpointing and roll back recovery type fault tolerant computing system, during normal data processing, execution information of the data processing is stored at a certain time interval so that the data processing can be restarted from a particular point when a fault occurs in the processing. This operation of storing the execution information at a particular point is referred to as checkpoint acquisition.
FIG. 8 shows the principle of checkpointing. Usually, checkpoints (1) and (2) are acquired periodically at a certain time interval. If a transient hardware fault (3) occurs and interrupts the processing being executed, the state of the interrupted process or thread is rolled back (4) to the previous checkpoint (2), which was acquired just before the occurrence of the fault (3), and the processing is restarted from that checkpoint (5).
The execution information stored by checkpointing includes the contents of the general purpose registers and the data in a cache memory or a main memory.
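The principle of FIG. 8 can be illustrated with a minimal Python sketch. It is not taken from the system described here: the class `CheckpointedTask`, the exception `TransientFault`, the `interval` and `fault_at` parameters, and the dictionary standing in for registers and memory are all hypothetical names chosen for illustration. The sketch assumes the fault is transient, so it does not recur when the interrupted step is retried.

```python
import copy


class TransientFault(Exception):
    """Hypothetical stand-in for a transient hardware fault."""


class CheckpointedTask:
    def __init__(self, interval):
        self.interval = interval  # steps between checkpoint acquisitions
        # The dictionary stands in for register and memory contents.
        self.state = {"step": 0, "total": 0}
        self.checkpoint = copy.deepcopy(self.state)

    def acquire_checkpoint(self):
        # Store the execution information at this point.
        self.checkpoint = copy.deepcopy(self.state)

    def roll_back(self):
        # Restore the state saved at the most recent checkpoint.
        self.state = copy.deepcopy(self.checkpoint)

    def run(self, n_steps, fault_at=None):
        rollbacks = 0
        pending_fault = fault_at
        while self.state["step"] < n_steps:
            try:
                next_step = self.state["step"] + 1
                if next_step == pending_fault:
                    pending_fault = None  # transient: gone on retry
                    raise TransientFault()
                self.state["total"] += next_step
                self.state["step"] = next_step
                if next_step % self.interval == 0:
                    self.acquire_checkpoint()
            except TransientFault:
                self.roll_back()  # restart from the last checkpoint
                rollbacks += 1
        return self.state["total"], rollbacks


task = CheckpointedTask(interval=3)
total, rollbacks = task.run(10, fault_at=8)
print(total, rollbacks)
```

In this run the fault interrupts step 8, the state returns to the checkpoint taken after step 6, and steps 7 and 8 are executed again before the task completes.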
Generally, causes of a computer system down condition are classified into hardware faults and software faults. The checkpointing and restarting type fault tolerant computing system is extremely useful for avoiding hardware faults. For example, it is useful for avoiding a main memory ECC error.
In contrast, it is difficult to recover from a system down condition caused by software faults. In particular, it is most difficult to avoid a system down condition caused by bugs which appear within a kernel portion of the operating system of the computer or within an application process.
Usually, such software faults bring the computer system down or leave all of the application processes stuck. Software faults caused by bugs in an application process also usually bring the process to an abnormal end.
To avoid these types of system down conditions caused by software faults, a progressive retrial method for a software module has been proposed. The method improves the fault tolerance of a system which includes a plurality of application processes by providing a special software library. The software library periodically stores the state of the application process in a nonvolatile memory (i.e., checkpointing).
By registering each transmitted message in a sender log file and each received message in a receiver log file, it becomes possible to restart the process execution from a checkpoint when a fault occurs.
If the re-execution does not start, the rolling back area is gradually expanded with reference to the messages registered in the sender log file.
Further, it is possible to avoid timing dependent software faults by changing the registering order of the received messages in the receiver log file, within the scope that the total coordination of the received messages can be maintained.
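The idea of replaying logged messages in a different order can be sketched as follows. This is only an illustration under stated assumptions, not the proposed library: the message format `(sender, sequence number, operation)`, the handler `apply_message` with its deliberately planted ordering bug, and the rule that "total coordination" means each sender's messages keep their original relative order are all assumptions made here. Trying every permutation is used only because it is short to write.

```python
from itertools import permutations


class SoftwareFault(Exception):
    """Hypothetical timing dependent software fault."""


def apply_message(state, msg):
    # Hypothetical handler with a planted bug: a "read" fails
    # if it is processed before any "write" has been handled.
    sender, seq, op = msg
    if op == "read" and not state["written"]:
        raise SoftwareFault("read processed before write")
    if op == "write":
        state["written"] = True
    state["handled"] += 1


def preserves_sender_order(order):
    # Assumed consistency rule: messages from the same sender
    # must stay in increasing sequence-number order.
    last = {}
    for sender, seq, _ in order:
        if seq < last.get(sender, -1):
            return False
        last[sender] = seq
    return True


def replay_with_reordering(checkpoint, receiver_log):
    # Try the logged order first, then other consistent orders.
    for order in permutations(receiver_log):
        if not preserves_sender_order(order):
            continue
        state = dict(checkpoint)  # roll back to the checkpoint
        try:
            for msg in order:
                apply_message(state, msg)
            return state, list(order)
        except SoftwareFault:
            continue  # this ordering still hits the fault
    raise SoftwareFault("no consistent ordering avoids the fault")


log = [("B", 0, "read"), ("A", 0, "write"), ("A", 1, "write")]
state, order = replay_with_reordering({"written": False, "handled": 0}, log)
print(order)
```

With this log, the original order fails because the read from B arrives first. The first consistent reordering places the write from A ahead of it, and the replay then completes.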
The proposed method achieves fault tolerance of the computer system by using the software library in the application process. Its main means of avoiding timing dependent software faults is to change the order of the communication messages transmitted or received between nodes, within the scope that the total coordination of the received messages can be maintained. The method is therefore useful for avoiding timing dependent software faults.
However, the proposed method cannot recover from software faults at the application process level when the software fault occurs in the operating system kernel, because the system itself is then brought down.