1. Field of the Invention
The present invention relates generally to a multi-node computing system and more specifically to a software-fault tolerant multi-node computing system and a method of enhancing the fault tolerance of a computing system at the level of its software element.
2. Description of the Related Art
Fault tolerant computing systems are known in the art as a means of coping with system errors. Typically, a fault tolerant computing system is one in which a plurality of identical modules are connected to a common bus. Each application is divided into a number of tasks of suitable size, and each module is assigned the same tasks. The tasks are executed simultaneously by the modules. Through the common bus, each module reads the results of execution by the other modules to which the same tasks are assigned and takes a majority decision to mask a system error in any of these modules. However, the presence of a software error still results in a premature ending of the program, known as an abnormal end.
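By way of illustration only, the majority decision described above may be sketched as follows. The names `majority_vote` and `run_on_modules`, and the modelling of each module as a plain Python callable, are assumptions made for this sketch and do not appear in the prior art described.

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of modules, or None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


def run_on_modules(task, modules):
    """Run the same task on every module and vote on the outcomes."""
    return majority_vote([module(task) for module in modules])


# Three hypothetical modules; the third suffers a fault and reports a wrong value.
def healthy(x):
    return x * x


def faulty(x):
    return x * x + 1


voted = run_on_modules(7, [healthy, healthy, faulty])
print(voted)  # 49
```

In this sketch the vote masks an error confined to one module. If every module executes the same defective program, all results agree on the same wrong value or all executions fail together, so the vote provides no protection.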
The technical paper “Fault Tolerance by Design Diversity: Concepts and Experiments” (A. Avizienis and John P. J. Kelly, IEEE Computer Society, Vol. 17, No. 8, August 1984, pages 67–80) describes a fault tolerance scheme in which a plurality of functionally identical programs are independently designed by different developers. These independently designed programs are executed in parallel. A majority decision taken on the results of the parallel computations masks a software error in any one of these programs. However, the development cost of this technique is prohibitively high.
U.S. Pat. No. 5,712,971 discloses a method, known as a checkpointing scheme, for reconstructing the state of an interrupted computation in a parallel processing system. According to this scheme, an application program is executed in distinct execution phases by issuing commands and replies. All such commands and replies are recorded, and the end state of each successfully completed execution phase is saved in memory. If a failure is detected in any of the execution phases, the last-saved end state of the execution phase preceding the detection of the failure is restored, and all recorded commands and replies from the beginning of execution of the application up through the restored last-saved end state are recapitulated.
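The checkpointing scheme summarized above may be sketched, under stated assumptions, as follows. The class `CheckpointedRun`, its methods, and the modelling of commands as functions that mutate a state dictionary and return a reply are hypothetical and are not taken from the cited patent; the sketch shows only one plausible reading of the record, save, and restore steps.

```python
import copy


class CheckpointedRun:
    def __init__(self, initial_state):
        self.state = copy.deepcopy(initial_state)
        self.log = []          # every (command name, reply) pair, in order
        self.checkpoints = []  # (saved end state, log length) per completed phase
        self._save()           # the state before any phase has run

    def _save(self):
        self.checkpoints.append((copy.deepcopy(self.state), len(self.log)))

    def run_phase(self, commands):
        """Execute one phase; return True on success, False after a rollback."""
        try:
            for command in commands:
                reply = command(self.state)
                self.log.append((command.__name__, reply))
        except Exception:
            self.restore()
            return False
        self._save()
        return True

    def restore(self):
        """Restore the last-saved end state and return the log up to that point."""
        saved_state, log_length = self.checkpoints[-1]
        self.state = copy.deepcopy(saved_state)
        del self.log[log_length:]  # drop entries recorded by the failed phase
        return list(self.log)      # the entries that would be recapitulated


def deposit(state):
    state["balance"] += 10
    return state["balance"]


def crash(state):
    state["balance"] = -1  # partial update made before the failure
    raise RuntimeError("simulated failure")


run = CheckpointedRun({"balance": 0})
ok1 = run.run_phase([deposit, deposit])
ok2 = run.run_phase([deposit, crash])
print(ok1, ok2, run.state, run.log)
```

After the second phase fails, the state returns to the end state of the first phase, and the log holds only the commands and replies recorded up through that state.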
U.S. Pat. No. 5,440,726 discloses a fault recovery technique that manages checkpointing data. When a failure is detected, the items of checkpointing data are retried one by one in search of the best checkpointing data.
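A minimal sketch of retrying checkpointing data item by item is given below. The function `recover`, the `try_restart` callback, the newest-first search order, and the reading of "best" as the most recent item from which a restart succeeds are all assumptions of this sketch; the cited patent may define them differently.

```python
def recover(checkpoints, try_restart):
    """Try saved checkpoints one at a time; return the first that restarts cleanly."""
    for checkpoint in reversed(checkpoints):  # assumption: newest item is tried first
        if try_restart(checkpoint):
            return checkpoint
    return None  # no saved item allowed a successful restart


# Hypothetical checkpoint records; the newest one already contains the fault.
saved = [
    {"id": 1, "corrupt": False},
    {"id": 2, "corrupt": False},
    {"id": 3, "corrupt": True},
]


def restart_succeeds(checkpoint):
    return not checkpoint["corrupt"]


chosen = recover(saved, restart_succeeds)
print(chosen)  # {'id': 2, 'corrupt': False}
```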
Also known in the computing art is the method of making regular backup copies of hard-disk data on a separate storage medium. The checkpointing scheme differs from the backup method in that it additionally saves dynamic data such as internal variables and the running point of execution. To ensure precise checkpointing, it is necessary to prevent the system from saving an internal variable that is currently changing and to cause the system to save the running point of execution.
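One way to satisfy both requirements, offered only as a sketch, is to guard updates and checkpoints with the same lock and to store a step counter alongside the variables. The class `Worker`, the lock-based approach, and the use of `next_step` as a stand-in for the running point of execution are assumptions of this sketch and are not drawn from the prior art described.

```python
import copy
import threading


class Worker:
    def __init__(self):
        self._lock = threading.Lock()
        self.variables = {"total": 0, "count": 0}
        self.next_step = 0  # stands in for the running point of execution

    def step(self, value):
        # The two variables change together; the lock keeps a checkpoint
        # from observing one updated and the other not.
        with self._lock:
            self.variables["total"] += value
            self.variables["count"] += 1
            self.next_step += 1

    def checkpoint(self):
        # Taken under the same lock, so no variable is saved mid-change.
        with self._lock:
            return {
                "variables": copy.deepcopy(self.variables),
                "next_step": self.next_step,
            }


worker = Worker()
for value in (3, 4):
    worker.step(value)
snapshot = worker.checkpoint()
worker.step(5)  # later work does not alter the saved snapshot
print(snapshot)
```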
Another shortcoming of the prior art checkpointing scheme is that, since a significant amount of time elapses between the instant the cause of a failure occurs and the instant the failure is detected, it is difficult to determine which part of the saved checkpointing data is to be restored. To identify the potential cause of the problem, the usual practice is for system engineers to search through the logs delivered from the computer and through the data, dumped at the instant the failure is detected, that indicates the running state of the program. Appropriate checkpointing data is then selected, based on the determined cause, to recover the system from the failure. Because this approach is time-consuming, it requires a substantial number of man-hours.
Therefore, there exists a need for a fault tolerant computing system of enhanced tolerance to software errors.