1. Field of the Invention
The present invention relates generally to a multi-node computing system and more specifically to a software-fault tolerant multi-node computing system and a method of enhancing the fault tolerance of a computing system at the level of its software element.
2. Description of the Related Art
Fault tolerant computing systems are known in computer technology as a means of coping with system errors. Usually, a fault tolerant computing system is one in which a plurality of identical modules are connected to a common bus. Each application is divided into a number of tasks of suitable size and each module is assigned the same task. The tasks are executed simultaneously by the modules. Through the common bus, each module reads the results of the execution performed by the other modules to which the same task is assigned and takes a majority decision to mask a system error in any of these modules. However, because all modules run the same program, the presence of a software fault results in a premature ending of the program, known as an abnormal end.
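The majority decision described above can be illustrated with a minimal sketch. The function name `majority_decision` and the choice of a strict majority (more than half of the modules) are assumptions made for illustration and are not taken from any cited system.

```python
from collections import Counter


def majority_decision(results):
    """Return the value reported by a strict majority of modules, or None.

    `results` holds one result per module for the same task (hypothetical
    representation; the prior art exchanges results over a common bus).
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # A strict majority is required to mask a faulty module.
    return value if count > len(results) / 2 else None


if __name__ == "__main__":
    # Three modules ran the same task; one module produced a faulty result.
    print(majority_decision([42, 42, 17]))
    # No strict majority exists, so the error cannot be masked.
    print(majority_decision([1, 2, 3]))
```

Note that such a vote masks an error only when the faulty modules are in the minority; a software fault common to all identical modules defeats it.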
In a fault tolerant client-server system as described in Japanese Patent Publication 1995-306794, a client process transmits job requests to all server processes and test execution commands to all error-detection processes. Each server processes the job request and runs a test on the process and returns a response to the client process. On receiving all responses from the servers, the client process produces an output based on a majority decision on the process results as well as on the test results.
A technical paper “Fault Tolerance by Design Diversity: Concepts and Experiments” (A. Avizienis and John P. J. Kelly, IEEE Computer Society. Vol. 17, No. 8, August 1984, pages 67-80) describes a fault tolerance scheme in which a plurality of different versions of an identical application program are independently designed by different developers. These independently designed different software versions are simultaneously executed in parallel. By taking a majority decision from the results of the parallel computations, a software error in one of these programs is masked. However, the development cost of this version diversity technique is prohibitively high.
Japanese Patent Publication 1994-342369 discloses a version diversity parallel-running computing system. To develop many different versions of a program at as low a cost as possible, an application program is divided into a number of software modules according to different functions. For each software module, a number of different versions are developed and different versions of different functional modules are combined. For example, if a program is divided into modules A, B and C and two versions are developed for each module, the result is modules A1, A2, B1, B2, C1 and C2. These modules are combined to develop a total of eight sets of modules {A1, B1, C1}, {A1, B1, C2}, {A1, B2, C1}, {A1, B2, C2}, {A2, B1, C1}, {A2, B1, C2}, {A2, B2, C1} and {A2, B2, C2} at a cost equivalent to the cost of developing two versions of a program.
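The eight sets in the example above are the Cartesian product of the per-module version lists. The following sketch enumerates them; the dictionary layout and the function name `combine_versions` are illustrative assumptions, not part of the cited publication.

```python
from itertools import product

# Two independently developed versions for each functional module
# (names follow the example in the text).
versions = {
    "A": ["A1", "A2"],
    "B": ["B1", "B2"],
    "C": ["C1", "C2"],
}


def combine_versions(versions):
    """Return every set that takes exactly one version of each module."""
    return [tuple(combo) for combo in product(*versions.values())]


if __name__ == "__main__":
    for combo in combine_versions(versions):
        print(combo)
```

With m modules and v versions per module, v to the power m program variants are obtained while only v versions of each module are written.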
A technical article “Distributed Execution of Recovery Blocks: An Approach for Uniform Treatment of Hardware and Software Faults in Real-Time Applications” (K. H. Kim, Howard O. Welch, IEEE Transactions on Computers, Vol. 38, No. 5, pp. 626-636, 1989) describes a recovery block system to verify the processing result of a program. As in version diversity systems, a number of different versions of a program are developed, and a different rank is assigned to each developed version. The program of the highest rank is executed first and the result of the execution is checked with an algorithm called an “acceptance test”. If the result is verified, it is selected. If not, the program of the second highest rank is then executed and its result is selected.
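The recovery block scheme can be sketched as follows. This is a simplified, single-node illustration rather than the distributed scheme of the cited article. It applies the acceptance test to every version in rank order, which is an assumption that goes beyond the two-version description above, and all function names are hypothetical.

```python
def run_recovery_block(ranked_versions, acceptance_test, x):
    """Run versions in rank order and return the first accepted result.

    `ranked_versions` is ordered from the highest rank to the lowest.
    """
    for version in ranked_versions:
        try:
            result = version(x)
        except Exception:
            # A crashing version is treated like a failed acceptance test.
            continue
        if acceptance_test(x, result):
            return result
    raise RuntimeError("no version passed the acceptance test")


def primary_sqrt(x):
    # Highest-rank version, deliberately faulty for the illustration.
    return x / 2


def alternate_sqrt(x):
    # Second-rank version.
    return x ** 0.5


def sqrt_acceptance(x, result):
    # Acceptance test: the squared result must reproduce the input.
    return abs(result * result - x) < 1e-9


if __name__ == "__main__":
    print(run_recovery_block([primary_sqrt, alternate_sqrt], sqrt_acceptance, 9.0))
```

Because the alternate runs only after the primary has finished and failed its test, a fault is discovered no earlier than the end of the primary's execution.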
Also known is the ping-pong system of conventional computers which uses the heart-beat signal for regularly checking their operation.
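A heart-beat check of this kind can be sketched as a comparison between the current time and the time at which each node last answered. The function name `find_silent_nodes`, the timestamp dictionary and the timeout value are assumptions made for illustration only.

```python
def find_silent_nodes(last_pong, now, timeout):
    """Return the nodes whose last "pong" is older than `timeout` seconds.

    `last_pong` maps a node name to the time (in seconds) at which its
    most recent "pong" response was received.
    """
    return sorted(node for node, t in last_pong.items() if now - t > timeout)


if __name__ == "__main__":
    last_pong = {"node1": 10.0, "node2": 4.0, "node3": 9.5}
    # node2 has not answered for 6.5 seconds, exceeding the 3-second timeout.
    print(find_silent_nodes(last_pong, now=10.5, timeout=3.0))
```

Such a check reports only a node that has stopped responding; a node that still answers but is computing incorrectly is not flagged.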
However, one shortcoming of the prior art fault tolerance techniques is that, since the failure detection is performed only after results are obtained from all processing nodes, an abnormal node cannot quickly be detected before all processing results are obtained. If the ping-pong system is used, a node failure may be detected instantly when there is no “pong” response at all. However, it is impossible to provide an early warning signal when an abnormal node is present in a multi-node computing system.
Therefore, there exists a need for a fault tolerant parallel-running computing system capable of detecting an abnormal node well before all processing nodes produce their results. Further, a failure to discover an early symptom of a node failure would result in a useless consumption of system resources as well as an increase in the execution cost. If the potential cause of a problem goes unnoticed for an extended period of time, the problem may grow, propagate to properly operating nodes of the system and adversely affect its fault tolerance.