1. Field of the Invention
This invention relates to a highly reliable checkpoint restart type fault tolerant computer system and to a method for detecting a program error without lowering the efficiency of the system.
In particular, this invention relates to a checkpoint restart type fault tolerant computer system having a higher operational speed and including a plurality of error status detecting steps in a program.
More particularly, this invention relates to a highly reliable checkpoint restart type computer system which operates at a higher speed during normal operation and which can quickly detect a program error status when one occurs during a debugging operation.
This invention further relates to a checkpoint based fault tolerant computer system which is suitable for detecting a program error status in a mutual exclusion process between parallel processing operations in the system.
2. Discussion of the Background
Recently, fault tolerant computer systems have become available for maintaining high reliability of a computer system.
There are two types of fault tolerant computer systems. One avoids faults due to hardware errors in the system. The other detects faults due to a software error status in a program of the system.
In order to avoid hardware errors, a fault tolerant computer is usually configured as a dual system. If a fault due to a hardware error occurs, system operation is immediately switched to the backup system. This dual construction is useful for avoiding a fault due to a hardware error.
However, even in such a dual system, it is impossible to avoid a fault due to a program error or a programming bug.
Accordingly, it is important to detect an error status in a program as early as possible. In order to detect a programming error or a bug, it is useful to insert a plurality of error status detection (hereinafter referred to as "ESD") steps in a program.
In a checkpoint restart type fault tolerant computer system, a programmed process is executed, and when an error is detected during execution between two checkpoints, the process is rolled back to the prior checkpoint and execution is restarted from there.
As noted above, inserting a plurality of ESD steps in a program is useful for detecting a programming error or a bug.
However, executing many ESD steps lowers the operational speed of the system during normal operation.
In order to avoid this reduction in operational speed, it is desirable to execute as few ESD steps as possible during normal operation.
On the other hand, if a fault due to a programming bug appears, it is desirable to execute as many ESD steps as possible in order to locate the bug promptly.
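The trade-off described above can be sketched as follows. This is an illustration under stated assumptions, not the mechanism of this invention: the `EsdChecker` class, its `debug` flag, the `essential` parameter, and the `transfer` example are hypothetical names chosen for the sketch.

```python
class EsdError(Exception):
    """Raised when an ESD step finds an error status (hypothetical name)."""


class EsdChecker:
    """Runs ESD steps selectively (illustrative sketch only)."""

    def __init__(self, debug=False):
        self.debug = debug   # True: execute every ESD step
        self.executed = 0    # number of ESD steps actually executed

    def check(self, condition, message, essential=False):
        # In normal operation only essential steps run. The condition is
        # passed as a callable so that a skipped step costs almost nothing.
        if not (essential or self.debug):
            return
        self.executed += 1
        if not condition():
            raise EsdError(message)


def transfer(balances, src, dst, amount, esd):
    """Move amount between two accounts, with ESD steps inserted."""
    esd.check(lambda: amount >= 0, "negative amount", essential=True)
    esd.check(lambda: src in balances and dst in balances, "unknown account")
    balances[src] -= amount
    balances[dst] += amount
    esd.check(lambda: balances[src] >= 0, "overdraft")
```

With `debug=False`, a call to `transfer` executes one ESD step; with `debug=True` it executes up to three and can report an overdraft that the normal mode would not notice.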