A fault-tolerant system is a known technique for continuing the service provided by a running computer by masking a hardware fault even when one occurs in that computer. One example is a fault-tolerant system that uses the lockstep scheme. In the lockstep scheme, the hardware components of the computer are provided redundantly as a plurality of systems. The systems, each including identical hardware components, perform the same operation in synchronism at the same clock frequency. Performing the same operation in synchronism at the same clock frequency will be referred to as a lockstep operation hereinafter, and the status in which such an operation is performed will be referred to as a lockstep status hereinafter. The status in which the lockstep status cannot be maintained due, for example, to a fault will be referred to as loss of lockstep hereinafter. In the lockstep scheme, even when one of the plurality of systems suffers a fault and causes loss of lockstep, the processing can be continued by the remaining normal systems.
An exemplary fault-tolerant system which uses such a lockstep scheme is disclosed in reference 1 (Japanese Unexamined Patent Application Publication No. 2009-205630).
The fault-tolerant system disclosed in reference 1 includes a plurality of systems, each including identical hardware components. Each system includes a processor system including a CPU (Central Processing Unit), an I/O system including I/O (input/output) devices such as a storage device and a network device, and a controller. The processor systems of the respective systems perform a lockstep operation. The I/O systems of the respective systems maintain redundancy with one another through mirroring processing that uses the CPU of the processor system.
The controller determines whether the operations of the processor systems have become inconsistent. For example, the controller compares data to be transferred from the self-system processor system to the self-system I/O system with data to be transferred from the different-system processor system to the self-system I/O system. When these data are inconsistent, the controller separates, from the fault-tolerant system, a processor system selected in accordance with a predefined method.
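The comparison performed by the controller can be illustrated with a minimal sketch. The sketch assumes that each transfer is represented by a destination address and a payload; the names `Transfer`, `check_lockstep`, and `choose_system_to_separate` are hypothetical and do not appear in reference 1.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Transfer:
    """Hypothetical record of one transfer toward the self-system I/O system."""
    address: int
    payload: bytes


def check_lockstep(
    self_system: Transfer,
    different_system: Transfer,
    choose_system_to_separate: Callable[[], str],
) -> Optional[str]:
    """Compare the two transfers and report which system to separate.

    Returns None when the transfers match (lockstep status maintained).
    Otherwise returns the name chosen by the predefined method, which is
    passed in as a callback because the source does not fix that method here.
    """
    if self_system == different_system:
        return None
    return choose_system_to_separate()
```

In this sketch the selection policy is passed in as a callback, so the comparison itself stays independent of how the processor system to be separated is chosen.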
An inconsistency may occur in the data when, for example, data output from the CPU is partially corrupted or the timing of the data shifts. An inconsistency may also occur when an abnormality occurs within a processor system performing the lockstep operation. A fault may be detected only transiently when, for example, memory contents are corrupted by external electrical noise, cosmic rays, or other types of radiation. Even in this case, the processor system in which the fault has been detected is separated from the fault-tolerant system. Various methods have been proposed for selecting the processor system to be separated. For example, a method is available that calculates a level of priority for each processor system based on its MTBF (Mean Time Between Failures) or its frequency of occurrence of faults, and determines the processor system to be separated based on the calculated levels of priority.
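A priority-based selection of this kind can be sketched as follows. The source does not give the priority formula, so the sketch assumes that MTBF is estimated as operating time divided by the number of recorded faults and that the processor system with the lowest MTBF is separated; `ProcessorSystem`, `mtbf`, and `select_system_to_separate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProcessorSystem:
    """Hypothetical fault history of one processor system."""
    name: str
    operating_hours: float
    fault_count: int


def mtbf(system: ProcessorSystem) -> float:
    """Estimate MTBF in hours (assumed: operating time / number of faults)."""
    if system.fault_count == 0:
        return float("inf")  # no recorded fault: treated as most reliable
    return system.operating_hours / system.fault_count


def select_system_to_separate(systems: Sequence[ProcessorSystem]) -> ProcessorSystem:
    """Pick the system with the lowest MTBF (assumed priority rule).

    On a tie, min() returns the first such system in the given order.
    """
    if not systems:
        raise ValueError("at least one processor system is required")
    return min(systems, key=mtbf)
```

For example, given one system with 2 faults in 1000 operating hours and another with 5 faults in the same period, the second system has the lower estimated MTBF and is selected for separation.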
In this manner, with the lockstep fault-tolerant system, even when a processor system suspected of having a fault is separated, the processor systems of the remaining systems continue the processing. Then, when the separated processor system is, for example, determined to be normal and is therefore incorporated into the fault-tolerant system again, that processor system resumes the lockstep operation.
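The separation and reincorporation sequence described above can be modeled as a simple membership change. This is only an illustrative sketch: the class `LockstepGroup` and its methods are hypothetical, and the diagnosis that determines whether a separated processor system is normal is represented by a plain boolean argument.

```python
class LockstepGroup:
    """Hypothetical model of which processor systems are in lockstep."""

    def __init__(self, names):
        self.active = set(names)   # systems performing the lockstep operation
        self.separated = set()     # systems separated after a detected fault

    def separate(self, name):
        """Remove a suspected system; the remaining systems keep processing."""
        if name not in self.active:
            raise ValueError(f"{name} is not active")
        if len(self.active) == 1:
            raise RuntimeError("cannot separate the last active system")
        self.active.remove(name)
        self.separated.add(name)

    def reincorporate(self, name, diagnosed_normal):
        """Return a separated system to lockstep only if diagnosed as normal."""
        if name not in self.separated:
            raise ValueError(f"{name} is not separated")
        if not diagnosed_normal:
            return False
        self.separated.remove(name)
        self.active.add(name)
        return True

    def is_redundant(self):
        """True while two or more systems are operating in lockstep."""
        return len(self.active) >= 2
```

In this model, processing continues as long as at least one system remains active, and redundancy is restored once the separated system is reincorporated.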