1. Field of the Invention
The present invention relates to a fault-tolerant computer with DMR (Double Modular Redundancy) systems each including a CPU subsystem having a CPU and a main storage unit and an IO subsystem, and a method of controlling such a fault-tolerant computer.
2. Description of the Related Art
In recent years, growing computer functionality has enabled computers to be used in a wider variety of fields. Such computers are required to operate continuously even in the event of faults. One solution to meet the requirement is fault-tolerant technology with duplex systems.
A fault-tolerant computer built on the fault-tolerant technology has duplex systems each including a CPU subsystem having a CPU and a main storage unit and an IO subsystem. Some fault-tolerant computers employ a lock-step system for operating the two CPU subsystems synchronously with each other based on synchronous clocks.
In fault-tolerant computers with a lock-step system, the two CPU subsystems may fall out of synchronism, which tends to bring the data stored in their main storage units out of conformity with each other. Causes of the asynchronism are of a relatively minor nature and include a CPU failure, a soft error in which radiation changes the data stored in the main storage units, and an operational event for which the lock-step system does not guarantee synchronous behavior.
When a fault-tolerant computer with a lock-step system suffers an asynchronous operation, a resynchronizing process is performed to synchronize the two CPU subsystems with each other. The resynchronizing process is divided into several phases. One of the phases copies the data stored in the main storage unit of one of the CPU subsystems to the main storage unit of the other CPU subsystem, thereby bringing the data in the main storage units into conformity with each other.
A conventional fault-tolerant computer that is designed to speed up the above resynchronizing process is disclosed in JP-A-1998-177498 (hereinafter referred to as “Document 1”), for example.
In the fault-tolerant computer disclosed in Document 1, when a write event occurs to write data to one of the main storage units while the CPU subsystems are operating in synchronism with each other, each of the CPU subsystems records information representing the position (page) of the main storage unit where the recorded data is updated by the write event and the time when the recorded data is updated by the write event. In the event of an asynchronous operation, based on the recorded information, only the updated data of the data stored in the main storage unit of one of the CPU subsystems is copied to the main storage unit of the other CPU subsystem.
In the fault-tolerant computer disclosed in Document 1, since only the updated data, but not all the data stored in the main storage unit, is copied, the time required to copy the data is shortened, and hence the resynchronizing process is shortened.
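The update-tracking scheme described above can be illustrated with a minimal sketch. It is not taken from Document 1. The class `MainStorage`, the functions `pages_to_copy` and `resync`, the 4096-byte page size, the integer timestamps, and the rule that a page is copied when either subsystem wrote to it after the last time the subsystems were known to be synchronized are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; not specified in the source


class MainStorage:
    """Hypothetical model of one main storage unit, divided into pages."""

    def __init__(self, num_pages):
        self.pages = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        self.dirty = {}  # page index -> time of the most recent write

    def write(self, page, data, now):
        """Store data in a page and record where and when the update occurred."""
        self.pages[page] = data
        self.dirty[page] = now


def pages_to_copy(source, target, last_sync_time):
    """Return the pages written on either side after last_sync_time (assumed rule)."""
    never = float("-inf")
    candidates = set(source.dirty) | set(target.dirty)
    return sorted(
        p for p in candidates
        if max(source.dirty.get(p, never), target.dirty.get(p, never)) > last_sync_time
    )


def resync(source, target, last_sync_time):
    """Copy only the updated pages from source to target, then clear the records."""
    copied = pages_to_copy(source, target, last_sync_time)
    for p in copied:
        target.pages[p] = source.pages[p]
    source.dirty.clear()
    target.dirty.clear()
    return copied
```

In this sketch, if both units receive the same write at time 1 and then diverge at time 2, calling `resync(source, target, 1)` copies only the pages written at time 2, rather than every page of the main storage unit.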
However, when an asynchronous operation of the fault-tolerant computer disclosed in Document 1 is detected, the cause of the asynchronous operation is not identified. Instead, the resynchronizing process is carried out immediately to copy the updated data of the data stored in the main storage unit of one of the CPU subsystems to the main storage unit of the other CPU subsystem.
Consequently, if the asynchronous operation is caused by a failure of the CPU of the CPU subsystem from which the updated data is to be copied, then correct operation of the fault-tolerant computer cannot be guaranteed even after the resynchronizing process is fully completed.
Accordingly, when an asynchronous operation of a fault-tolerant computer is detected, it is necessary to identify the cause of the asynchronous operation and then to start the resynchronizing process.
If a long period of time is consumed to identify the cause of an asynchronous operation, however, because the CPU subsystems keep operating out of synchronism during that period of time, the nonconformity of the data stored in the main storage units of the CPU subsystems grows, making it difficult to shorten the resynchronizing process that is to be performed subsequently.
Therefore, if the cause of an asynchronous operation is to be identified prior to the start of the resynchronizing process, then it is necessary to minimize the nonconformity of the data stored in the main storage units of the CPU subsystems.
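The growth of data nonconformity while the cause is being identified can be illustrated with a small sketch. The function `divergent_pages` and the two write traces are hypothetical. The sketch also assumes, conservatively, that every page written by either CPU subsystem while out of synchronism must later be copied.

```python
def divergent_pages(writes_a, writes_b, diagnosis_steps):
    """Count pages written by either subsystem during the first diagnosis_steps writes.

    writes_a and writes_b are lists of page indices written by each CPU
    subsystem, in order, after synchronism is lost (hypothetical traces).
    """
    touched = set(writes_a[:diagnosis_steps]) | set(writes_b[:diagnosis_steps])
    return len(touched)


# Hypothetical traces of page indices written after synchronism is lost.
trace_a = [0, 1, 2, 3, 4, 5]
trace_b = [0, 7, 2, 8, 9, 5]

for steps in (1, 3, 6):
    print(steps, divergent_pages(trace_a, trace_b, steps))
```

With these traces, the number of pages to be copied rises from 1 to 4 to 9 as the diagnosis period lengthens from 1 to 3 to 6 write steps, which is why a longer identification period lengthens the subsequent resynchronizing process.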