1. Field of the Invention
The present invention relates to a lockstep fault tolerant computer system, and a synchronization method for the same, in which a plurality of computer systems simultaneously execute the same instruction string at timing synchronized with a clock signal. In particular, the invention relates to a lockstep fault tolerant computer system and a synchronization method for the same in which, when the operational synchronism (the lockstep state) between the computer systems is disturbed in a first computer system, the operation to restore the operational synchronism (to rebuild the lockstep state) can be conducted without stopping the ordinary operation of the computer systems other than the first computer system, which has been separated from the fault tolerant computer system.
2. Description of the Prior Art
According to the lockstep fault tolerant computer system of the related art, when it is detected that one of a plurality of computer systems synchronously executing the same instruction string conducts an output operation different from that of the other computer systems, due to a failure or to external or internal factors, the computer system in which the fault has taken place is temporarily separated from the other computer systems and set to a non-operating state. The computer system is replaced with another system if this is necessary given the cause of the fault. If replacement is not required, the computer system is, for example, re-initialized as necessary and set again to an operative state.
In the operation to restore the failed computer system to the operative state in the conventional lockstep fault tolerant computer system, the contents of the memory of the replaced computer, or of the memory of a computer on the standby side in the initial setup phase, must be set to match those of the associated computer on the active side. Therefore, it is required to copy the entire data of the main memory of the computer system in the operative state onto the main memory of the computer system to be restored to an operative state.
In the lockstep fault tolerant computer system of the related art, after the failed computer system is replaced, or is re-initialized as appropriate for the section thereof that caused the failure, the computer systems in the operative state are stopped in order to restore the failed computer system to an operative state (see, for example, Japanese Patent Application Laid-Open No. 2004-046455).
That is, in the conventional lockstep fault tolerant computer system, the whole computer system must stop operation for a long period of time, i.e., three to five seconds or even a period on the order of minutes, in order to restore the failed computer system to an operative state, for the following reason.
To restore the failed computer system to an operative state, the contents of the main memory of the computer system in the operative state are copied onto the main memory of the computer system to be recovered, as described above. If the other computers continue to work during this copy, there is a risk that the contents of the main memory of the computer system in the operative state will be corrupted. Specifically, in a case in which the computer system on the copy source side continues operation during the copy operation, if data at an address for which the copy has already been completed is updated, the main memory contents of the computer system in the operative state cannot be appropriately copied onto the computer system to be restored to an operative state.
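The hazard described above may be illustrated by the following minimal sketch, which is not taken from the cited references. It assumes a main memory modeled as a Python list and a hypothetical function `copy_with_concurrent_write` in which the copy source is updated at an address that has already been copied; all names and values are illustrative assumptions.

```python
def copy_with_concurrent_write(source, writes_after_step):
    """Copy `source` address by address while the source keeps running.

    `writes_after_step` (hypothetical) maps a copy step to a
    (target_address, value) write that the still-running source performs
    immediately after that step has been copied.
    """
    dest = [None] * len(source)
    for addr in range(len(source)):
        dest[addr] = source[addr]  # copy one address
        if addr in writes_after_step:
            target, value = writes_after_step[addr]
            source[target] = value  # the source side continues operating
    return dest


source = [10, 20, 30, 40]
# After address 2 has been copied, the source updates address 0,
# whose copy was completed earlier.
dest = copy_with_concurrent_write(source, {2: (0, 99)})

# Addresses at which the destination no longer matches the source.
stale = [a for a in range(len(source)) if dest[a] != source[a]]
print(stale)
```

In this sketch, address 0 on the destination retains the value held before the update, so the two memories no longer match once the copy has finished.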
To remove the difficulty, a technique has been proposed, for example, in Japanese Patent Application Laid-Open No. 2001-265750. According to the technique, registers are provided to keep addresses of areas for which the copy has been finished. When a write instruction occurs, the address of the instruction is checked to determine whether or not the copy has already been completed for the area designated by the address. If the write instruction is requested with an address of an area for which the copy operation has been finished, the address is stored in the register. After the memory copy has been completely conducted, the address is obtained from the register and data is read from the corresponding area. The data is then transferred to the computer of the transfer destination.
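The steps of the technique described above may be sketched as follows. This is a simplified model under stated assumptions and not the implementation of the cited reference: the class `TrackedCopy`, its methods, and the use of a Python set in place of hardware address registers are all hypothetical.

```python
class TrackedCopy:
    """Memory copy that records writes to areas already copied."""

    def __init__(self, source):
        self.source = source
        self.dest = [None] * len(source)
        self.copied_upto = 0  # addresses below this value have been copied
        self.dirty = set()    # stand-in for the address registers

    def copy_next(self):
        """Copy the next address that has not yet been copied."""
        addr = self.copied_upto
        self.dest[addr] = self.source[addr]
        self.copied_upto += 1

    def write(self, addr, value):
        """Write issued by the source side while the copy is in progress."""
        self.source[addr] = value
        if addr < self.copied_upto:  # copy already completed for this area
            self.dirty.add(addr)     # keep the address in the register

    def finish(self):
        """Complete the copy, then re-read and transfer recorded areas."""
        while self.copied_upto < len(self.source):
            self.copy_next()
        for addr in sorted(self.dirty):
            self.dest[addr] = self.source[addr]  # read again, then transfer
        self.dirty.clear()


tc = TrackedCopy([10, 20, 30, 40])
tc.copy_next()
tc.copy_next()
tc.write(0, 99)  # address already copied: recorded
tc.write(3, 77)  # address not yet copied: not recorded
recorded = sorted(tc.dirty)
tc.finish()
print(recorded, tc.dest == tc.source)
```

In this model only the write to the already copied address is recorded, and the recorded area is read a second time after the main copy has finished, which is the additional read-and-transfer step described above.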
In the conventional lockstep fault tolerant computer system, when a computer system separated from the fault tolerant computer system due to a fault is restored to an operative state, the other computer systems in operation are also temporarily stopped so that the main memory copy can be conducted. Since the main memory of a present-day computer system has a capacity of several gigabytes, copying the entire memory takes a long period of time. This leads to a problem that the system operation is stopped for a long period of time. To cope with the problem, there has been proposed a technique in which, while memory access is limited in the computer systems on the active side on which the fault tolerant computer system is normally operating, the contents of the main memory are copied from the active side onto the computer system on the standby side to be restored. However, this gives rise to a problem that special processing is required of the software and that performance is remarkably lowered in the computer systems that continue executing processing.
According to the method described in Japanese Patent Application Laid-Open No. 2001-265750, in a situation in which a memory read operation is about to be executed for the memory copy, if a memory write instruction is issued for the address of that memory read operation, there is a possibility that the memory write operation will proceed ahead of the read. Therefore, it cannot be guaranteed that the contents of the memory of the transfer source are completely sent to the memory of the transfer destination. Moreover, in this method, since data is read again from the area at the address where the memory write has been conducted and is only then transferred to the destination, there occurs a problem that the transfer of the write data takes a long period of time.