A fault tolerant system is known as a computer system with high availability. In a fault tolerant system, an application or an OS (Operating System) can continue to operate transparently, without special processing, when a failure occurs. Fault tolerant systems are known to be implemented either by a hardware system or by a software system. The present invention relates to an improvement of the fault tolerant system employing the hardware system.
In the fault tolerant system of the hardware system, the main hardware components, such as a CPU (Central Processing Unit), memory, and storage, have redundant configurations. If a failure occurs in a component, the fault tolerant system using the hardware system separates that component and continues operations. A module including the CPU and memory is called a CPU sub-system, and a module including various types of IO (Input Output) devices is called an IO sub-system. In a common fault tolerant system with duplicated components, the method of duplexing the CPU sub-system differs from the method of duplexing the IO sub-system. The duplicated CPU sub-systems perform exactly the same hardware operations on every clock. This is called lockstep synchronization. Because the duplicated CPU sub-systems perform perfectly identical operations, when a failure occurs the failed CPU sub-system is logically separated and the normal CPU sub-system instantly takes over and continues the operations. The IO sub-systems are not in lockstep synchronization, but when a failure occurs in one IO sub-system, another IO sub-system takes over its operations.
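By way of illustration only, the separation and takeover behavior of duplicated CPU sub-systems described above may be sketched as follows. The class names `CpuSubsystem` and `LockstepPair`, the `failed` flag, and the use of a single counter to stand in for the CPU state are all assumptions made for this sketch and are not taken from any related technology.

```python
class CpuSubsystem:
    """Hypothetical stand-in for one CPU sub-system (CPU and memory)."""

    def __init__(self, name):
        self.name = name
        self.failed = False
        self.counter = 0  # stands in for the full hardware state

    def step(self, value):
        """Perform one operation; raise if this sub-system has failed."""
        if self.failed:
            raise RuntimeError(f"{self.name} has failed")
        self.counter += value
        return self.counter


class LockstepPair:
    """Two sub-systems that receive identical inputs on every step."""

    def __init__(self):
        self.members = [CpuSubsystem("cpu0"), CpuSubsystem("cpu1")]

    def step(self, value):
        results = []
        # Iterate over a copy so a failed member can be removed safely.
        for cpu in list(self.members):
            try:
                results.append(cpu.step(value))
            except RuntimeError:
                # Logically separate the failed sub-system.
                self.members.remove(cpu)
        if not results:
            raise RuntimeError("no normal CPU sub-system remains")
        # The remaining member holds identical state, so operation continues.
        return results[0]
```

Because both members process the same inputs, the surviving member already holds the same state as the failed one, which is why no special processing is needed at takeover in this sketch.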
A known method of detecting an abnormal operation of a CPU sub-system in the lockstep synchronous state compares the data of accesses from each of the CPU sub-systems to the IO sub-system. In the method, each of the sub-systems generates a checksum from access data generated in the CPU of the sub-system. Each of the sub-systems transmits the generated checksum to the other sub-system through a crosslink. Each of the sub-systems detects a mismatch of operations between the sub-systems by comparing the checksum it generated with the checksum received through the crosslink. The method described above is disclosed as a first related technology related to the invention (e.g. refer to Patent Literature 1 (Japanese Patent Application Laid-open No. 2010-218370)).
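By way of illustration only, the checksum comparison described above may be sketched as follows. The use of CRC-32 as the checksum, and the function names `access_checksum` and `lockstep_mismatch`, are assumptions made for this sketch; the related technology does not specify a particular checksum algorithm here, and the crosslink is represented simply by passing the peer's checksum as an argument.

```python
import zlib


def access_checksum(access_data: bytes) -> int:
    """Checksum of the access data generated in a CPU (CRC-32 is assumed)."""
    return zlib.crc32(access_data)


def lockstep_mismatch(local_access_data: bytes, received_checksum: int) -> bool:
    """Compare the locally generated checksum with the one received
    from the other sub-system through the crosslink.

    Returns True when the two sub-systems have diverged.
    """
    return access_checksum(local_access_data) != received_checksum


# Each sub-system sends its own checksum to the other one.
checksum_from_peer = access_checksum(b"write addr=0x10 data=0x01")

# Identical accesses on both sides: no mismatch is detected.
in_sync = not lockstep_mismatch(b"write addr=0x10 data=0x01", checksum_from_peer)

# A differing access on the local side: a mismatch is detected.
diverged = lockstep_mismatch(b"write addr=0x10 data=0x00", checksum_from_peer)
```

Exchanging only a checksum, rather than the full access data, keeps the amount of data sent over the crosslink small, at the cost of a small probability that differing data produce the same checksum.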
Related technologies of the fault tolerant system are described below.
A fault tolerant system in which two systems are connected to each other through a crosslink is known. Each of the two systems includes a CPU sub-system, an IO sub-system connected thereto, and an ft (fault tolerant) controller. The CPU sub-systems of the two systems operate at the same timing based on clock step synchronization. The ft controller is connected between the two systems. The ft controller associates each of a plurality of system operations, which perform error processing, duplexing processing, and resynchronization for fault tolerance in the systems, with a preset event signal representing one of a plurality of related states, and thereby manages the plurality of system operations. The ft controller selects a system operation from the plurality of system operations depending on the event signals and causes the CPU sub-system to execute the selected system operation, while causing the state of each system to transition. The fault tolerant system described above is disclosed as a second related technology related to the invention (e.g. refer to Patent Literature 2 (Japanese Patent Application Laid-open No. 2006-178616)).
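By way of illustration only, the association between event signals and system operations described above may be sketched as a lookup table. The event names, the state names, the particular state transitions, and the class name `FtController` are all hypothetical and are chosen only for this sketch; they are not taken from the second related technology.

```python
from enum import Enum, auto


class Event(Enum):
    """Hypothetical preset event signals."""
    ERROR_DETECTED = auto()
    RESYNC_REQUESTED = auto()
    DUPLEX_REQUESTED = auto()


class State(Enum):
    """Hypothetical system states."""
    DUPLEX = auto()
    SIMPLEX = auto()
    RESYNCING = auto()


class FtController:
    def __init__(self):
        self.state = State.DUPLEX
        self.executed = []  # names of the operations run so far
        # Each event signal is associated with one system operation
        # and with the state entered after that operation.
        self._table = {
            Event.ERROR_DETECTED: ("error_processing", State.SIMPLEX),
            Event.RESYNC_REQUESTED: ("resynchronization", State.RESYNCING),
            Event.DUPLEX_REQUESTED: ("duplexing_processing", State.DUPLEX),
        }

    def handle(self, event: Event) -> State:
        """Select the operation for the event, run it, and change state."""
        operation, next_state = self._table[event]
        self.executed.append(operation)  # stands in for running it on the CPU
        self.state = next_state
        return self.state
```

Keeping the association in a single table means that the set of managed operations can be inspected or extended in one place, which is one possible reading of how the controller "manages" the operations.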
Another fault tolerant system of the kind described above is also known. In this fault tolerant system, tag information, which includes ID (Identifier) codes of an access source and an access destination and synchronization information indicating whether the access is synchronous or not, is attached to an access packet sent from a CPU sub-system to an IO sub-system. An access comparing unit of each system determines whether to perform a first access operation or a second access operation on the basis of the tag information attached to the access packet. The first access operation is an access operation performed when a plurality of CPU sub-systems are in a lockstep synchronous state. The second access operation is an access operation for an asynchronous state. The fault tolerant system described above is disclosed as a third related technology related to the invention (e.g. refer to Patent Literature 3 (Japanese Patent Application Laid-open No. 2006-178615)).
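By way of illustration only, the selection between the first and second access operations on the basis of the tag information may be sketched as follows. The field names of `AccessPacket`, the enumeration `AccessOperation`, and the function `select_access_operation` are hypothetical and are introduced only for this sketch; the actual packet layout is not described here.

```python
from dataclasses import dataclass
from enum import Enum


class AccessOperation(Enum):
    FIRST = "lockstep synchronous access"
    SECOND = "asynchronous access"


@dataclass(frozen=True)
class AccessPacket:
    """Access packet from a CPU sub-system to an IO sub-system."""
    source_id: int       # ID code of the access source
    destination_id: int  # ID code of the access destination
    synchronous: bool    # synchronization information in the tag
    payload: bytes


def select_access_operation(packet: AccessPacket) -> AccessOperation:
    """Decide which access operation the access comparing unit performs."""
    if packet.synchronous:
        return AccessOperation.FIRST
    return AccessOperation.SECOND


sync_packet = AccessPacket(source_id=0, destination_id=8, synchronous=True, payload=b"\x01")
async_packet = AccessPacket(source_id=1, destination_id=8, synchronous=False, payload=b"\x01")
```

In this sketch only the synchronization information drives the decision; the ID codes are carried in the packet so that a fuller implementation could also take the access source and destination into account.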