The present invention is related to general purpose digital computer systems, and is more particularly related to fault tolerant computer systems.
A number of fault tolerant systems have recently been developed. Some such systems offer pure software solutions for non-stop operation by requiring the user to program checkpoints into the data processing routines, at which results from a processor of the system can be compared by software to determine whether the system is continuing to operate correctly and without error.
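By way of illustration only, the software checkpoint approach described above may be sketched as follows. The sketch assumes that each checkpoint compares the results produced by redundant processors executing the same step; the names `run_with_checkpoints`, `healthy`, `make_faulty` and `STEPS` are hypothetical and are not taken from any of the systems discussed.

```python
from typing import Callable, Sequence

Step = Callable[[int], int]             # one unit of work between checkpoints
Processor = Callable[[Step, int], int]  # hypothetical model of a processor


class CheckpointMismatch(Exception):
    """Raised when redundant results disagree at a checkpoint."""


def run_with_checkpoints(steps: Sequence[Step],
                         processors: Sequence[Processor],
                         state: int) -> int:
    """Run each step on every processor and compare results in software."""
    for index, step in enumerate(steps):
        results = [proc(step, state) for proc in processors]
        # Checkpoint: all redundant results must agree before continuing.
        if any(r != results[0] for r in results[1:]):
            raise CheckpointMismatch(f"checkpoint {index}: results {results}")
        state = results[0]
    return state


def healthy(step: Step, state: int) -> int:
    """A processor that always computes the step correctly."""
    return step(state)


def make_faulty(fail_on_call: int) -> Processor:
    """Build a processor that corrupts its result on one chosen call."""
    calls = {"n": 0}

    def proc(step: Step, state: int) -> int:
        calls["n"] += 1
        result = step(state)
        return result + 1 if calls["n"] == fail_on_call else result

    return proc


# Two illustrative steps: add 2, then multiply by 3.
STEPS = [lambda x: x + 2, lambda x: x * 3]
```

In this sketch the burden of placing checkpoints falls on the programmer, which is the characteristic of pure software solutions noted above.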
Other systems offer complete hardware solutions, including redundant logic with total transparency to software on all solid failures. However, processing in such systems cannot continue on a unit when a transient error occurs because special diagnostics must be invoked to determine if, in fact, the error is a transient error rather than a solid failure. Many times, a second processor is required to ensure non-stop operation on both transient errors and solid failures. With two processors in the system, only 50% of the potential computational power of each processor is utilized, because both processors must be executing identical tasks in parallel to provide continued operation in the event of a failure. When a detected failure is corrected in the faulty unit of such a system, the two processors typically must be resynchronized to continue parallel operations.
Such systems generally incur significant overhead on transient errors (which statistically occur 10 to 100 times more frequently than hard errors) and have a period of vulnerability on the order of one million machine cycles (the time required to bring the first processor back on-line). A transient error occurring in the second processor during this period of vulnerability will bring the system down.
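The exposure implied by such a period of vulnerability may be estimated with a simple calculation, given here as a sketch only. It assumes that transient errors occur independently in each machine cycle with a fixed probability; that per-cycle probability is a hypothetical input and is not a figure stated for any of the systems discussed. Only the window length of one million cycles is taken from the text above.

```python
import math

# Order of magnitude of the period of vulnerability stated in the text.
VULNERABILITY_CYCLES = 1_000_000


def outage_probability(transient_rate_per_cycle: float,
                       window_cycles: int = VULNERABILITY_CYCLES) -> float:
    """Probability of at least one transient error during the window.

    Assumes independent errors with a fixed per-cycle probability
    (a hypothetical model, not a measured figure).
    """
    if not 0.0 <= transient_rate_per_cycle <= 1.0:
        raise ValueError("rate must be a probability between 0 and 1")
    if transient_rate_per_cycle == 1.0:
        return 1.0
    # 1 - (1 - p)**n, evaluated via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(window_cycles * math.log1p(-transient_rate_per_cycle))
```

Under this model, shortening the period of vulnerability reduces the chance that a transient error in the second processor brings the system down.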
U.S. Pat. No. 4,453,215 issued June 5, 1984 to Reid for "Central Processing Apparatus for Fault-Tolerant Computing" discloses a fault tolerant computer system in which the information-handling parts of the system each have a duplicate partner. Error detectors check the operation of the system so that information transfers occur only on fault-free bus conductors and between fault-free units.
Other patents which show the state of the art include U.S. Pat. No. 4,165,533 issued Aug. 21, 1979 to Jonsson for "Identification of a Faulty Address Decoder in a Function Unit of a Computer Having a Plurality of Function Units With Redundant Address Decoders"; U.S. Pat. No. 4,453,210 issued June 5, 1984 to Suzuki et al. for "Multiprocessor Information Processing System Having Fault Detection Function Based on Periodic Supervision Of Updated Fault Supervising Codes"; U.S. Pat. No. 4,453,213 issued June 5, 1984 to Romagosa for "Error Reporting Scheme"; and U.S. Pat. No. 4,456,993 issued June 26, 1984 to Taniguchi et al. for "Data Processing System With Error Processing Apparatus and Error Processing Method."