This Application is related to the following commonly-assigned, co-pending application entitled METHOD AND SYSTEM FOR UPGRADING FAULT-TOLERANT SYSTEMS, identified by Cesari and McKenna File No. 104160-0010.
1. Field of the Invention
The present invention is related to fault-tolerant computer systems and, in particular, to a method for efficiently providing reliable operation in a computer system.
2. Background Information
In most data processing applications, reliable performance of a computer system is critical. To provide for a specified level of reliability, the computer system may include at least one redundant, or backup, central processing unit (CPU), where the CPUs perform the same operations and provide the same data output stream. The input/output (I/O) buses of the CPUs are continually monitored and compared to identify any differences in their respective data streams. If signal differences are detected, a voting device applies predetermined criteria to identify one of the CPUs as malfunctioning. In a redundant computer system having two CPUs, for example, the voting device may identify the CPU having a history of greater cumulative error correction as the malfunctioning CPU. However, experience has shown that this method has an unacceptably low accuracy rate.
The accuracy rate improves with the addition of a second redundant CPU to the computer system. All three CPU outputs are monitored and, when differences are detected, the CPU determined to be malfunctioning is the CPU producing an output not in agreement with the other two CPUs. This approach, however, incurs the additional expense and complexity of integrating the third CPU into the computer system.
Another method used in computer systems having only two redundant CPUs is to have each CPU revert to an idle state and/or lose output data while diagnostic procedures are initiated to determine which CPU is malfunctioning. Based on the results of the diagnostic procedure, one CPU may be identified as malfunctioning. One undesirable side effect of this approach is that the operation of the computer system is impacted and may be severely disrupted while the CPUs are in the idle state.
It is therefore an object of the present invention to provide a computer system achieving a high degree of reliability with a redundant CPU.
It is a further object of the present invention to provide such a computer system in which a malfunctioning CPU can be identified without first placing the CPU into an idle state.
It is a still further object of the present invention to provide such a computer system in which computational data is not lost while the malfunctioning CPU is identified.
It is yet another object of the present invention to provide such a system in which a malfunctioning CPU can be identified with a high degree of reliability. Other objects of the invention will be obvious, in part, and, in part, will become apparent when reading the detailed description to follow.
The present invention comprises a fault-tolerant computer system which includes a pair of CPUs that produce essentially identical data output streams, a voter delay buffer having first and second FIFO buffers, and an I/O module interconnecting the CPUs and the FIFO buffers. The I/O module compares the data output streams from the two CPUs for differences. If both CPU output streams remain identical, the data output of a selected CPU is transmitted to one or more peripheral devices. Otherwise, if the comparator indicates differences, the data output stream from one CPU is rerouted to the first FIFO, and the data output streams from the other CPU is rerouted to the second FIFO. Meanwhile, the CPUs continue processing operations and ongoing diagnostic procedures to identify one of the CPUs as malfunctioning. The FIFOs provide buffering for the data output streams which would otherwise be discarded. Additionally, use of the FIFOs allows the CPUs to continue operation and avoid a disruption to the computer system. If neither CPU is diagnosed as malfunctioning, the I/O module uses data from a priority module to determine which CPU has a higher assigned priority, and identifies the higher-priority CPU as the correctly-functioning CPU. In either case, the computer system then provides the data held in the FIFO associated with the correctly-functioning CPU to the peripheral devices. By thus buffering the data output streams, the present invention allows the computer system to utilize the diagnostic procedures for increasing the probability of correctly identifying a CPU as malfunctioning.