The invention relates to computer systems having first and second processing sets, each of which may comprise one or more processors, which communicate with an I/O device bus.
More particularly, the present invention relates to fault tolerant computer systems in which the first and the second processing sets are arranged to communicate with the I/O device bus in a step locked manner, with provision for identifying lockstep errors in order to detect faulty operation of the computer system.
Generally, it is desirable to provide fault tolerant computer systems with a facility not only for detecting faults, but also for automatically recovering from the detected faults. By detecting and recovering from the detected faults, the computer system is provided with a higher degree of system availability.
Automatic recovery from an error presents significant technical challenges. This is because the computer system has to be arranged to continue operating following fault detection, so as to maintain functional system performance, whilst permitting diagnostic operations to be performed to locate and remedy the fault.
The applicant has disclosed in a co-pending international patent application Ser. No. US99/124321, corresponding to U.S. patent application Ser. No. 09/097,485, a fault tolerant computer system in which first and second processing sets are connected via an I/O device bus to a bridge. The bridge operates to monitor the step locked operation of the processing sets, each processing set being arranged to operate in accordance with substantially identical software. The software includes, for example, an operating system of the computer system. If the bridge detects that one of the first or the second processing sets departs from the step locked operation with respect to the other processing set, then it is assumed that a fault or error condition has occurred. Interrogation and analysis are then performed by the operating system following error reports from the bridge. The operating system determines which of the first or the second processing sets is in error, or whether another device was in error, and takes corrective action. This fault detection and diagnosis is provided through the step locked operation of the processing sets as monitored by the bridge.
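By way of illustration only, the monitoring role of the bridge described above may be sketched as a step-by-step comparison of the bus operations issued by the two processing sets. The sketch is not taken from the referenced applications: the names `BusOperation`, `LockstepError` and `monitor_lockstep`, and the choice to compare address, data and direction at each step, are assumptions made for the example. As in the description above, the sketch only reports the first point of divergence; determining which processing set (or other device) is at fault is left to later analysis.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BusOperation:
    """One I/O bus operation issued by a processing set (hypothetical model)."""
    address: int
    data: int
    is_write: bool


@dataclass(frozen=True)
class LockstepError:
    """Report of the first step at which the two processing sets diverged."""
    step: int
    first: Optional[BusOperation]   # None if the first set issued nothing
    second: Optional[BusOperation]  # None if the second set issued nothing


def monitor_lockstep(
    first_ops: Sequence[BusOperation],
    second_ops: Sequence[BusOperation],
) -> Optional[LockstepError]:
    """Compare two operation streams step by step.

    Returns None while the streams agree at every step, otherwise a
    LockstepError describing the first divergence. The function does not
    decide which processing set is faulty.
    """
    for step in range(max(len(first_ops), len(second_ops))):
        a = first_ops[step] if step < len(first_ops) else None
        b = second_ops[step] if step < len(second_ops) else None
        if a != b:
            return LockstepError(step=step, first=a, second=b)
    return None
```

In this sketch a missing operation on one side is treated as a divergence in the same way as a differing operation, which reflects the assumption that both processing sets are expected to issue the same operation at the same step.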
Although the step locked operation provides an effective system for detecting faults, the two processing sets are effectively operating autonomously, and so some conditions can occur in which the two processing sets do not operate in lock step even though an error has not occurred.