The invention relates generally to fault tolerant computer systems such as lockstep fault tolerant computers which use multiple subsystems that run identically.
In such lockstep fault tolerant computer systems, the outputs of the subsystems are compared within the computer and, if the outputs differ, some exceptional repair action is taken.
FIG. 1 of the accompanying drawings is a schematic overview of an example of a typical system, in which three identical processing (CPU) sets 10, 11, 12 operate in synchronism (sync) under a common clock 16. By a processing set is meant a subsystem including a processing engine, for example a central processing unit (CPU), and internal state storage.
As shown in FIG. 1, the outputs of the three processing sets 10, 11, 12 are supplied to a fault detector unit (voter) 17 to monitor the operation of the processing sets 10, 11, 12. If the processing sets 10, 11, 12 are operating correctly, they produce identical outputs to the voter 17. Accordingly, if the outputs match, the voter 17 passes commands from the processing sets 10, 11, 12 to an input/output (I/O) subsystem 18 for action. If, however, the outputs from the processing sets differ, this indicates that something is amiss, and the voter causes some corrective action to occur before acting upon an I/O operation.
Typically, a corrective action includes the voter supplying a signal via the appropriate line 14 to a processing set showing a fault to cause a "change me" light (not shown) to be illuminated on the faulty processing set. The defective processing set is switched off and an operator then has to replace it with a correctly functioning unit. In the example shown, a defective processing set can normally be easily identified by majority voting because of the two-to-one vote that will occur if one processing set fails or develops a temporary or permanent fault.
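By way of illustration only, the majority voting described above can be sketched in software as follows. This is a minimal sketch and not the voter 17 itself, which would ordinarily be implemented in hardware. The function name `vote` and the representation of each processing set's output as a comparable value are assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Compare the outputs of the processing sets (hypothetical helper).

    Returns (agreed_output, dissenting_indices). If no strict majority
    exists, agreed_output is None and every set is reported as suspect,
    which corresponds to the case where further diagnosis is needed.
    """
    counts = Counter(outputs)
    value, votes = counts.most_common(1)[0]
    if votes * 2 <= len(outputs):
        # No strict majority: the faulty set cannot be identified by voting.
        return None, list(range(len(outputs)))
    # Sets whose output differs from the majority are flagged as faulty.
    dissenting = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenting


# Three sets agree: the command is passed on and no set is flagged.
print(vote(["write 0x10", "write 0x10", "write 0x10"]))
# Two-to-one vote: the set at index 1 is identified as faulty.
print(vote(["write 0x10", "write 0x11", "write 0x10"]))
```

With three processing sets, a single failure yields the two-to-one vote mentioned above; with only two sets, a mismatch can be detected but the faulty set cannot be identified by voting alone.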
However, the invention is not limited to such systems, but is also applicable to systems where extensive diagnostic operations are needed to identify the faulty processing set. The system need not have a single voter, and need not vote merely I/O commands. The invention is generally applicable to synchronous systems with redundant components which run in lockstep.
A particular problem exists when each processing set itself consists of multiple independently replaceable units. While it may be easy to identify the faulty processing set, it may not be so easy to locate the particular faulty module within that processing set. It is highly desirable, for cost reasons, to replace just the single module rather than a whole processing set.
FIG. 2 shows a processing set made of multiple modules which, in this example, comprise modules M0-M3 and an input/output module IOM. Processing sets 11 and 12 are identical to processing set 10. In a lockstep system, the lockstep modules have to be synchronous to a common clock so that they do not get out of step. Each processing module in FIG. 2 operates synchronously with this clock, and processing module M0 in processing set 11 is normally operating identically to processing module M0 in processing set 10. The operation of such a synchronous module should be determined at all times by the inputs presented to the module and the internal stored state of the module. The stored state depends, in turn, on all the inputs presented to the module since the module started. In a lockstep system, both the inputs to processing module M0 and the internal stored state of processing module M0 are identical on all the processing sets, unless there is a fault.
FIG. 3 is a schematic representation of the processing module M0, which includes a processing or computation unit 22 and internal state storage 24, where the internal stored state depends on the inputs 26 and contributes to the outputs 28. The stored state depends on the design of the module M0 and, potentially, on all the inputs that the module M0 has received. The processing modules M0 in each of the processing sets 10, 11 and 12 are identical. The processing modules are all clocked in response to a common clock supplied to each processing module at the clock input 30.
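The behaviour of such a synchronous module can be modelled, purely for illustration, as a deterministic state machine. The following is a minimal sketch under stated assumptions: the class `Module`, the function `first_divergence`, the 16-bit state, and the particular update rule are all hypothetical and are chosen only so that the output depends on both the current input and the stored state. The injected single-bit upset is likewise an assumed fault model.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Module:
    """Toy model of a synchronous module with internal stored state."""
    state: int = 0

    def clock(self, inp: int) -> int:
        # The next stored state depends on the previous state and the input,
        # so it depends on every input received since the module started.
        self.state = (self.state * 31 + inp) & 0xFFFF
        # The output depends on the stored state and the current input.
        return self.state ^ inp


def first_divergence(inputs: Sequence[int],
                     fault_cycle: Optional[int] = None) -> Optional[int]:
    """Run two replicas in lockstep on identical inputs.

    Returns the first clock cycle at which their outputs differ,
    or None if they remain identical throughout.
    """
    a, b = Module(), Module()
    for cycle, inp in enumerate(inputs):
        if cycle == fault_cycle:
            b.state ^= 1  # hypothetical single-bit upset in replica b
        if a.clock(inp) != b.clock(inp):
            return cycle
    return None


print(first_divergence([1, 2, 3, 4]))                 # fault-free run
print(first_divergence([1, 2, 3, 4], fault_cycle=2))  # fault injected
```

In the fault-free run the two replicas receive identical inputs from an identical starting state and so never diverge; once the stored state of one replica is disturbed, its outputs differ from those of the other replica.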
When a fault occurs within one of the modules M0-M3 of processing set 10, it is processing set 10 as a whole that is discarded. However, it may be that a single faulty module actually needs replacement before processing set 10 can be brought back into operation. The difficulty is to identify the faulty processing module.
An aim of the present invention is to provide a mechanism for locating a faulty module in a fault tolerant computer system.