The present invention is directed to fault-tolerant data-processing systems and, in particular, to recovery from detected errors in systems that employ duplicated communications buses.
Fault-tolerant data-processing systems employ various types of redundancy in order to maintain availability in spite of single faults. One type of redundancy employed in certain such systems is the use of duplicated communications buses for conveying signals among devices coupled to those buses. In the absence of errors, bus devices that transmit information over the buses transmit identical signals over both buses, typically simultaneously. (There are usually only two duplicated buses, but those skilled in the art will appreciate that the teachings of the invention to be described below are also applicable to arrangements that use more than two "duplicated" buses.) Devices that use the information placed on the bus by a transmitting device typically take their information from only a selected one of the buses at any given time, but the bus selection may change if an error condition occurs. For instance, if an error is detected in the information on one of the buses, the bus devices may assume a state in which they thereafter "obey" (i.e., use the information from) the other duplicated bus.
This approach obviously depends on detecting errors, and circuitry for detecting errors can take many forms. The information placed on the buses can be encoded for error detection, for example, and the bus devices can monitor the bus information so as to detect code violations. Another approach is to compare bus-driver inputs with actual bus signals.
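By way of illustration only, the two detection approaches just mentioned can be sketched in Python. The sketch assumes a single even-parity bit as the error-detecting code and models bus words as integers. The function names are hypothetical and are not drawn from any particular system.

```python
def even_parity_bit(data: int) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    return bin(data).count("1") % 2


def encode(data: int):
    """Pair a data word with its parity bit (assumed encoding: even parity)."""
    return data, even_parity_bit(data)


def code_violation(word: int, parity: int) -> bool:
    """First approach: a monitoring device checks the word against its code."""
    return even_parity_bit(word) != parity


def driver_mismatch(driver_input: int, bus_signal: int) -> bool:
    """Second approach: compare what the driver was given with what the bus carries."""
    return driver_input != bus_signal


# A word is placed on the bus, then one bit is corrupted in transit.
word, parity = encode(0b1011)
corrupted = word ^ 0b0001
print(code_violation(word, parity))       # intact word passes the check
print(code_violation(corrupted, parity))  # single-bit error is flagged
print(driver_mismatch(word, corrupted))   # loop-back comparison also flags it
```

A single parity bit detects only an odd number of flipped bits; a practical system would likely use a stronger code, and the comparison of driver inputs with bus signals does not depend on the encoding at all.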
Particularly when the latter mechanism is employed, a device that detects an error must communicate that error's occurrence to all devices that may use information from the buses at one time or another. This may be done, for instance, by means of error-signal-carrying lines, which also form part of the devices' communications channel but may be separate from the duplicated buses. Such lines are generally provided with their own fault-tolerating mechanisms. For instance, error-indicating lines can be "triplicated" so that devices receiving signals on the triplicated error-indicating lines can ascertain their intended contents by majority vote.
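The majority vote over triplicated error-indicating lines can be sketched as follows. This is a minimal illustration that assumes each line carries a single boolean error indication; the function name is hypothetical.

```python
def majority_vote(a: bool, b: bool, c: bool) -> bool:
    """Return the value carried by at least two of the three lines."""
    return (a and b) or (a and c) or (b and c)


# One of the three error lines is stuck low; the vote still reports the error.
print(majority_vote(True, True, False))
# One line is stuck high with no real error; the vote still reports no error.
print(majority_vote(False, False, True))
```

A single faulty error line therefore cannot by itself cause a receiving device to misread the intended error indication.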
If the bus devices are currently "obeying" the bus on which the information was found to be in error, they do not use that information but instead employ some error-recovery mechanism to ensure that the information they use is correct. This often involves having the transmitting device retransmit the information.
In this context, retransmission does not necessarily consist of transmitting exactly the same information. For instance, the transmitting device may actually be a pair of identical partnered devices that drive the bus in unison. Further error-detection circuitry may indicate that the fault lies in one of the two partners, in which case the defective partner "removes" itself from the bus before transmission again occurs. The retransmitted information thus differs from the originally transmitted information in that it is not corrupted by the defective partner.
In other cases, the retransmitted information is the same, but the information actually used is not. Specifically, if a receiving device has previously been "obeying" the bus on which the error occurred, the occurrence of an error on that bus will often cause it to change its "obey" state so that on retransmission it uses the information from the other bus, which typically is not in error.
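The change of "obey" state described above can be sketched as a small state machine. The sketch assumes exactly two buses, labeled "A" and "B", and a receiving device that is told which bus was found to be in error; the class and method names are hypothetical.

```python
class BusReceiver:
    """Receiving device that obeys one of two duplicated buses at a time."""

    def __init__(self, obeyed_bus: str = "A"):
        self.obeyed_bus = obeyed_bus

    def notify_error(self, faulty_bus: str) -> bool:
        """Handle an error notification.

        Returns True if the obeyed bus was the faulty one, meaning the
        device switched buses and must wait for retransmission.
        """
        if faulty_bus != self.obeyed_bus:
            return False  # error on the other bus: nothing to recover
        self.obeyed_bus = "B" if faulty_bus == "A" else "A"
        return True

    def receive(self, bus_a: int, bus_b: int) -> int:
        """Use the information from the currently obeyed bus only."""
        return bus_a if self.obeyed_bus == "A" else bus_b


receiver = BusReceiver("A")
# First transmission: bus A is corrupted and an error on A is signaled.
needs_retransmit = receiver.notify_error("A")
# Retransmission: the same signals appear, but the receiver now obeys bus B.
value = receiver.receive(bus_a=0xFF, bus_b=0x5A)
print(needs_retransmit, receiver.obeyed_bus, hex(value))
```

In this sketch the retransmitted signals are unchanged; only the receiver's bus selection differs, which is why the information it actually uses is no longer the corrupted copy.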
In short, the strategy employed in such fault-tolerant systems is to have many devices check for errors and have them notify all devices when an error occurs on any bus. This enables a bus device to take appropriate action whenever it is notified of an error on the bus that it obeys. The resulting operation is quite robust in the face of various types of faults that would otherwise result in erroneous operation.