In prior multiprocessing systems, fault-handling mechanisms are provided that include error detection, diagnosis, logging, and reporting of the fault to higher levels of the system. System resources detect the presence and extent of the failure and pass this information to recovery mechanisms. After fault detection, the recovery mechanisms activate redundant components to take over the operations previously handled by the faulty component. An example of such a system is described in the following patents, all of which are assigned to Intel Corporation:
U.S. Pat. No. 4,438,494 "Method and Apparatus of Fault Handling in a Multiprocessing System" of David Budde et al., filed on Aug. 25, 1981;
U.S. Pat. No. 4,503,534 "Apparatus for Redundant Operation of Modules in a Multiprocessing System" of David Budde et al., granted on Mar. 5, 1985; and
U.S. Pat. No. 4,503,535 "Apparatus for Recovery from Failures in a Multiprocessing System" of David Budde et al., granted on Mar. 5, 1985.
In these prior patents, appropriate response to hardware-error conditions is based upon a confinement area concept which partitions the interconnect system of the multiprocessor into a number of areas. The confinement areas provide error-detection mechanisms appropriate to deal with the kind of information flowing across the confinement area boundaries.
There is a confinement area for each module and memory bus in the system. A detected error is confined to one of the system building blocks, which allows a recovery mechanism to effectuate the replacement of the affected building block. Detection mechanisms reside at every interface, such that all data is checked as it flows across the interface between confinement areas. Error detection within a confinement area is performed by duplicating components as described in U.S. Pat. No. 4,176,258 of Daniel Jackson, granted on Nov. 27, 1979 and assigned to Intel Corporation. In the Jackson patent, detection of errors is accomplished by a redundancy method known as functional redundancy checking (FRC), in which a component is duplicated and output signals from the two identical components are compared.
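The comparison at the heart of functional redundancy checking can be illustrated with a minimal software sketch. This is not the circuitry of the Jackson patent; the names frc_check, good_adder, and stuck_bit_adder are hypothetical, and the "stuck bit" fault is an assumed example chosen only to show a disagreement between the two copies.

```python
from typing import Callable, Tuple


def frc_check(
    master: Callable[[int], int],
    checker: Callable[[int], int],
    value: int,
) -> Tuple[int, bool]:
    """Feed the same input to both copies of a component.

    Returns the master's output and an error flag that is True
    when the two outputs disagree.
    """
    out_master = master(value)
    out_checker = checker(value)
    return out_master, out_master != out_checker


def good_adder(x: int) -> int:
    """Hypothetical fault-free component: adds one to its input."""
    return x + 1


def stuck_bit_adder(x: int) -> int:
    """Hypothetical faulty copy: output bit 0 is stuck at zero."""
    return (x + 1) & ~1


# Two healthy copies agree, so no error is flagged.
agree = frc_check(good_adder, good_adder, 4)

# The stuck bit corrupts the result for input 4, so the copies disagree.
mismatch = frc_check(good_adder, stuck_bit_adder, 4)

# For input 5 the correct result already has bit 0 clear,
# so the fault is masked and the comparison sees no difference.
masked = frc_check(good_adder, stuck_bit_adder, 5)
```

In this sketch the error flag reports only that the two copies disagree. The comparison alone does not indicate which copy is at fault, and a fault that does not disturb the output for a given input goes unnoticed until it does.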
Functional redundancy checking is not very effective as a recovery mechanism, but for many users it is important to keep the system running, even though there is a risk that the data will lack integrity. For these users it is desirable to be able to reconfigure the system to keep it running when a functional redundancy check indicates that one of the components has failed, even if there is a loss of data integrity.
It is therefore an object of this invention to provide a redundant module checking system in which a faulty module can be taken out of the system and the system reconfigured to operate with the nonfaulty module.