Fault-tolerant systems have been produced for a variety of applications. Some systems achieve fault tolerance by including redundant computing systems, each of which serves as a standby replacement for some or all of the others. So long as the replacement component is identical to the component being replaced, few problems arise in the handling of such replacements. When the component being replaced has been altered during the operation of the data processing system, the replacement component must be similarly altered. If the original failed component or an earlier replacement is reintroduced into the system, a system malfunction will result. This problem is especially severe in data processing systems with large numbers of inherently similar components that are subject to change in distinct and persistent ways as the data processing system operates. For instance, in large parallel computing systems, nodes are employed that include microprocessors with individual disk drive memories. During operation of the parallel processing system, the disks in such nodes store data and thus become "personalized" in accordance with particular system control functions. If such a node fails, is replaced, and the replacement itself fails and is replaced, only the latest replacement has an up-to-date "personality". A means is required to ensure that none of the older versions are permitted to rejoin the system.
The prior art discloses a number of methods and systems for enabling failed part replacements. In U.S. Pat. No. 3,665,418 to Bouricius et al., a fault-tolerant computer system employing stand-by redundancy is described. In the event of a failure of a subassembly, a switching system enables routing around the failed subsystem. In U.S. Pat. No. 4,633,467 to Abel et al., a system is described for enabling identification of a failed unit when the unit is "buried" within other units and is difficult to monitor. A probability listing is created that enables the fault to be assigned to the unit that is most probably inoperable. In U.S. Pat. No. 4,814,979 to Neches, the shutdown of one or more processors in a multi-processor system is immediately communicated throughout the system so that an interrupt sequence can be initiated.
In U.S. Pat. No. 4,412,281 to Works, a bus system is used, to which replacement parts are connected. By the expedient of changing an address on the bus, a replacement part can be substituted for a malfunctioning part, enabling the continuation of system operations. A similar reassignment method is taught in U.S. Pat. No. 4,442,502 to Friend et al., wherein redundant devices are substituted for malfunctioning devices by the switching of assigned identities.
In U.S. Pat. No. 4,847,837 to Morales et al., a local area network is disclosed which can identify the existence of a fault or error condition in the network, isolate it and alert service personnel to the existence and location of the problem. In U.S. Pat. No. 4,815,076 to Denney et al., a system reconfiguration technique is described that provides several alternatives for recovering from single or multiple component failures. The system locates and tests one or more configurations of a failure scenario and presents possible reconfiguration scenarios in order of preference. U.S. Pat. No. 4,891,810 to deCorlieu et al. describes a reconfigurable computing system that includes redundant elements. Reconfiguration of the system involves substitution of the redundant element for a malfunctioning element. However, if the system is in a critical computing operation, reconfiguration is postponed to a later time. Chao, in U.S. Pat. No. 4,866,712, describes a method and apparatus for fault recovery which includes an error table and an action table. When an error count exceeds a threshold, corrective action is initiated in accordance with the aforesaid tables.
U.S. Pat. No. 3,805,039 to Stiffler and U.S. Pat. No. 4,920,497 to Upadhyaya et al. both teach redundant systems wherein inoperable elements are determined and the system then maps its operations so as to avoid such inoperable elements. Stiffler also teaches the use of spare sub-elements as substitutes for the mapped-around elements. In U.S. Pat. No. 3,758,761 to Henrion, an electronic system "on a slice" is described wherein substitute redundant subsystems are provided on the slice and are enabled for substitution for malfunctioning subsystems by an external control circuit.
A consistent feature of the prior art is that the redundant replacement component is assumed to be a one-for-one replacement of a malfunctioned component. So long as the malfunctioned component is not personalized during its operation, this is a valid assumption. However, if personalization occurs during operations, a method and apparatus must be provided to enable the system to assure that any replacement is similarly personalized and that no improperly personalized replacement is used as a substitute.
Another problem that occurs with fault-tolerant systems is that a personalized component may malfunction on a transient basis, be replaced by a redundant unit, and at some later time, be reactivated after the transient malfunction has ended. Under such circumstances, the system must have a means for determining that an already-replaced component is attempting to reassert itself into the system. The system must also ensure that no other component has already been activated as a replacement. Under such circumstances, the system should normally ignore such a reassertion action, as the personalization state of the component attempting reinstatement in the system is probably not as up-to-date as the component that replaced it.
Accordingly, it is an object of this invention to provide a data processing system with means for determining a level of personalization of each replaceable component in the system.
It is another object of this invention to provide a fault-tolerant data processing system wherein component replacement is controlled so as to prevent reactivation of a previously failed component that has already been replaced.
It is still another object of this invention to provide a fault-tolerant data processing system which assures that any replacement component is intended for the particular system in which it is being inserted.
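By way of illustration only, the admission decision implied by the foregoing objects can be sketched as follows. The sketch is not taken from this specification: the per-slot generation counter, the `Component` and `Registry` classes, and the `install` and `may_join` functions are hypothetical names and assumptions chosen for clarity. It shows a registry that admits a component only if the component was personalized for this system and is the most recent unit installed in its slot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A replaceable unit, e.g. a node with its own disk (hypothetical model)."""
    system_id: str    # system for which the unit was personalized
    slot: str         # position the unit occupies in the system
    generation: int   # assumed counter: 1 for the original, +1 per replacement


class Registry:
    """Tracks which generation is currently active in each slot (assumption)."""

    def __init__(self, system_id):
        self.system_id = system_id
        self._active = {}  # slot name -> generation of the active unit

    def install(self, slot):
        """Install a new unit in a slot, superseding any earlier unit there."""
        generation = self._active.get(slot, 0) + 1
        self._active[slot] = generation
        return Component(self.system_id, slot, generation)

    def may_join(self, unit):
        """Admit a unit only if it belongs to this system and is the latest."""
        if unit.system_id != self.system_id:
            return False  # personalized for a different system
        return self._active.get(unit.slot) == unit.generation


registry = Registry("system-A")
original = registry.install("node-7")      # original unit, later fails
first_spare = registry.install("node-7")   # replacement, later fails
second_spare = registry.install("node-7")  # latest replacement

print(registry.may_join(original))      # stale unit attempting to rejoin
print(registry.may_join(second_spare))  # current unit
```

Under these assumptions, a unit recovering from a transient malfunction presents an outdated generation and is refused, while the latest replacement is accepted. A unit personalized for a different system is refused regardless of its generation.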