It is desirable to identify faulty units automatically in computer systems that involve many peripheral units as well as processors, memories, etc. Telephony is a good example of an art in which fault identification and automatic system recovery have been developed to a high degree of sophistication. System redundancy and automatic detection of system malfunctions have long been employed in switching systems to enhance system reliability, even in older electromechanical systems. Redundancy, immediate fault detection and unit reconfiguration continue to be the mainstays of system reliability in the computer-controlled switching offices of today.
Many techniques are known for identifying the source of a system malfunction. In general, however, these techniques are based on system recovery algorithms that reconfigure the system and retry the operation on which a malfunction occurred until the operation completes successfully. If the algorithm is designed properly, the system is then able, by a process of elimination, to identify the offending (faulty) unit. While such techniques are workable in general, they have certain disadvantages. For example, some subsystems may have complex paths of communication in which any given unit may appear in more than one path. Depending on the specific characteristics of such a subsystem, it may be difficult to design a recovery strategy based on the traditional reconfigure-and-retry approach. Algorithm reliability, that is, the ability to identify a faulty unit consistently and accurately in a complex system or subsystem, may be inadequate. The amount of dedicated software required to recover a system and to identify a faulty unit may become burdensome in a complex environment. In some cases, the system time required to perform the recovery and identification steps may be intolerable, as in the real-time environment of a telephone office.
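
By way of illustration only, the conventional reconfigure-and-retry approach described above might be sketched as follows. The sketch assumes a single faulty unit and models each configuration as a set of unit names. The function name `reconfigure_and_retry`, the `attempt` callback, and the unit names are hypothetical and are not taken from any particular system.

```python
from typing import Callable, FrozenSet, Iterable, Optional, Set, Tuple

# A configuration is modeled as the set of units it uses (an assumption of this sketch).
Config = FrozenSet[str]


def reconfigure_and_retry(
    configurations: Iterable[Config],
    attempt: Callable[[Config], bool],
) -> Tuple[Optional[Config], Set[str]]:
    """Retry an operation on successive configurations until one succeeds.

    Returns the first working configuration (or None if every attempt
    failed) together with the units that remain under suspicion.
    Assumes exactly one unit is faulty.
    """
    suspects: Optional[Set[str]] = None  # None means no failure seen yet
    for config in configurations:
        if attempt(config):
            # Every unit in a working configuration is exonerated.
            return config, (suspects or set()) - config
        # Under the single-fault assumption, the faulty unit is present
        # in every configuration that failed, so intersect.
        suspects = set(config) if suspects is None else suspects & config
    return None, suspects or set()


if __name__ == "__main__":
    # Hypothetical subsystem with duplicated processors and buses.
    configs = [
        frozenset({"cpu0", "bus0", "disk0"}),
        frozenset({"cpu1", "bus0", "disk0"}),
        frozenset({"cpu1", "bus1", "disk0"}),
    ]
    # Simulate a fault in bus0: any configuration that uses it fails.
    working, suspects = reconfigure_and_retry(configs, lambda c: "bus0" not in c)
    print(working, suspects)
```

In this example two attempts fail before the third succeeds, and elimination leaves `bus0` as the only suspect. The sketch also hints at the disadvantages noted above: each retry consumes system time, and when no configuration succeeds the function can only return the units common to all failed attempts, which may be more than one unit.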