In a system having a plurality of processor nodes, it becomes difficult to isolate failures. As pointed out, e.g., by U.S. Pat. No. 6,031,819, a failure detector and alarm may be provided at each node of a system to detect and isolate a failure, with the result that failure isolation is expensive. An alternative method is to provide a detailed map of the system and specific diagnostic routines to run on the system, but this method may require an external device, such as a laptop, with special purpose software or hardware to communicate with the specific processors and aid in the diagnosis. In a distributed control system, such as is employed in an automated data storage library, each of the processor nodes may comprise a microprocessor and a non-volatile memory. Thus, the routines may be maintained on an external processor, such as a laptop PC, having a connector cable and diagnostic software, and the routines must be tailored to the specific configuration of processors of the distributed control system. The user must be familiar with any emulator software employed, and the diagnostic routines must be able to run over that software. Further, the diagnostic routines must be maintained to keep pace with changes in the underlying distributed processing system over time.
Such diagnostic routines can locate only “hard” failures, i.e., failures that are still present at the time that the diagnostic routines are being run. Failures involving communication across a network, such as a multi-drop bus network, may be intermittent, so that it is difficult for such diagnostic routines to locate and diagnose them. One example of such a failure is a loose pin which occasionally makes contact and occasionally presents an open circuit.
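The difficulty with intermittent failures may be illustrated by the following minimal sketch, which is offered only as an example and is not taken from any actual diagnostic routine. The functions `loose_pin`, `one_shot_diagnostic` and `fault_ticks`, and the assumption that the pin opens on every seventh time tick, are hypothetical and chosen solely to make the behavior deterministic.

```python
from typing import Callable, List


def loose_pin(tick: int) -> bool:
    """Hypothetical loose pin: returns True when the pin makes contact.

    Assumption for illustration only: the pin is an open circuit on
    every 7th tick and makes contact on all other ticks.
    """
    return tick % 7 != 0


def one_shot_diagnostic(link_ok: Callable[[int], bool], tick: int) -> bool:
    """Return True only if the failure is present at the moment the routine runs."""
    return not link_ok(tick)


def fault_ticks(link_ok: Callable[[int], bool], ticks: range) -> List[int]:
    """Ground truth of the simulation: the ticks at which the circuit was open."""
    return [t for t in ticks if not link_ok(t)]


# The circuit was open three times during ticks 1 through 21 ...
occurred = fault_ticks(loose_pin, range(1, 22))

# ... yet a diagnostic routine run at tick 10 finds nothing wrong,
# because the pin happens to be making contact at that moment.
found_at_10 = one_shot_diagnostic(loose_pin, 10)

# The routine reports the failure only if it happens to run while the pin is open.
found_at_14 = one_shot_diagnostic(loose_pin, 14)
```

In this sketch the failure is located only when the routine happens to run on a tick at which the circuit is open, which corresponds to the “hard” failure case described above.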
Diagnosis of such failures therefore incurs the service cost of bringing a trained user with an external processor to the system to conduct the diagnostics and to isolate and locate the failures. The system may be down, or may be unreliable, for the entire time between the first occurrence of a failure and the isolation, location and repair of the failure.