I. Field of the Invention
This invention relates generally to an improved method and apparatus for assuring that a digital computing system is operating error-free and for providing information sufficient to localize the failed unit for repair in the event of a fault.
II. Discussion of the Prior Art
With the ever-increasing complexity of digital computing systems and the growing requirements for better system availability and lower mean time to repair, improved techniques for the detection and isolation of logical malfunctions or faults are essential. Various techniques for both fault detection and fault isolation are well known in the art. The earliest of these techniques employed diagnostic test software to assure the operability of the computer system before running an application program. More recently, it has become the practice to augment such software with built-in hardware fault detection circuitry so that a fault can be detected the instant it occurs, even while the computer is running an application program. Although this added fault detection hardware improved confidence in error-free operation, it did not, by itself, improve confidence in fault isolation. When the computer is running an application program, it can sense only that an error has occurred. To isolate the fault, the system must be stopped and a diagnostic routine loaded and run. This can be a very time-consuming process, particularly if the fault is of an intermittent nature.
A further improvement has been the addition of fault indicators on each of the replaceable subassemblies, i.e., printed circuit cards. When a fault occurs, it is captured in a latch on the card which produced the fault, and the latch activates the indicator. In theory, all that is required to repair the fault is to replace the card with the lighted fault indicator. In practice, this has not provided good isolation because, in addition to the true fault indication, other fault indicators are lit due to propagation of errors from the original fault.
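The ambiguity described above can be illustrated with a minimal sketch. The model is an assumption made only for illustration and is not taken from the prior art itself: cards are arranged in a linear chain, each card has one fault latch, and corrupted data from a faulty card reaches every card after it. The function name `lit_indicators` and its parameters are hypothetical.

```python
def lit_indicators(num_cards, faulty_card):
    """Return the indices of cards whose fault latch is set.

    Hypothetical model: cards form a linear chain, and an error that
    originates on `faulty_card` propagates to every later card, each of
    which latches a fault of its own.
    """
    latches = [False] * num_cards
    data_corrupt = False
    for card in range(num_cards):
        if card == faulty_card:
            data_corrupt = True      # the true fault originates here
        if data_corrupt:
            latches[card] = True     # latch set by the fault or by propagation
    return [card for card, lit in enumerate(latches) if lit]


# A single fault on card 2 of 5 lights the indicators on cards 2, 3 and 4.
print(lit_indicators(5, 2))  # prints [2, 3, 4]
```

In this simplified model the indicators alone cannot identify which of the lit cards should be replaced, which is the isolation problem noted above.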
In an ideal fault-capture system, every possible fault which might intermittently occur would be immediately sensed, and all of the detailed information required to unambiguously isolate the fault to the lowest repairable assembly would be accumulated before the system was stopped. Further, there would be a corollary fault injection means to assure that all of the fault-capture circuitry was, in fact, operational. It is unlikely that any capture scheme will fully achieve this ideal, because the overhead of additional circuitry, with its associated cost and complexity, increases substantially as the ideal is approached. However, the instant invention provides a substantial improvement over the prior art in that fault isolation is markedly improved with relatively simple capture circuitry which imposes a minimum overhead penalty.