Data processing systems are used in virtually all areas of modern society. The ever-increasing dependence on such systems is leading to demands for increased system availability and lower mean time to repair. Therefore, improved techniques for the detection and isolation of faults are essential.
Various techniques for both fault detection and fault isolation are well known in the art. The earliest of these techniques employed diagnostic test software to assure the operability of the computer system before running an application program. More recently, it has become the practice to augment such software with built-in hardware fault detection circuitry so that a fault can be detected when it occurs, even while the computer is running an application program.
A further improvement to the foregoing involves the addition of fault indicators on each of the replaceable subassemblies such as the printed circuit cards. Now when a fault occurs, the corresponding indicator is activated. In theory, a technician may address the situation simply by replacing the unit associated with the activated fault indicator.
The use of fault indicators alone does not solve all problems, however. When a fault occurs in a data processing system, it will likely propagate to other areas of the system. As a result, other fault indicators will be activated. This type of ripple effect may result in a large number of fault reports. Some type of analysis is needed to determine which fault is the source of the problem.
One way to address this situation is to use a timestamp method to aid in the foregoing analysis. According to this process, when a fault is detected, the contents of an associated counter/timer are recorded. Assuming all counters within the system are synchronized, the timer records can be used as timestamps that determine the chronological order of the faults. The first-occurring fault can thereby be identified as the likely source of all subsequently reported problems.
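The timestamp method described above can be illustrated with a minimal sketch. The record layout, the unit names, and the function name `first_fault` are hypothetical and chosen only for illustration; the sketch assumes that all counters are synchronized, so that their recorded contents are directly comparable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRecord:
    unit: str       # hypothetical name of the replaceable unit reporting the fault
    timestamp: int  # counter/timer contents recorded when the fault was detected


def first_fault(records):
    """Return the record(s) carrying the earliest timestamp.

    A result with more than one entry means the timestamps alone
    cannot single out one first-occurring fault.
    """
    if not records:
        return []
    earliest = min(r.timestamp for r in records)
    return [r for r in records if r.timestamp == earliest]


# Illustrative fault reports, listed in arbitrary arrival order.
records = [
    FaultRecord("card_C", 1042),
    FaultRecord("card_A", 1007),
    FaultRecord("card_B", 1019),
]

likely_source = first_fault(records)
```

In this illustrative data set, `likely_source` contains only the record for `card_A`, which would be treated as the likely origin of the faults later reported by `card_B` and `card_C`.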
This prior art approach has several limitations. First, multiple faults may occur within a short period of time. If timestamps are not captured at a high enough frequency, multiple faults may have the same timestamp. Moreover, some faults, such as clock and power faults, cannot readily be associated with accurate timestamps. Therefore, either no time indications will be present when these types of faults occur, or the timestamps that are available are likely to be inaccurate. Thus, an improved mechanism is needed to isolate the likely cause of multiple related failures.
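The first of these limitations can be seen in a small sketch. The event times, tick periods, and helper names below are assumptions made for illustration only. Each true fault time is quantized to a counter value, and the earliest counter value is then used to pick the first fault.

```python
def capture(event_time_ns, tick_ns):
    """Quantize a true event time to the counter value seen at the given tick period."""
    return event_time_ns // tick_ns


def earliest_units(times_ns, tick_ns):
    """Return the units whose captured timestamp equals the earliest one."""
    stamps = {unit: capture(t, tick_ns) for unit, t in times_ns.items()}
    first = min(stamps.values())
    return sorted(unit for unit, stamp in stamps.items() if stamp == first)


# Hypothetical true fault times, a few hundred nanoseconds apart.
true_times_ns = {"card_A": 1_000_120, "card_B": 1_000_480, "card_C": 1_000_910}

fine = earliest_units(true_times_ns, tick_ns=100)     # counter advances every 100 ns
coarse = earliest_units(true_times_ns, tick_ns=1000)  # counter advances every 1000 ns
```

With the 100 ns tick, `fine` identifies `card_A` alone. With the 1000 ns tick, all three faults receive the same counter value, so `coarse` lists every unit and the chronological order can no longer be recovered from the timestamps.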