As today's computer systems are continually enhanced with respect to speed and computing power, their intricacy and complexity often increase in parallel. To maintain a high degree of reliability in such systems, error processing is used to locate, report, and act on system faults and on faults within particular components of the system. Error processing is important during the real-time execution of the computer system, and is equally important during system test.
Various testing mechanisms have been used to help locate system or component errors, and often to assist in error recovery. Many testing mechanisms are most beneficial where a system or unit is not operating in real time. For example, test fixtures using bed-of-nails and in-circuit testing provide adequate testing at the component or trace level. Furthermore, scan testing is often used for custom VLSI chips and ASICs, because their internal signals are not otherwise accessible. Scan methods treat any digital circuit as a collection of registers or flip-flops interconnected by combinatorial logic, and test patterns are shifted into a large shift register organized from the storage elements of the circuit.
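The shifting step of a scan operation can be illustrated with a minimal sketch. It assumes the storage elements are modeled as a simple list of bits, and the name `scan_shift` is hypothetical and used only for illustration. It shows only how a pattern is shifted in while the previous contents are shifted out, not how any particular chip implements scan.

```python
def scan_shift(chain, pattern):
    """Shift `pattern` into a scan chain one bit per clock.

    `chain` is a list of 0/1 values modeling the flip-flops, with index 0
    as the scan input end and the last index as the scan output end.
    Returns the new chain contents and the bits that were shifted out.
    This is an illustrative model, not a description of real hardware.
    """
    chain = list(chain)
    shifted_out = []
    for bit in pattern:
        # The bit at the output end leaves the chain on each clock.
        shifted_out.append(chain[-1])
        # Every remaining bit moves one position toward the output end.
        chain = [bit] + chain[:-1]
    return chain, shifted_out


if __name__ == "__main__":
    new_state, old_state = scan_shift([1, 1, 0, 0], [1, 0, 1, 1])
    print(new_state, old_state)
```

Because the first pattern bit travels furthest along the chain, a pattern as long as the chain ends up stored in reverse order, and the prior register contents emerge from the output end starting with the last flip-flop.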
Testing is often most beneficial during real-time execution of the system, where actual signals can be used rather than simulated signals. Furthermore, error detection and recovery during normal computer operation are necessary to maintain a high level of system reliability. For example, error processing in a mainframe computer system may be configured to generate interrupts to a special error processing device, either during diagnostic testing or during normal computer operation. This special processing device, referred to as a support controller, is designed to take recovery actions based on the particular unit within the system that failed or generated a fault condition. The support controller may itself monitor a particular unit, or may act on a reported error originating from a particular unit or from a "watchdog" hardware/software component which monitors for specific faults and reports them to the support controller.
It is not always desirable, however, to allow the support controller to perform unit recovery, since this recovery can be intrusive, interrupting a unit's normal operations. For example, sometimes the support controller must stop a unit's clock to perform a scan operation, which clears an existing error condition. Moreover, support controller-initiated recovery is often unacceptably slow. Finally, certain unit errors cause the support controller to remove the affected unit from the running partition, which requires operator intervention before the unit is again operational.
The problems discussed above become particularly evident during hardware test, and even more particularly where bus designs are involved. Design errors can cause problems which cannot be addressed using support-controller-based recovery. Assume, for example, that a hardware problem is discovered during testing of an input/output (I/O) unit. The problem may cause a "Bus Error" condition to be incorrectly latched every time a certain sequence occurs on the I/O bus. Although the support controller can reinitialize and restart the unit, testing cannot continue, because the test procedure that generated the error generates it again, so that testing continuously stops and restarts.
A less intrusive recovery mechanism is therefore needed so that testing can continue despite known hardware problems. Ideally, such a mechanism would allow bypassing particular unit errors at the unit level, so that the support controller does not require notification every time that a persistent hardware fault occurs. The present invention provides a flexible, bifurcated error processing scheme which allows particular unit errors to be managed separately so that these particular errors can be masked to allow testing to be completed.
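The idea of masking particular unit errors can be sketched as follows. This is a minimal illustration under stated assumptions, not the mechanism of the invention: the error bit names, the mask register, and the function `route_errors` are all hypothetical. It shows only how latched error bits might be split into those handled at the unit level and those reported to the support controller.

```python
# Hypothetical error bit assignments, chosen only for illustration.
BUS_ERROR = 0x1
PARITY_ERROR = 0x2
TIMEOUT_ERROR = 0x4


def route_errors(error_bits, mask_bits):
    """Split latched error bits into two paths.

    Bits set in `mask_bits` are kept at the unit level; all other error
    bits are passed on for reporting to the support controller.
    Returns a tuple (handled_locally, reported).
    """
    handled_locally = error_bits & mask_bits
    reported = error_bits & ~mask_bits
    return handled_locally, reported


if __name__ == "__main__":
    # A known, persistent "Bus Error" is masked so that testing can go on.
    local, reported = route_errors(BUS_ERROR | PARITY_ERROR, BUS_ERROR)
    print(hex(local), hex(reported))
```

With the known "Bus Error" masked in this way, only the remaining, unmasked errors would interrupt the support controller.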
The flexibility of using a bifurcated error processing scheme as in the present invention further paves the way for providing other testing schemes which allow testing to be completed without interruption. The present invention further provides a bus analyzing scheme which takes advantage of the bifurcated error processing scheme of the present invention.
During hardware testing, logic state analyzers are often used to discover logic problems. A logic analyzer in its simplest form is a multichannel oscilloscope used to detect logic states. Logic analyzers also include memory for shifting in data and storing a predetermined number of bits for each analyzer channel. Such logic analyzers can be particularly helpful in analyzing bus signals. In some prior art systems, custom-built analyzers have been used to test the I/O bus and I/O units.
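The per-channel storage described above can be modeled with a short sketch. It assumes each channel keeps only its most recent samples in a fixed-depth buffer. The class name `ChannelCapture` and its methods are hypothetical, and the sketch does not represent any particular analyzer.

```python
from collections import deque


class ChannelCapture:
    """Illustrative model of a logic analyzer's capture memory."""

    def __init__(self, channels, depth):
        # One fixed-depth buffer per channel; once a buffer is full,
        # the oldest sample is discarded as each new one is shifted in.
        self.buffers = [deque(maxlen=depth) for _ in range(channels)]

    def sample(self, states):
        """Record one logic state (0 or 1) per channel."""
        if len(states) != len(self.buffers):
            raise ValueError("expected one state per channel")
        for buf, state in zip(self.buffers, states):
            buf.append(1 if state else 0)

    def history(self, channel):
        """Return the stored samples for a channel, oldest first."""
        return list(self.buffers[channel])


if __name__ == "__main__":
    cap = ChannelCapture(channels=2, depth=3)
    for states in [(1, 0), (0, 0), (1, 1), (0, 1)]:
        cap.sample(states)
    print(cap.history(0), cap.history(1))
```

A fixed depth reflects the "predetermined number of bits" stored for each channel: the memory holds a sliding window of recent bus activity rather than a complete record.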
While custom-built analyzers provide bus visibility, they have several disadvantages. First, they may occupy a card slot in the computer system, so that the computer cabinet involved in the test cannot be fully populated. Also, the analyzers are not monitored from the operator's console, and therefore require a person to gather test results by going back and forth between the console and the analyzer located in the I/O cabinet. This inconvenience may increase the time needed to complete testing.
Furthermore, if a failure occurs during normal system operation and bus analysis is required, the cabinet housing the failed component must be powered down so that an analyzer can be inserted in an empty card slot of the system. This is because an analyzer is not normally resident in the machine. After the cabinet is reloaded and reinitialized, the error must be recreated, which is often very difficult to accomplish.
Finally, information stored in a custom in-cabinet analyzer is not available for unit recovery. For example, a particular unit cannot read the data stored in the logic analyzer to determine the fault type, which it would need in order to take the appropriate recovery action.
The present invention overcomes these problems by providing a bus history stack which is used in conjunction with a programmable bus analyzer. The testing flexibility of the bifurcated error processing scheme, in conjunction with the programmable bus analyzer and bus history stack, provides for a more user-friendly yet powerful testing and error analysis/recovery environment. The present invention therefore provides a solution to the aforementioned and other problems, and offers other advantages over the prior art.