The present invention relates generally to error handling in computer systems, and more particularly to check stop error handling in such systems.
When a hardware fault is detected in a digital computer system, sometimes the fault is so severe or the risk of data corruption so great that detection of the error is designed to cause an immediate halt of further operations. Except for performing a complete system reset, there is no means of recovering from this state, which is typically called a Check Stop state. Because of the severity of the error, it is important to be able to determine the source of the error so that the failing component can be replaced quickly and the system restored to normal operation.
However, since the main processor is stopped in this condition, a separate processing mechanism is needed to capture failure information. The mechanism is usually referred to as a Service Processor, which provides embedded controller operations that remain available even when check stop failures occur. Unfortunately, sophisticated processing mechanisms are needed to extract failure information from the failing components when all the normal functional paths are frozen and to perform analysis on the information. Including such sophisticated processing mechanisms, however, increases the system's costs.
Further, typical systems contain very large amounts of error data in the form of latch bits. An engineering change to add even a single new latch bit changes the layout of an entire scan string of data and increases the amount of data needing to be extracted. Providing sufficient storage space to hold the increased data further adds to overall system costs.
Accordingly, what is needed is a capable system for check stop error analysis and handling that functions on low-end computer systems, utilizes a basic, low-cost service processor, and requires relatively small storage space.
These needs are met through the present invention, which provides method and system aspects for check stop error handling. A method aspect for check stop error handling in a computer system, the computer system comprising a plurality of components including a processor that supports an operating system and firmware, includes utilizing a service processor following a check stop error for error data retrieval and attempting a reboot of the computer system. The method further includes initiating firmware for failure reporting based on the error data retrieval when the reboot is successful. In another method aspect, the method includes performing error data retrieval from fault isolation registers of the plurality of components using a service processor following a check stop error, and transforming the error data into an abstracted error log via the firmware after a successful reboot.
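The division of labor described in the method aspects can be illustrated with a minimal sketch. It is not the claimed implementation: the names `Component`, `ServiceProcessor`, `firmware_build_error_log`, and `handle_check_stop` are hypothetical, and the sketch assumes that a nonzero fault isolation register value indicates a fault, that the service processor holds the raw values across the reboot, and that the abstracted error log is a simple list of per-component entries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Component:
    name: str
    fir: int  # fault isolation register value; nonzero means fault bits set (assumption)


@dataclass
class ServiceProcessor:
    # Raw register values kept across the reboot (storage location is an assumption).
    saved: Dict[str, int] = field(default_factory=dict)

    def retrieve_error_data(self, components: List[Component]) -> Dict[str, int]:
        # Only read and store raw register values; no analysis is done here,
        # which keeps the service processor simple.
        self.saved = {c.name: c.fir for c in components if c.fir != 0}
        return self.saved

    def attempt_reboot(self, reboot: Callable[[], bool]) -> bool:
        # The reboot itself is supplied by the caller in this sketch.
        return reboot()


def firmware_build_error_log(raw: Dict[str, int]) -> List[dict]:
    # Firmware, running on the main processor after a successful reboot,
    # turns raw register values into an abstracted log (format is hypothetical).
    return [
        {
            "component": name,
            "fault_bits": [i for i in range(fir.bit_length()) if (fir >> i) & 1],
        }
        for name, fir in raw.items()
    ]


def handle_check_stop(
    components: List[Component],
    sp: ServiceProcessor,
    reboot: Callable[[], bool],
) -> Optional[List[dict]]:
    raw = sp.retrieve_error_data(components)   # step 1: error data retrieval
    if not sp.attempt_reboot(reboot):          # step 2: attempt a reboot
        return None                            # no firmware reporting without a reboot
    return firmware_build_error_log(raw)       # step 3: failure reporting by firmware


if __name__ == "__main__":
    parts = [Component("cpu0", 0b101), Component("mem_ctrl", 0)]
    print(handle_check_stop(parts, ServiceProcessor(), lambda: True))
```

In this sketch the service processor only copies register values, while interpretation is deferred to firmware running on the rebooted main processor, which mirrors the stated aim of using a basic, low-cost service processor.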
In a system aspect, a computer system with check stop error handling includes a processing mechanism, the processing mechanism supporting an operating system, and a service processor coupled to the processing mechanism, the service processor performing error data retrieval following a check stop error. The system further includes a firmware mechanism supported by the processing mechanism, the firmware mechanism performing failure reporting based on the error data retrieval.