When a computer system fails, it is often very difficult to determine the exact cause of the failure. As those skilled in the art will appreciate, both the software and hardware components of a computer system can fail, though typically hardware components are not the initial suspect as they are expected to operate for long periods of time without any failures. Nevertheless, hardware components do occasionally fail.
On most computer systems, hardware components implement basic error reporting measures, including writing error codes into component-related registers or memory locations, or signaling the computer system, particularly the operating system, that a hardware error has occurred. Upon notification of the error, the computer system, typically some portion of the operating system, takes responsive action. The appropriate action depends on the nature of the hardware error. For example, if the error was corrected by the system, such as by hardware logic that corrects simple memory bit errors, no immediate action need be taken. If the error represents an uncorrectable but localized issue, the computer system may respond by isolating the faulty component and avoiding its use. “Sparing out” a range of faulty memory locations, i.e., setting aside the faulty range as unavailable memory, is one example of isolating a localized problem without actually “fixing” the faulty component (memory). On the other hand, if the hardware error is a catastrophic failure of a fundamental component, and/or one in which continued operation risks data corruption or permanent computer system damage, the appropriate action may be to shut down or immediately halt operation, gracefully or otherwise. Of course, numerous other actions may be taken, depending on the type of error condition and the ability of the computer system to cope with the particular error.
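The three responses described above (no immediate action for a corrected error, isolation for a localized uncorrectable error, and a halt for a catastrophic one) can be illustrated with a minimal sketch. It is not drawn from any particular operating system; the names `Severity`, `Action`, `HardwareError`, `MemoryMap`, `choose_action`, and `respond` are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional, Tuple


class Severity(Enum):
    """Hypothetical severity classes matching the three cases in the text."""
    CORRECTED = auto()    # already fixed by hardware, e.g. a simple memory bit error
    RECOVERABLE = auto()  # uncorrectable but localized
    FATAL = auto()        # catastrophic failure of a fundamental component


class Action(Enum):
    LOG_ONLY = auto()  # no immediate action needed
    ISOLATE = auto()   # set the faulty resource aside
    HALT = auto()      # shut down or halt operation


@dataclass(frozen=True)
class HardwareError:
    severity: Severity
    component: str
    # Inclusive (start, end) range of faulty addresses, if the error has one.
    address_range: Optional[Tuple[int, int]] = None


class MemoryMap:
    """Toy model of "sparing out": faulty ranges are recorded, not repaired."""

    def __init__(self) -> None:
        self.spared: List[Tuple[int, int]] = []

    def spare_out(self, start: int, end: int) -> None:
        self.spared.append((start, end))

    def is_usable(self, address: int) -> bool:
        return not any(start <= address <= end for start, end in self.spared)


def choose_action(error: HardwareError) -> Action:
    """Map an error's severity to one of the three responses."""
    if error.severity is Severity.CORRECTED:
        return Action.LOG_ONLY
    if error.severity is Severity.RECOVERABLE:
        return Action.ISOLATE
    return Action.HALT


def respond(error: HardwareError, memory: MemoryMap) -> Action:
    """Choose an action and, for a localized memory fault, spare out the range."""
    action = choose_action(error)
    if action is Action.ISOLATE and error.address_range is not None:
        memory.spare_out(*error.address_range)
    return action
```

In this sketch, a recoverable error carrying an address range leaves that range marked unusable while the rest of memory stays available, which mirrors isolating a localized problem without repairing the faulty component.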
In spite of the error reporting capabilities of hardware components, most computer operating systems fail to make effective use of available/reported hardware error information. There are several reasons for this. One reason is that operating systems are frequently designed to operate over a wide variety of hardware platforms, including widely differing processor architectures. Unfortunately, virtually every hardware platform reports hardware errors in its own unique manner. As such, an operating system's hardware error handling module must be tailored to a specific hardware platform. An operating system's hardware error handling module (or modules) may be further tied to a particular hardware platform because the module, as part of the operating system, operates entirely within the protected kernel mode of the operating system. As those skilled in the art will appreciate, updating the hardware error handling module after the operating system is installed is simply impractical.
Yet another reason why operating systems fail to make effective use of the available/reported hardware error information is that while a group of computer systems may have a common processor type, each “similar” computer system may surround the processor with differing supporting chip sets, and these chip sets frequently play a major role in reporting hardware errors. Similarly, computer system firmware and BIOS implementations also vary across similar systems, and may also play an important role in reporting hardware errors. Thus, even when two computer systems appear to be the same, their error handling modules may need to be tailored to a specific combination of processor, supporting chip set, and/or BIOS.
Still another reason why operating systems fail to make effective use of the available/reported hardware error information, even when such information is standardized, is that an operating system provider simply targets only that information which represents the “least common denominator” of information between platforms. That is, only hardware error information that is common across a variety of platforms is utilized. Unfortunately, actions based on the “least common denominator” of hardware error information are substantially inferior to the actions that might be taken if the richer error information were utilized.
Due to the platform-specific nature of hardware error reporting and the difficulty in tailoring more specific error handling to the numerous permutations of computer system architectures, current operating systems provide only the most generic hardware error handling modules. As a result, even though a substantial amount of hardware error information may be reported and/or available, it is largely underutilized, even to the point that some hardware errors may not be discovered, acted upon, or reported.
In light of the above issues with regard to hardware error reporting and recovery, what is needed is improved hardware error reporting and recovery capability integrated within an operating system. This improved capability should allow the operating system to fully utilize the hardware error reporting and recovery capabilities of the underlying hardware platform. It should allow the operating system to generate error records that describe error conditions in sufficient detail to allow human and software agents to identify the root cause of a given error condition and the hardware component(s) to which the error condition is attributed. The error record should be encoded such that it is generally applicable to all possible hardware error conditions, enabling human and software agents to process the error record in a generic fashion across all permutations of hardware components and hardware platform configurations. Additionally, the improved capability should provide both generic hardware handler components, which can be easily ported across a variety of processor architecture platforms, and platform-specific components, which are tailored to a specific computer system and can be easily updated, extended, and improved without requiring difficult operating system modification. The present invention addresses these and other issues found in the prior art.
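The division of labor called for above, between a generic handler that emits a uniformly encoded error record and platform-specific components that can be replaced independently, can be sketched as follows. This is only an illustration under stated assumptions, not the design of the invention: the names `ErrorRecord`, `GenericErrorHandler`, `register_decoder`, and `decode_example_memory_controller`, as well as the two-byte raw layout, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ErrorRecord:
    """Hypothetical generic record: the same fields for every error source."""
    source: str     # which error source reported the condition
    severity: str   # e.g. "corrected", "recoverable", "fatal", "unknown"
    component: str  # hardware component the error is attributed to
    details: Dict[str, str] = field(default_factory=dict)


# A platform-specific decoder turns raw error bytes into a generic record.
Decoder = Callable[[bytes], ErrorRecord]


class GenericErrorHandler:
    """Portable part: knows nothing about any specific platform's encoding."""

    def __init__(self) -> None:
        self._decoders: Dict[str, Decoder] = {}

    def register_decoder(self, source: str, decoder: Decoder) -> None:
        # Platform-specific parts can be added or replaced without touching
        # the generic handler itself.
        self._decoders[source] = decoder

    def handle(self, source: str, raw: bytes) -> ErrorRecord:
        decoder = self._decoders.get(source)
        if decoder is None:
            # No platform knowledge available: still emit a generic record.
            return ErrorRecord(source, "unknown", "unknown", {"raw": raw.hex()})
        return decoder(raw)


def decode_example_memory_controller(raw: bytes) -> ErrorRecord:
    """Assumed layout (illustrative only): byte 0 = severity, byte 1 = DIMM slot."""
    severity = {0: "corrected", 1: "recoverable", 2: "fatal"}.get(raw[0], "unknown")
    return ErrorRecord(
        source="example-memory-controller",
        severity=severity,
        component=f"DIMM{raw[1]}",
        details={"raw": raw.hex()},
    )
```

Because every decoder returns the same `ErrorRecord` shape, a consumer of the records in this sketch never needs to know which platform produced them, and an unregistered source still yields a record rather than being silently dropped.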