This invention relates to computer error handling. More specifically, the invention relates to a firmware-based error handling mechanism to support the creation, storage and retrieval of customized and extendible error records in computer platforms.
Modern computers are designed to monitor their own performance and to test themselves frequently to ensure that operations have been performed properly. When a fault occurs, a machine interrupt is typically issued, and the hardware and software attempt to locate and identify the error. Depending on the severity of the error, control programs may shut down the entire machine, may avoid use of the faulty component, or may simply record the fact that an error has occurred.
System error detection, containment and recovery are critical elements of highly reliable and fault tolerant computing environments. While error detection is primarily accomplished through hardware mechanisms, system software plays a greater role in the containment and recovery of errors. The degree to which overall error handling is effective in maintaining system integrity depends upon the level of coordination and cooperation between the system CPUs, platform hardware fabric, and system software. Vendors of such computer systems therefore have developed maintenance and diagnostic facilities as part of their computer platforms. When a system failure occurs, diagnostic software may attempt to determine the cause of the failure and may also attempt to store information describing the failure, so that subsequent efforts to resolve or eliminate the failure may benefit from the stored information.
In the prior art, software-based error handling mechanisms of the type described have traditionally resided in a portion of the computer operating system. As a result, operating system designers have been required to develop unique error handling subsystems for each supported computer platform. Because of this constraint, computer error handling capabilities have been relatively limited. In particular, designers of computing environments that host multiple operating systems have been forced to isolate the error management functions of each component operating system. Similarly, designers of complex computer platforms having multiple domains and/or partitions have been forced to deploy separate and isolated error management systems. Additionally, Original Equipment Manufacturers (OEMs) have been restricted in their ability to develop customized computer platforms that provide enhanced maintenance capabilities.
Accordingly, there is a need in the art for a unified and standardized approach to computer error handling at the firmware level, outside the traditional sphere of an operating system. Such an error handling mechanism would allow computer platform designers and operating system engineers to develop standard error management subsystems that make effective use of common interfaces and methods. A standard error handling mechanism would also permit OEMs to develop error parsers, utilities and enhanced maintenance diagnostics that do not depend on the specific features of any particular operating system.