1. Field of the Invention
The present invention relates in general to the field of information handling system operations, and more particularly to a system and method for information handling system error recovery.
2. Description of the Related Art
As the value and use of information continue to increase, individuals and businesses seek additional ways to process and store information. One option available to users is information handling systems. An information handling system generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes, thereby allowing users to take advantage of the value of the information. Because technology and information handling needs and requirements vary between different users or applications, information handling systems may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use such as financial transaction processing, airline reservations, enterprise data storage, or global communications. In addition, information handling systems may include a variety of hardware and software components that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.
As information handling systems manage increasingly complex and critical functions, manufacturers have sought to improve system reliability in order to minimize disruptions that might result from system failures. A number of management subsystems monitor operating conditions of an information handling system to detect and correct errors before system failure occurs. One example of such a management subsystem is a System Management Interrupt (SMI) handler running as firmware instructions on an information handling system, such as in the Basic Input/Output System (BIOS), to perform a variety of error handling functions related to memory. For example, an SMI handler running in BIOS on a server information handling system chipset typically maintains logs of correctable memory errors, uncorrectable memory errors, PCI and PCI-E errors, and chipset errors. Typically, multiple correctable errors in a system are a precursor to uncorrectable errors, so the SMI handler uses logged errors to initiate error handling functions such as spare memory copy and memory RAID/mirroring. For example, spare memory copy, also known as sparing, switches to a spare rank of memory when a threshold number of correctable errors is detected. By relying on memory within the system that is not associated with logged errors, sparing helps prevent uncorrectable errors that would hang the information handling system.
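The threshold-based sparing described above can be illustrated with a minimal sketch. This is a toy model only, not actual BIOS SMI handler code. The class name `SparingMonitor`, the rank labels, and the threshold value of 3 are all hypothetical and chosen for illustration, since real thresholds and rank management are platform-specific.

```python
from dataclasses import dataclass, field

# Hypothetical threshold; real firmware values are platform-specific.
CORRECTABLE_ERROR_THRESHOLD = 3


@dataclass
class SparingMonitor:
    """Toy model of threshold-based memory sparing (illustrative only)."""

    threshold: int = CORRECTABLE_ERROR_THRESHOLD
    # Spare ranks still available for use (hypothetical labels).
    spare_ranks: list = field(default_factory=lambda: ["spare0"])
    # rank -> number of correctable errors logged against it
    error_counts: dict = field(default_factory=dict)
    # failing rank -> spare rank that replaced it
    remapped: dict = field(default_factory=dict)

    def log_correctable_error(self, rank: str):
        """Log one correctable error for `rank`.

        Returns the spare rank switched to if this error triggered sparing,
        otherwise None.
        """
        if rank in self.remapped:
            return None  # rank has already been retired
        self.error_counts[rank] = self.error_counts.get(rank, 0) + 1
        # Spare only when the threshold is reached and a spare rank remains.
        if self.error_counts[rank] >= self.threshold and self.spare_ranks:
            spare = self.spare_ranks.pop(0)
            self.remapped[rank] = spare
            return spare
        return None
```

In this sketch, the first two correctable errors logged against a rank are only counted, and the third causes the single spare rank to replace the failing rank. Once the spare is consumed, further errors on other ranks are logged but cannot trigger sparing, which mirrors the limited protection that sparing offers.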
One difficulty with error handling by SMI handlers is that the code of the SMI handler itself typically relies on memory to perform error handling. For example, BIOS SMI code is typically located at a constant memory location within an information handling system, from which memory management functions including error handling are performed. When correctable errors are detected within a memory DIMM where BIOS SMI code is located, the errors may become uncorrectable before the BIOS SMI handler can take appropriate corrective action, such as initiating sparing or mirroring. Once the errors become uncorrectable, the SMI handler may be unable to initiate reliability, availability, and serviceability (RAS) features correctly if SMI handler code stored in the memory becomes corrupted. Sparing to correct errors associated with SMI handler code will not prevent system failure if the sparing is not performed before the errors become uncorrectable. Mirroring can recover from uncorrectable errors; however, mirroring typically needs hardware and chipset support and places a burden on the memory present in the system.