RAS (Reliability, Availability & Serviceability) is a critical requirement for enterprise-class servers. System uptime is measured against the goal of “five nines”, which represents 99.999% availability. Handling soft errors to achieve this RAS goal involves several different aspects of hardware and system software design, such as circuit and logic design, platform, firmware, and operating system (OS) design. The first priority is typically to minimize the actual occurrence of soft errors at the hardware level, within the practical constraints of device physics and logic/system design trade-offs. Automatic detection and correction of errors in hardware is the preferred method.
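To make the "five nines" goal concrete, the availability figure can be converted into a downtime budget. The following is a minimal sketch of that arithmetic; the function name and the choice of a 365-day year are illustrative assumptions, not part of any RAS specification.

```python
def allowed_downtime_minutes(availability: float, days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, for a given availability
    fraction over a period of `days` days (assumed 365-day year by default)."""
    minutes_in_period = days * 24 * 60
    return (1.0 - availability) * minutes_in_period


# 99.999% availability over one year
five_nines = allowed_downtime_minutes(0.99999)
print(f"{five_nines:.2f} minutes of downtime per year")
```

Under these assumptions, 99.999% availability leaves a budget of roughly five minutes of downtime per year, which is why uncorrected errors that force a system reset are so costly.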
The occurrence of soft errors cannot be completely eliminated by good circuit design techniques, and at times circuit design innovations are limited by practical bounds. In such cases, the most effective way to combat soft errors is to protect the processor internal structures, the memory subsystem, the system bus, and the I/O (input/output) fabric using various error protection, detection, and correction techniques. The most commonly used hardware techniques include parity, ECC (error correction code), and CRC (cyclic redundancy check) protection schemes. When detected soft errors cannot be corrected by hardware through these protection schemes, the responsibility for handling them is left to the system software, using error log information provided by the underlying software layers. System hardware does not rely on software to actually correct the errors, but rather to take the necessary corrective action from a software perspective (e.g., system reset, application termination, etc.).
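The detection side of parity and CRC protection can be illustrated in software. The sketch below is only an analogy for what hardware does in logic: the helper names and sample values are hypothetical, and Python's standard zlib.crc32 stands in for whatever CRC polynomial a given bus or fabric actually uses.

```python
import zlib


def even_parity_bit(word: int) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def flip_bit(word: int, bit: int) -> int:
    """Model a soft error by inverting one bit of the word."""
    return word ^ (1 << bit)


# Parity: a single bit flip changes the parity and is detected.
stored_word = 0b10110010
stored_parity = even_parity_bit(stored_word)
corrupted_word = flip_bit(stored_word, 3)
parity_error_detected = even_parity_bit(corrupted_word) != stored_parity

# A second flip restores the original parity, so plain parity misses it.
double_flip_word = flip_bit(corrupted_word, 5)
double_flip_missed = even_parity_bit(double_flip_word) == stored_parity

# CRC: the checksum stored with the payload no longer matches after a flip.
payload = bytes([0x12, 0x34, 0x56, 0x78])
stored_crc = zlib.crc32(payload)
corrupted_payload = bytes([payload[0] ^ 0x01]) + payload[1:]
crc_error_detected = zlib.crc32(corrupted_payload) != stored_crc
```

Parity and CRC in this form only detect an error. Correcting it requires either a retry of the transfer or an ECC code that carries enough redundancy to locate the flipped bit.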
Hardware error handling in most operating systems today is a complex process. The OS contains intelligence to parse some generic hardware error information based on standardized architecture registers or model-specific registers (MSRs), classify the errors, and determine actions. However, the OS does not have intimate knowledge of the platform hardware topology and its register space, which vary across different OEMs (original equipment manufacturers). Standardizing the platform hardware error registers is a possible solution. However, this solution requires both platform and processor hardware changes and limits scalability, not to mention the constant OS changes needed to support new platform capabilities that tend to evolve over time.
Some of the existing error handling architectures and implementations assume that certain system error functions are physically distinct and that their scope is tied to either a processor or the platform. Error signaling and error reporting are tightly coupled to this structure, and the OS is also expected to have implied knowledge of what constitutes processor and platform functions. Due to integration of some of the platform hardware functions, such as the Memory Controller and North Bridge, onto future processor sockets, the physical locality of the platform chipset error entities is no longer deterministic across various implementations. This change in system design also requires an abstraction from an OS perspective. Therefore, it is desirable to abstract, from a system software viewpoint, any implied knowledge of how the underlying implementation separates processor and platform error functions.
In addition, there are challenges due to different system software components managing errors for different platform hardware functions without any coordination with each other. Examples include error management through SMI (System Management Interrupt)-based firmware, system management controller (SMC) firmware, OS-based device drivers, etc. Some of these components are visible to the OS, while others are not.
Some of the errors managed by these platform entities may eventually be propagated to the OS level. An OS is therefore also expected to handle an assortment of hardware errors from several different sources, with limited information about and knowledge of their control path, configuration, signaling, reporting, etc. This creates major synchronization challenges between different system software components. It would therefore be advantageous to have an architectural framework to facilitate coordination between the OS and other platform components for overall system error management.