Server systems, especially expandable, scalable, and mission-critical systems, are built with error detection, error correction, error recovery and error logging capabilities to improve Reliability, Availability and Serviceability (RAS).
Machine Check Architecture (MCA) is a mechanism used in Intel Xeon® processors to report processor-related errors to upper-layer software. The Machine Check Architecture has typically been limited to errors in processor components such as cores and caches. With the integration of a memory controller, a PCIe root complex, and other components into the Xeon® Processor Socket, however, the scope of the Machine Check Architecture has been expanded to comprehend error handling and RAS events from all of these components. Another advance in the Machine Check Architecture is error recovery, in which errors that are uncorrectable from a processor/platform standpoint are provided to the Operating System (OS) or the Virtual Machine Monitor (VMM) for recovery.
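By way of illustration only, the following minimal sketch shows how upper-layer software might interpret a value read from a machine check bank status register. It assumes the bit layout published for the IA32_MCi_STATUS register (valid, uncorrected, processor-context-corrupt, signaled and action-required flags, with the MCA error code in the low 16 bits), and it assumes a processor that supports software error recovery, since the signaled and action-required flags are meaningful only in that case. The names MachineCheckStatus, decode_status and classify are hypothetical and are not taken from any existing implementation.

```python
from dataclasses import dataclass

# Flag positions assumed from the published IA32_MCi_STATUS layout.
VAL_BIT = 63  # the bank holds a valid logged error
UC_BIT = 61   # the error was not corrected by hardware
PCC_BIT = 57  # processor context may be corrupt
S_BIT = 56    # the error was signaled (recovery-capable processors)
AR_BIT = 55   # software action is required (recovery-capable processors)


@dataclass
class MachineCheckStatus:
    """Hypothetical decoded view of one bank status value."""
    valid: bool
    uncorrected: bool
    context_corrupt: bool
    signaled: bool
    action_required: bool
    mca_error_code: int


def _bit(raw: int, position: int) -> bool:
    """Return True if the given bit of raw is set."""
    return bool((raw >> position) & 1)


def decode_status(raw: int) -> MachineCheckStatus:
    """Split a 64-bit status value into the fields used for triage."""
    return MachineCheckStatus(
        valid=_bit(raw, VAL_BIT),
        uncorrected=_bit(raw, UC_BIT),
        context_corrupt=_bit(raw, PCC_BIT),
        signaled=_bit(raw, S_BIT),
        action_required=_bit(raw, AR_BIT),
        mca_error_code=raw & 0xFFFF,  # low 16 bits carry the MCA error code
    )


def classify(status: MachineCheckStatus) -> str:
    """Map the decoded flags onto a coarse handling category."""
    if not status.valid:
        return "no error logged"
    if not status.uncorrected:
        return "corrected"
    if status.context_corrupt:
        return "uncorrected, fatal"
    if status.signaled and status.action_required:
        return "uncorrected, recoverable, action required"
    if status.signaled:
        return "uncorrected, recoverable, action optional"
    return "uncorrected, no action required"
```

For example, a status value with the valid, uncorrected, signaled and action-required flags set is classified by this sketch as "uncorrected, recoverable, action required", which corresponds to the case in which the OS or VMM is expected to attempt recovery.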
In light of the above innovations, it is important for Operating System Vendors (OSVs) and/or Virtual Machine Vendors (VMVs) to verify comprehensively that error handling, error reporting and error recovery on platforms work as expected by the Operating System (OS) and/or Virtual Machine Monitor (VMM). This is referred to as RAS/error handling code coverage. Such code coverage allows the OS and/or VMM to ensure that the error handling, recovery and RAS features do indeed work end-to-end. Achieving it involves generating different types of errors, such as errors on different processor cores, processor caches, memory, input/output devices, system interconnects, etc. In order to verify error handling capabilities, the error injection capability needs to be available to the OS and/or VMM at runtime on a production platform. Implementing a real error injection mechanism in hardware for the various error subsystems and for all error cases is very complex and very expensive. Additionally, the error injection mechanism itself should not pose a risk to the security of the system.
Error handling code is currently validated with a limited set of real error injections. The real error injection mechanisms allow only one error to be injected at a time. Therefore, the error handling code receives only limited execution coverage. It is then necessary to perform many manual code reviews in order to determine additional code coverage paths.
Some previous implementations allow MCA banks to be written from ring0 code once a write-enabling register, itself accessible to ring0 code, has been written. This is a security hole. Since the MCA bank write-enabling register is in the ring0 domain, any ring0 code can enable the MCA bank write capability, write to an MCA bank, and cause unwanted error cases. Further, no previous implementations provide for signaling of errors such as Machine Check Architecture (MCA) and Corrected Machine Check Error Interrupt (CMCI), which limits the usefulness of those implementations.
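The weakness described above may be illustrated with a toy software model. This is a sketch under stated assumptions and does not represent any actual register interface: the class ToyMcaBanks, its methods, and the integer ring argument are all hypothetical. The model captures only the point that, when the write-enable control is guarded by nothing more than a ring0 privilege check, every ring0 caller can both enable bank writes and then plant an arbitrary status value.

```python
class ToyMcaBanks:
    """Toy model: bank writes gated only by a ring0-writable enable flag."""

    def __init__(self, bank_count: int) -> None:
        self.write_enable = False          # hypothetical write-enable control
        self.status = [0] * bank_count     # one status value per bank

    def set_write_enable(self, ring: int, enabled: bool) -> None:
        # The only check is the privilege level of the caller.
        if ring != 0:
            raise PermissionError("write-enable control is ring0 only")
        self.write_enable = enabled

    def write_status(self, ring: int, bank: int, value: int) -> None:
        if ring != 0:
            raise PermissionError("bank registers are ring0 only")
        if not self.write_enable:
            raise PermissionError("bank writes are disabled")
        self.status[bank] = value


def untrusted_ring0_caller(banks: ToyMcaBanks) -> None:
    """Any ring0 code can unlock the banks and forge an error record."""
    banks.set_write_enable(ring=0, enabled=True)
    banks.write_status(ring=0, bank=1, value=1 << 63)  # forged "valid" flag
```

In this model nothing distinguishes a legitimate validation tool from any other ring0 code, which is the security hole noted above.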
The present inventors have additionally identified that current solutions do not provide protection of MCA bank write enabling/disabling capability, or protection of error signaling (MCA/CMCI) generation enabling/disabling.