Memory integrity and reliability are core aspects of computer server reliability, particularly for RAS (reliability, availability, and serviceability) servers. To ensure that memory integrity is maintained, computer systems have various mechanisms for error checking and otherwise monitoring the health of memory resources, such as using ECC (Error Correction Codes) at the hardware level. Other measures are typically implemented at the firmware level, such that an operating system only needs to respond to errors that are detected at the hardware and/or firmware levels, and does not have to proactively detect such errors. Another technique for ensuring memory reliability and availability is to provide mirrored ranks of memory, such that there is a backup copy of each memory resource during normal operations. If a memory resource, or a portion of one, fails, execution can continue without disruption by accessing memory resources that have not failed or become corrupted.
At some point in time, there is bound to be a failure. In response to detection of such a failure, a fail-over procedure is implemented so that memory accesses are redirected to the memory resource holding the “good” copy of data and a new backup copy is written to another memory resource not currently in use. This process becomes significantly more complicated under modern architectures employing multi-core processors with multiple cache levels and multiple memory controllers, such as those found in today's high-reliability server architectures.
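The fail-over procedure described above can be illustrated with a toy software model. This is only a sketch under stated assumptions: the class `MirroredMemory`, its attribute names, and the single spare resource are hypothetical, and real systems perform these steps in memory-controller hardware and firmware rather than in Python.

```python
class MirroredMemory:
    """Toy model of a mirrored memory pair plus one unused spare (all names hypothetical)."""

    def __init__(self, size):
        self.master = [0] * size  # copy that currently serves reads
        self.mirror = [0] * size  # backup copy, updated on every write
        self.spare = [0] * size   # memory resource not currently in use
        self.failovers = 0

    def write(self, addr, value):
        # During normal operation every write goes to both copies.
        self.master[addr] = value
        self.mirror[addr] = value

    def read(self, addr):
        return self.master[addr]

    def fail_over(self):
        """Redirect accesses to the good copy and rebuild a backup on the spare."""
        if self.spare is None:
            raise RuntimeError("no unused memory resource left for a new backup")
        good = self.mirror
        new_backup = self.spare
        new_backup[:] = good  # write a new backup copy to the spare
        self.master, self.mirror, self.spare = good, new_backup, None
        self.failovers += 1
```

After `fail_over()` returns, reads are served from the formerly mirrored copy, and the spare holds the new backup. The model ignores caches, multiple memory controllers, and concurrent accesses, which are exactly the factors that complicate the procedure on real servers.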
Current server memory controllers support only software-based mirror/memory migration. When a mirror fail-over occurs, an SMI (System Management Interrupt) is generated. A BIOS SMI handler puts the master node in memory migration mode and generates reads and writes targeting all the addresses mapped by the failing master node. There are many practical issues with this approach. Time spent within the SMI handler has to be limited to keep the OS responsive (for example, an SMI handler must be executed to completion within 150 μsec). Hence, only a small chunk of memory can be copied during each SMI cycle, and the whole memory migration process has to be spread over multiple recurring SMI cycles. This is highly inefficient and has serious drawbacks when handling error conditions such as ‘poison’ mode (i.e., a cache line with an error that has yet to be consumed). In the current generation of server CPUs, there is no indication of poison mode until the ‘poisoned’ data reaches the consumer. Therefore, with software-based migration, any poisoned data could potentially be consumed by the SMI handler during the migration process and could eventually lead to failure of the overall system. Current-generation CPUs work around this problem by using convoluted and complex approaches that are highly inefficient and keep the Master/Slave nodes in migration mode longer than is actually required.
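The chunked, software-driven migration described above can be sketched as follows. This is an illustrative model only, not actual BIOS code: the names `migrate_in_chunks`, `POISON`, and `PoisonConsumed`, and the idea of a fixed number of cache lines copied per SMI, are assumptions introduced for the example.

```python
POISON = object()  # hypothetical marker for a cache line holding poisoned data


class PoisonConsumed(Exception):
    """Raised when the copy loop reads (i.e., consumes) a poisoned line."""


def migrate_in_chunks(master, slave, lines_per_smi):
    """Copy master to slave, at most lines_per_smi lines per simulated SMI cycle.

    Returns the number of SMI cycles needed to finish the migration.
    """
    cycles = 0
    next_line = 0
    while next_line < len(master):
        # One SMI invocation: the time budget only allows a small chunk.
        end = min(next_line + lines_per_smi, len(master))
        for i in range(next_line, end):
            data = master[i]  # software read: the handler is now the consumer
            if data is POISON:
                # Poison is only discovered here, at the point of consumption.
                raise PoisonConsumed(i)
            slave[i] = data
        next_line = end
        cycles += 1  # handler returns; the rest waits for the next SMI
    return cycles
```

In this model, migrating 10 lines at 4 lines per SMI takes 3 cycles, which shows how a small per-SMI budget stretches the migration across many interrupts. It also shows why poison is problematic for a software copy loop: the handler itself performs the read, so it becomes the consumer of any poisoned line it encounters.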