In today's world of ubiquitous servers, maintaining server reliability and uptime is almost mandatory. To sustain high uptime, system designers build in reliability, availability, serviceability, and manageability (RASM) features. It is therefore common to find varying degrees of redundancy, error correction, error detection, and error containment employed at different levels of the system hierarchy. One of the most common causes of system failure is system memory errors. Hence, the memory subsystem, especially dual in-line memory modules (DIMMs), receives particular attention in this regard.
Although modern memory employs error correction code (ECC) to detect and/or correct single-bit and double-bit errors, higher-order multi-bit errors still pose a significant problem for system reliability and availability. Techniques such as memory mirroring and memory migration are therefore used to reduce the likelihood of system failure due to memory errors. Mirroring is typically performed statically by system hardware and firmware, which provide full redundancy for the entire memory range in a manner largely transparent to an underlying operating system/virtual machine monitor (OS/VMM). However, mirroring is not very cost-effective and therefore tends to be deployed only on very high-end and mission-critical systems: the effective usable memory is reduced to about half, while power consumption for the same amount of usable memory is effectively doubled. Also, because memory accounts for a significant percentage of overall hardware cost, doubling it for redundancy purposes alone poses practical challenges for wide adoption.
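The cost argument above can be made concrete with a small calculation. The following is only an illustrative sketch: the function name, the normalized power unit, and the 512-unit configuration are assumptions chosen for the example, not figures from any particular platform.

```python
def mirroring_overhead(installed_units, power_per_installed_unit=1.0):
    """Return (usable capacity, power per usable unit) for a fully mirrored system.

    Assumption: every installed unit of memory draws the same normalized power,
    and full mirroring dedicates exactly half of the installed memory to the copy.
    """
    usable_units = installed_units / 2                      # half holds the mirror copy
    total_power = installed_units * power_per_installed_unit
    return usable_units, total_power / usable_units


# Hypothetical configuration: 512 units of installed memory.
usable, power_per_usable = mirroring_overhead(installed_units=512)

# Without mirroring, every installed unit is usable, so the baseline is 1.0.
unmirrored_power_per_usable = 1.0
print(usable, power_per_usable / unmirrored_power_per_usable)
```

Under these assumptions, only half of the installed memory is usable, and each usable unit costs twice the power of the unmirrored baseline.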
On a mission-critical server, the system should never be shut down or lose its operational state, so that the server can achieve an uptime of 99.999% (about five minutes of downtime per year). Memory migration is another platform RAS flow; it is triggered on a memory mirror replacement or during controller-level memory sparing operations. For a memory mirror replacement, suppose that a memory node X and a memory node Y are set as a mirror pair, so that both nodes store the same data, e.g., with X as the master and Y as the slave. For various reasons, system software can stop the mirroring, power down the master, and let an administrator replace the master's memory node. Once the node is replaced, the memory contents of the master and slave can be re-synchronized. This is done via a memory migration, in which the information stored on node Y is copied to node X. In controller-level memory sparing, a spare memory node in a non-mirrored configuration can also be present in the system. This spare node can be “spared” into another node if that node fails. In this case, the contents of the outgoing memory node are copied over to the spare node via memory migration.
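The two migration triggers described above can be sketched as follows. This is a simplified software model, not platform firmware: the `MemoryNode` class and the `migrate`, `resync_replaced_master`, and `spare_in` helpers are hypothetical names introduced for illustration, and the sketch ignores writes that arrive while the copy is in progress.

```python
class MemoryNode:
    """Hypothetical memory node modeled as a fixed-size list of line values."""

    def __init__(self, name, num_lines):
        self.name = name
        self.lines = [0] * num_lines


def migrate(source, target):
    """Copy every line from source to target and return the number copied."""
    if len(source.lines) != len(target.lines):
        raise ValueError("source and target must have the same capacity")
    for addr, value in enumerate(source.lines):
        target.lines[addr] = value
    return len(source.lines)


def resync_replaced_master(slave_y, new_master_x):
    """Mirror replacement: re-synchronize the replaced master X from slave Y."""
    return migrate(source=slave_y, target=new_master_x)


def spare_in(outgoing, spare):
    """Controller-level sparing: copy the outgoing node's contents to the spare."""
    return migrate(source=outgoing, target=spare)


# Usage: node Y kept the data while the master's node X was swapped out.
node_y = MemoryNode("Y", 4)
node_y.lines = [10, 20, 30, 40]
node_x = MemoryNode("X", 4)  # freshly installed, holds no useful data yet
resync_replaced_master(node_y, node_x)

# Usage: a failing node is copied to a spare before it is taken out of service.
failing = MemoryNode("failing", 4)
failing.lines = [1, 2, 3, 4]
spare = MemoryNode("spare", 4)
spare_in(failing, spare)
```

In both cases the same copy operation does the work; only the choice of source and target differs.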
In memory mirroring mode, memory read requests go to the master, and memory write requests are directed to both the master and the slave. If there is an uncorrectable error on the master, the slave fulfills the request, because the slave holds an exact copy of the data and thus provides the redundancy. In migration mode, as in mirroring, all read requests are directed to the master and write requests are directed to both the master and the slave. However, if there is an uncorrectable error on the master during the migration process, the slave cannot fulfill that read request because it does not have the data available, which results in a fatal error that takes down the system. For a large memory configuration, memory migration can take a significant amount of time. There is therefore a reasonable probability that the master, which has already experienced certain correctable errors that caused the migration event, will see an uncorrectable error, and in migration mode such an uncorrectable error will cause the system to crash.
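The difference between the two modes can be illustrated with a minimal software model. All names here (`Node`, `MemoryPair`, `UncorrectableError`, `FatalSystemError`, and the `"mirror"`/`"migration"` mode strings) are assumptions made for the sketch; real platforms implement this routing in the memory controller.

```python
class UncorrectableError(Exception):
    """Raised when a node cannot return valid data for an address."""


class FatalSystemError(Exception):
    """Models the system crash described for migration mode."""


class Node:
    """Hypothetical memory node with a set of addresses that fail on read."""

    def __init__(self, size):
        self.data = [None] * size
        self.bad = set()  # addresses with uncorrectable errors

    def read(self, addr):
        if addr in self.bad:
            raise UncorrectableError(addr)
        return self.data[addr]


class MemoryPair:
    """Master/slave pair operating in either 'mirror' or 'migration' mode."""

    def __init__(self, master, slave, mode):
        if mode not in ("mirror", "migration"):
            raise ValueError("mode must be 'mirror' or 'migration'")
        self.master, self.slave, self.mode = master, slave, mode

    def write(self, addr, value):
        # In both modes, writes are directed to the master and the slave.
        self.master.data[addr] = value
        self.slave.data[addr] = value

    def read(self, addr):
        # In both modes, reads are directed to the master first.
        try:
            return self.master.read(addr)
        except UncorrectableError as err:
            if self.mode == "mirror":
                # The slave holds an exact copy, so it can fulfill the read.
                return self.slave.read(addr)
            # During migration the slave is not a usable fallback.
            raise FatalSystemError(f"uncorrectable error at {addr}") from err


# Mirroring: an uncorrectable error on the master is absorbed by the slave.
mirror = MemoryPair(Node(4), Node(4), "mirror")
mirror.write(1, 42)
mirror.master.bad.add(1)
mirror_value = mirror.read(1)

# Migration: the same error on the master is fatal.
migrating = MemoryPair(Node(4), Node(4), "migration")
migrating.master.data[2] = 7  # data present only on the master so far
migrating.master.bad.add(2)
try:
    migrating.read(2)
    migration_outcome = "ok"
except FatalSystemError:
    migration_outcome = "fatal"
```

The model treats every uncorrectable master error during migration as fatal, matching the behavior described above; it does not attempt to track which lines have already been copied.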