In today's world of ubiquitous servers, maintaining good server reliability and uptime is almost mandatory. To that end, system designers build in reliability, availability, serviceability, and manageability (RASM) features. Thus, it is common to find various degrees of redundancy, error correction, error detection, and error containment techniques employed at different levels of the system hierarchy. One of the most common types of system failure is attributed to system memory errors. Hence, the memory subsystem (especially dual in-line memory modules (DIMMs)) receives particular attention in this regard.
Though modern memory employs error correction codes (ECC) to detect and/or correct single and double-bit errors, higher order multi-bit errors still pose a significant problem for system reliability and availability. Thus, techniques like memory mirroring are used to reduce the likelihood of system failure due to memory errors. Mirroring is typically performed statically by system firmware, which provides full redundancy for the entire memory range in a manner largely transparent to an underlying operating system/virtual machine monitor (OS/VMM). However, it is not very cost-effective and therefore tends to be deployed only on very high-end and mission-critical systems. This is because the effective usable memory is reduced to about half of the installed memory, while power consumption for the same amount of usable memory is effectively doubled. Also, with the cost of memory being a significant percentage of overall hardware cost, doubling it for redundancy purposes alone poses practical challenges for wide adoption.
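The cost argument above can be made concrete with a small calculation. The following is only an illustrative sketch: the function name `full_mirror_cost` and the `watts_per_gib` parameter are hypothetical, and it assumes that memory power scales linearly with installed capacity, which the text does not state.

```python
def full_mirror_cost(installed_gib: float, watts_per_gib: float) -> dict:
    """Estimate usable capacity and power per usable GiB under full mirroring.

    Assumption (not from the source): power draw is proportional to the
    installed capacity, at `watts_per_gib` watts per installed GiB.
    """
    if installed_gib <= 0 or watts_per_gib <= 0:
        raise ValueError("installed_gib and watts_per_gib must be positive")

    # Every usable byte is backed by a second copy, so half is usable.
    usable_gib = installed_gib / 2
    # All installed memory is powered, including the mirror copy.
    total_power_w = installed_gib * watts_per_gib

    return {
        "usable_gib": usable_gib,
        "total_power_w": total_power_w,
        "watts_per_usable_gib": total_power_w / usable_gib,
    }


cost = full_mirror_cost(installed_gib=64, watts_per_gib=0.5)
```

Under these assumptions, a 64 GiB system exposes 32 GiB to the OS/VMM, and the power spent per usable GiB is twice the per-installed-GiB figure.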
Memory mirroring thus provides two identical copies (also referred to as mirrors) of the data. If one portion of the memory goes down or breaks, the other can provide the requested data so that code and data integrity is preserved. A technique has been proposed to utilize mirroring on a more granular scale, covering less than half of the total memory space, and to allow the OS to direct the final mirrored size. However, this does not fully solve platform problems. For example, assume that in a partially mirrored system, the OS creates a small memory mirror of less than half the memory. If the mirror breaks, e.g., due to an uncorrectable error, that memory range will continue in a non-redundant state until the mirror is reconfigured on a subsequent power-on self-test (POST).
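The failure scenario described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not a description of any real firmware or memory controller interface: the names `MirroredRange`, `MirrorState`, `inject_uncorrectable_error`, and `reconfigure_at_post` are hypothetical, and the model assumes that POST re-establishes the mirror with cleared contents.

```python
from enum import Enum


class MirrorState(Enum):
    REDUNDANT = "redundant"
    NON_REDUNDANT = "non-redundant"


class UncorrectableError(Exception):
    """Raised when no healthy copy of the data remains."""


class MirroredRange:
    """Toy model of one mirrored memory range backed by two copies."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.copies = [bytearray(size), bytearray(size)]
        self.healthy = [True, True]

    @property
    def state(self) -> MirrorState:
        if all(self.healthy):
            return MirrorState.REDUNDANT
        return MirrorState.NON_REDUNDANT

    def write(self, offset: int, value: int) -> None:
        # Writes go to every copy that is still healthy.
        for copy, ok in zip(self.copies, self.healthy):
            if ok:
                copy[offset] = value

    def read(self, offset: int) -> int:
        # Serve the read from the first healthy copy, if any.
        for copy, ok in zip(self.copies, self.healthy):
            if ok:
                return copy[offset]
        raise UncorrectableError("both copies of the range have failed")

    def inject_uncorrectable_error(self, copy_index: int) -> None:
        # Models an error in one copy that ECC cannot correct.
        self.healthy[copy_index] = False

    def reconfigure_at_post(self) -> None:
        # Assumption: the next POST rebuilds the mirror from scratch,
        # so contents are cleared and both copies are healthy again.
        self.copies = [bytearray(self.size), bytearray(self.size)]
        self.healthy = [True, True]


rng = MirroredRange(size=16)
rng.write(3, 0xAB)
rng.inject_uncorrectable_error(0)   # the mirror breaks
value = rng.read(3)                 # still served from the surviving copy
state_after_break = rng.state       # stays non-redundant until POST
```

In this model the range keeps serving data after the first failure, but a second failure before `reconfigure_at_post` is called leaves no healthy copy, which is the exposure window the paragraph above describes.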