1. Field of the Invention
The present invention relates to the field of memory fault tolerance in high performance computing and more particularly to memory redundancy through memory mirroring.
2. Description of the Related Art
Memory refers to an electronic device configured to store information, generally in binary form. Memory often refers to dynamic memory, including dynamic random access memory (DRAM), in which charges within small capacitors are continually and selectively refreshed to store high and low signals representative of binary values. Inasmuch as the ultimate representation of a binary value is physical in nature, errors can occur. Because the integrity of values in memory is the most critical aspect of any computing system, it is imperative that errors in memory be identified immediately and handled accordingly.
Memory can be susceptible to two basic types of errors: hard and soft. Hard errors refer to structural imperfections in the memory module itself in which the memory stores a particular value irrespective of the value written to memory. In the case of a hard error, the symptoms are repeatable and persistent. Soft errors, by comparison, often are transient and result from logical imperfections in controlling reading from and writing to memory. Because soft errors generally are transient in nature, they can be very difficult to diagnose and correct.
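The repeatable, persistent character of a hard error can be illustrated with a minimal sketch. The classes `StuckCell` and `GoodCell` and the function `has_hard_error` are hypothetical names introduced only for this illustration; they model a single one-bit cell and do not describe any particular memory device or test procedure.

```python
class StuckCell:
    """Hypothetical one-bit cell with a hard error: it always reads back
    `stuck_value`, irrespective of the value written."""

    def __init__(self, stuck_value):
        self.stuck_value = stuck_value

    def write(self, bit):
        pass  # the written value is lost

    def read(self):
        return self.stuck_value


class GoodCell:
    """Hypothetical healthy one-bit cell."""

    def __init__(self):
        self.bit = 0

    def write(self, bit):
        self.bit = bit

    def read(self):
        return self.bit


def has_hard_error(cell, passes=3):
    """Write 0 and then 1 on each pass and read each value back.
    Report a hard error only if every pass fails, since the symptoms
    of a hard error are repeatable and persistent."""
    failed_passes = 0
    for _ in range(passes):
        pass_ok = True
        for bit in (0, 1):
            cell.write(bit)
            if cell.read() != bit:
                pass_ok = False
        if not pass_ok:
            failed_passes += 1
    return failed_passes == passes


stuck_result = has_hard_error(StuckCell(1))  # the stuck cell fails every pass
good_result = has_hard_error(GoodCell())     # the healthy cell never fails
```

A transient soft error, by contrast, would fail at most some of the passes in this sketch, which is one reason such errors are harder to diagnose.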
Memory fault tolerance refers to mechanisms established to cope with hard errors and soft errors in memory in order to remediate and recover where possible. Memory fault tolerance often begins with the detection of a memory error. Early forms of detection utilized parity bit checking. The parity bit, as is well known, indicates either an even or odd parity for stored information. The parity of retrieved information can be compared to the parity bit value to readily detect an obvious failure where no match occurs. Error correction code (ECC) processing, by comparison, involves the generation of a single sum for all bits (checksum) that can be compared to the sum of all bits for retrieved information. Again, a failure to match indicates a memory failure.
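The parity comparison described above can be sketched as follows. This is a minimal illustration that assumes an 8-bit word and even parity by default; the function names `parity_bit` and `parity_matches` are hypothetical and are not drawn from any specific memory controller.

```python
def parity_bit(data, even=True):
    """Return the parity bit for `data`. With even parity, the bit is chosen
    so that the data bits plus the parity bit contain an even number of ones."""
    ones = bin(data).count("1")
    bit = ones % 2  # 1 when the count of ones is odd
    return bit if even else bit ^ 1


def parity_matches(data, stored_parity, even=True):
    """Compare the parity of retrieved data with the stored parity bit."""
    return parity_bit(data, even) == stored_parity


word = 0b10110010                 # contains four ones
stored = parity_bit(word)         # parity bit kept alongside the word

single_flip = word ^ 0b00000100   # one bit corrupted
double_flip = word ^ 0b00000110   # two bits corrupted

single_flip_detected = not parity_matches(single_flip, stored)
double_flip_detected = not parity_matches(double_flip, stored)
```

In this sketch the single-bit corruption changes the parity and is detected, while the two-bit corruption leaves the parity unchanged and goes unnoticed, which illustrates why parity checking alone detects only the more obvious failures.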
It is well known that ECC utilization can be effective for small, isolated soft errors in memory. For very large systems, however, multiple soft errors are more common, and ECC utilization cannot correct so many soft errors. Accordingly, more advanced forms of fault tolerance have become prevalent in larger computing systems. The most common form of fault tolerance in the latter circumstance is memory redundancy. Memory redundancy refers to the deployment of more memory than is utilized, in order to provide for failover in the event of a faulty bank of memory. One such form of memory redundancy includes spare-bank memory, while another form includes memory redundant array of independent dual inline memory modules (RAID).
A third form of memory redundancy includes memory mirroring. Memory mirroring refers to the use of redundant copies of system memory. In memory mirroring, when a multi-bit error is detected during a memory access, the system, instead of performing a hard stop, can immediately fail over to a read from a mirror image of the failed memory. Consequently, memory mirroring can provide a high level of fault tolerance. Notwithstanding, memory mirroring can consume substantial resources in that two or more channels of memory are required to achieve mirroring, with one channel including memory allocated for operational data and one channel including memory allocated for duplicate data. Even in the instance where half of all memory across both channels is allocated for mirroring, half of the available memory bandwidth is consumed with writing duplicate data, leaving only half of the available memory bandwidth for use in writing operational data.
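The failover behavior and the write cost of mirroring can be illustrated with a minimal software model. The class `MirroredMemory`, its method `inject_fault`, and the counter `physical_writes` are hypothetical names introduced for this sketch only; actual mirroring is performed by the memory controller across physical channels, and the fault set below merely simulates the detection of an uncorrectable error.

```python
class MirroredMemory:
    """Hypothetical model of two memory channels: one holds operational
    data and the other holds a duplicate (mirror) copy."""

    def __init__(self, size):
        self.primary = [0] * size
        self.mirror = [0] * size
        self.faulty = set()       # simulated uncorrectable primary addresses
        self.physical_writes = 0  # writes issued across both channels

    def write(self, addr, value):
        # Every logical write is issued twice, once per channel.
        self.primary[addr] = value
        self.mirror[addr] = value
        self.physical_writes += 2

    def inject_fault(self, addr):
        # Assumption: stands in for detection of a multi-bit error on read.
        self.faulty.add(addr)

    def read(self, addr):
        # Fail over to the mirror image instead of performing a hard stop.
        if addr in self.faulty:
            return self.mirror[addr], "mirror"
        return self.primary[addr], "primary"


mem = MirroredMemory(4)
mem.write(2, 0xAB)
mem.inject_fault(2)
value, source = mem.read(2)
```

In this model the read of the faulted address still returns the written value, now served from the mirror, while the single logical write has consumed two physical writes, consistent with half of the write bandwidth being spent on duplicate data.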