1. Field of the Invention
The present invention relates generally to controlling access to computer memory systems and, more particularly, to restoring access to a failed data storage device in a redundant memory system.
2. Related Art
A computer memory module commonly includes a plurality of integrated circuits (ICs), each of which stores millions of binary digits (bits) of data. Most memory ICs store data bits in columns, rows and planes of memory cells, with each cell comprising a relatively small capacitor. When data is written to a memory cell, its capacitor is either charged to a predetermined voltage to represent a “1” bit, or the capacitor is discharged to represent a “0” bit. If the capacitor's charge changes significantly between the time data is written to the memory cell and the time the memory cell is read, data read from the memory cell will not correctly represent the data previously written to that cell. Such an occurrence is commonly referred to as a memory error.
Memory errors can be classified as hard or soft, depending on whether the errors occur repeatably or randomly. For example, a failed capacitor usually causes its memory cell to be read as a “0” regardless of whether a “1” or a “0” was written to the memory cell. Thus, a failed capacitor usually causes repeatable, or hard, memory errors. In contrast, random or soft memory errors are usually caused by sporadic events, most commonly cosmic rays. A sufficiently high-energy cosmic ray passing through a memory cell capacitor can change the capacitor's charge, altering data stored in the memory cell. Because of their relatively narrow beams, cosmic rays typically affect only one or a small number of memory cells of a memory module.
Progressively smaller capacitors have been used in successive generations of memory ICs, yielding higher densities of memory cells and, therefore, higher memory capacities. Unfortunately, such higher-density memory modules are more susceptible to cosmic ray-induced memory errors than their lower-density counterparts. Smaller capacitors require lower voltages to represent a “1” bit, enabling weaker cosmic rays to alter the contents of the memory cells. In addition, because such memory cells are more densely packed in the ICs, a single cosmic ray can pass through, and therefore affect, a greater number of capacitors than in lower-density memory ICs. Thus, higher-density memory ICs are more likely to incur soft memory errors and are more likely to incur multi-bit, as opposed to single-bit, soft errors than lower-density memory ICs.
Various protocols have been developed to manage memory errors. For example, some memory systems include capabilities similar to those used in redundant arrays of independent disks (RAID) storage systems. In the context of memory systems, the term “RAID” traditionally refers to redundant arrays of industry-standard DIMMs (dual in-line memory modules), although the term “RAIM” (redundant array of independent memory) is also commonly used to refer to such systems, and will be used herein. If one of the redundant storage devices (disk drives or memory modules) fails, the redundancy enables the memory system to use data from the surviving storage devices to reconstruct data stored on the failed device. This process of reconstructing lost data is commonly referred to as error correction.
A RAIM memory system uses a quantity of memory modules (typically four) to store data, and an additional (e.g., a fifth) memory module to store parity information. Data to be stored is divided into four blocks. Each block is stored in one of the data memory modules in a process commonly known as striping. Parity information calculated from the four blocks is stored in the parity memory module. When retrieving data from the memory modules, error correction code (ECC) logic typically included in the RAIM system attempts to automatically correct detected data errors. If an error cannot be corrected (i.e., it is “uncorrectable”), the data fetched from the failed memory module is reconstructed using the data in the remaining three data memory modules and the parity information in the parity memory module. In addition, the RAIM memory system ceases reading (i.e., takes off-line) the memory module that incurred the uncorrectable error.
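By way of illustration only, the striping and reconstruction described above can be sketched in Python. The sketch assumes that the parity information is a byte-wise exclusive-OR (XOR) of the four data blocks, which is a common choice but is not stated above; the function names stripe, xor_blocks and reconstruct are hypothetical and do not correspond to any particular RAIM implementation.

```python
from functools import reduce


def stripe(data: bytes, n_modules: int = 4) -> list[bytes]:
    """Divide data into n_modules equal-length blocks, zero-padding the tail."""
    block_len = -(-len(data) // n_modules)  # ceiling division
    padded = data.ljust(block_len * n_modules, b"\x00")
    return [padded[i * block_len:(i + 1) * block_len] for i in range(n_modules)]


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks (assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(blocks: list[bytes], parity: bytes, failed: int) -> bytes:
    """Rebuild the block of the failed module from the survivors and parity."""
    survivors = [b for i, b in enumerate(blocks) if i != failed]
    return xor_blocks(survivors + [parity])


# Usage: stripe 16 bytes across four data modules, compute parity for the
# fifth module, then rebuild the third block as if its module had failed.
blocks = stripe(b"ABCDEFGHIJKLMNOP")
parity = xor_blocks(blocks)
rebuilt = reconstruct(blocks, parity, failed=2)
```

Because XOR is its own inverse, combining the three surviving data blocks with the parity block yields exactly the missing block, which is why a single failed module can be tolerated under this assumption.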
After a failed memory module is taken off-line, the remaining memory modules no longer provide the redundancy necessary to recover from an uncorrectable error. That is, if one of the three remaining data memory modules, or the parity memory module, subsequently incurs an uncorrectable error, the RAIM memory system will be unable to reconstruct the data. Instead, it will signal an unrecoverable memory error, typically causing the host computer system to crash.
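The loss of redundancy can be illustrated with a minimal Python sketch. It assumes a single XOR parity block over four data blocks, so that at most one failed module per stripe can be tolerated; the names read_stripe and UnrecoverableMemoryError are hypothetical and are introduced here only for illustration.

```python
from functools import reduce


class UnrecoverableMemoryError(Exception):
    """Raised when more modules have failed than the parity can cover."""


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks (assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_stripe(modules, failed):
    """Return the stored data given the set of failed module indices.

    modules: four data blocks followed by one parity block.
    failed:  indices of modules that are off-line or returned an
             uncorrectable error.
    """
    *data, parity = modules
    if len(failed) > 1:
        # Single parity cannot rebuild two missing blocks.
        raise UnrecoverableMemoryError(f"modules {sorted(failed)} failed")
    for i in failed:
        if i < len(data):  # a failed parity module leaves the data intact
            others = [b for j, b in enumerate(data) if j != i]
            data[i] = xor_blocks(others + [parity])
    return b"".join(data)


# Usage: one failed module is tolerated; a second failure is not.
data = [b"AB", b"CD", b"EF", b"GH"]
modules = data + [xor_blocks(data)]
modules[1] = b"\x00\x00"               # module 1 has failed; contents lost
recovered = read_stripe(modules, {1})  # rebuilt from survivors and parity
```

Calling read_stripe(modules, {1, 3}) in this sketch raises UnrecoverableMemoryError, mirroring the unrecoverable memory error described above.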