1. Field of the Invention
Embodiments of the present invention generally relate to techniques for increasing the overall reliability of memory systems that utilize modular memory modules.
2. Description of the Related Art
Computer system performance can be increased by increasing computing power, for example, by utilizing more powerful processors or a larger number of processors to form a multi-processor system. It is well established, however, that increasing memory in computing systems can often have a greater effect on overall system performance than increasing computing power. This result holds true from personal computers (PCs) to massively parallel supercomputers.
High-performance computing platforms, such as the Altix systems available from Silicon Graphics, Inc., may include several terabytes (TB) of memory. To provide such a large amount of memory, such configurations may include many thousands of modular memory modules, such as dual inline memory modules (DIMMs). Unfortunately, with such a large number of modules in use (each having a number of memory chips), at least some memory failures can be expected. While factory testing at the device (IC) level can catch many defects and, in some cases, allow defective cells to be replaced with redundant cells (e.g., via fusing), some defects may develop over time after factory testing.
In conventional systems, a zero-defect tolerance policy is typically employed. If a memory failure is detected, the entire module is replaced, even if the failure is limited to a relatively small portion of the module. Replacing modules in this manner is inefficient in a number of ways, in addition to the possible interruption of computing and loss of data. On the one hand, the replacement may be performed by repair personnel of the system vendor, at substantial cost to the vendor. On the other hand, the replacement may be performed by dedicated personnel of the customer, at substantial cost to the customer.
One costly way to increase fault tolerance is through redundancy. For example, some systems may utilize some type of system memory mirroring whereby the same data is stored in multiple “mirrored” memory devices. However, this solution can be cost-prohibitive, particularly as the overall memory space increases. Another alternative is to simply exclude an entire defective device or DIMM from allocation. However, this approach may significantly impact performance by reducing the available memory by an entire device or DIMM, regardless of the number of memory locations found to be defective.
Accordingly, what is needed is a technique to increase the overall reliability of memory systems that utilize modular memory modules.