This invention relates generally to computer memory systems, and more particularly to a memory system of semiconductor devices in a modular architecture with high-availability characteristics that are realized through the use of partial ranks, multiple memory channels, and/or concurrently accessible partial ranks that minimize the impact of failures. The inventive features described herein enable the memory system to continue to operate unimpaired in the presence of a full memory module failure.
Contemporary high-performance computing memory systems are generally composed of one or more dynamic random access memory (DRAM) devices, which are connected to one or more processors via one or more memory control elements. DRAMs may be organized as one or more dual in-line memory modules (DIMMs). Overall computer system performance is affected by each of the key elements of the computer structure, including the performance/structure of the processor(s), any memory cache(s), the input/output (I/O) subsystem(s), the efficiency of the memory control function(s), the main memory device(s), and the type and structure of the memory interconnect interface(s).
Extensive research and development efforts are invested by the industry, on an ongoing basis, to create improved and/or innovative solutions for maximizing overall system performance and density by improving the memory system/subsystem design and/or structure. High-availability systems present further challenges with respect to overall system reliability, because customers expect new computer systems to markedly surpass existing systems in mean-time-between-failure (MTBF), in addition to offering substantially greater system memory capacity, additional functions, increased performance, reduced latency, increased storage, lower operating costs, etc. Other frequent customer requirements further exacerbate the memory system design challenges, and include such items as ease of upgrade and reduced system environmental impact (such as space, power and cooling).
As computer memory systems increase in performance and density, new challenges continue to arise. For example, random access memory (RAM) devices of a computer system may include hundreds of trillions of bits. A failure of a single RAM bit can cause the entire computer system to fail when error correction circuitry (ECC) is not utilized. ECC most commonly corrects relatively minor failures, such as single-bit, single-symbol and some minor multi-bit or multi-symbol failures. The ECC most commonly used in memory systems cannot correct full memory module (DIMM) failures. ECC capable of correcting full DIMM failures has not been exploited, because it would result in design trade-offs deemed unacceptable (e.g., cost, larger cache line sizes, reduced performance, etc.). When hard errors occur, such as single-cell, multi-bit, full-chip or full-DIMM failures, all or part of the system RAM may remain down until it is repaired.
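By way of illustration only, the single-bit correction capability noted above may be sketched with a textbook Hamming(7,4) code. This code is an assumption chosen for brevity; it is not the ECC of any particular memory system, nor the scheme of this invention, and the function names are hypothetical. The sketch shows a single flipped bit being located and corrected, and shows that two flipped bits already exceed what this small code can repair.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, syndrome); a nonzero syndrome is the 1-based bit position flipped back."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        # The decoder assumes exactly one bit is wrong and flips it back.
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    stored = hamming74_encode(data)

    # Single-bit failure: located by the syndrome and corrected.
    one_bad = list(stored)
    one_bad[4] ^= 1
    print(hamming74_decode(one_bad))

    # Two-bit failure: the decoder flips a third bit and returns wrong data.
    two_bad = list(stored)
    two_bad[0] ^= 1
    two_bad[1] ^= 1
    print(hamming74_decode(two_bad))
```

Codes used in practice are wider and typically add detection of some multi-bit errors, but the same limitation applies in principle: a failure spanning far more bits than the code was designed for, such as a full DIMM, cannot be corrected by such a code alone.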