This invention relates generally to computer systems, and more particularly to improving the serviceability of a memory system in the presence of defective memory parts.
Contemporary high-performance computing main memory systems are generally composed of one or more dynamic random access memory (DRAM) devices, which are connected to one or more processors via one or more memory control elements. Overall computer system performance is affected by each of the key elements of the computer structure, including the performance/structure of the processor(s), any memory cache(s), the input/output (I/O) subsystem(s), the efficiency of the memory control function(s), the main memory device(s), the type and structure of the memory interconnect interface(s) and the type and efficiency of any failure detection/correction function associated with one or more elements of the system.
FIG. 1 relates to U.S. Pat. No. 5,513,135 to Dell et al., of common assignment herewith, and depicts an early synchronous memory module. The memory module depicted in FIG. 1 is a dual in-line memory module (DIMM). This module is composed of synchronous DRAMs 108, buffer devices 112, an optimized pinout, and an interconnect and capacitive decoupling method to facilitate high performance operation. The patent also describes the use of clock re-drive on the module, using such devices as phase-locked loops (PLLs).
FIG. 2 relates to U.S. Pat. No. 6,173,382 to Dell et al., of common assignment herewith, and depicts a computer system 210 which includes a synchronous memory module 220 that is directly (i.e., point-to-point) connected to a memory controller 214 via a bus 240, and which further includes logic circuitry 224 (such as an application specific integrated circuit, or “ASIC”) that buffers, registers or otherwise acts on the address, data and control information that is received from the memory controller 214. The memory module 220 can be programmed to operate in a plurality of selectable or programmable modes by way of an independent bus, such as an inter-integrated circuit (I2C) control bus 234, either as part of the memory initialization process or during normal operation. When utilized in applications requiring more than a single memory module connected directly to a memory controller, the patent notes that the resulting stubs can be minimized through the use of field-effect transistor (FET) switches to electrically disconnect modules from the bus.
Relative to U.S. Pat. No. 5,513,135, U.S. Pat. No. 6,173,382 further demonstrates the capability of integrating all of the defined functions (address, command, data, presence detect, etc.) into a single device. The integration of functions is a common industry practice that is enabled by technology improvements and, in this case, enables additional module density and/or functionality.
Extensive research and development efforts are invested by the industry, on an ongoing basis, to create improved and/or innovative solutions for maximizing overall system performance and density by improving the memory system/subsystem design and/or structure. The need for high availability and minimal (if any) downtime presents further challenges for overall system reliability, because customers expect that new computer systems will markedly surpass existing systems in regard to mean-time-between-failure (MTBF), in addition to offering additional functions, increased performance, increased storage, lower operating costs, etc.
The use of enhanced error correction techniques has been a major factor in improving MTBF; however, it is becoming increasingly difficult to identify a single failing module within a cluster (2, 4 or more) of modules that are accessed in parallel to service a cache line access (typically 64 bytes, 128 bytes or larger). This is particularly true in cases where the error is identified by the error correcting code (ECC) structure as being “uncorrectable.” During scheduled or unscheduled repair actions, given the limited amount of failure data (especially in cases where an uncorrectable error has resulted in the repair action) and the need to bring the system online quickly, it is common for more than one memory module to be removed in response to an apparent memory system failure. It would be desirable to have the capability of quickly and accurately identifying specific failing memory modules in order to reduce the number of functional modules that are unnecessarily replaced.