This invention relates generally to computer systems, and more particularly to improving the error detection in a memory system through the use of syndrome trapping.
Contemporary high performance computing main memory systems are generally composed of one or more dynamic random access memory (DRAM) devices which are connected to one or more processors via one or more memory control elements. Overall computer system performance is affected by each of the key elements of the computer structure, including the performance/structure of the processor(s), any memory cache(s), the input/output (I/O) subsystem(s), the efficiency of the memory control function(s), the main memory device(s), the type and structure of the memory interconnect interface(s) and the type and efficiency of any failure detection/correction function associated with one or more elements of the system.
FIG. 1 relates to U.S. Pat. No. 5,513,135 to Dell et al., of common assignment herewith, and depicts an early synchronous memory module. The memory module depicted in FIG. 1 is a dual in-line memory module (DIMM). This module is composed of synchronous DRAMs 108, buffer devices 112, an optimized pinout, and an interconnect and capacitive decoupling method to facilitate high performance operation. The patent also describes the use of clock re-drive on the module, using such devices as phase-locked loops (PLLs).
FIG. 2 relates to U.S. Pat. No. 6,173,382 to Dell et al., of common assignment herewith, and depicts a computer system 210 which includes a synchronous memory module 220 that is directly (i.e. point-to-point) connected to a memory controller 214 via a bus 240, and which further includes logic circuitry 224 (such as an application specific integrated circuit, or “ASIC”) that buffers, registers or otherwise acts on the address, data and control information that is received from the memory controller 214. The memory module 220 can be programmed to operate in a plurality of selectable or programmable modes by way of an independent bus, such as an inter-integrated circuit (I2C) control bus 234, either as part of the memory initialization process or during normal operation. When utilized in applications requiring more than a single memory module connected directly to a memory controller, the patent notes that the resulting stubs can be minimized through the use of field-effect transistor (FET) switches to electrically disconnect modules from the bus.
Relative to U.S. Pat. No. 5,513,135, U.S. Pat. No. 6,173,382 further demonstrates the capability of integrating all of the defined functions (address, command, data, presence detect, etc.) into a single device. The integration of functions is a common industry practice that is enabled by technology improvements and, in this case, enables additional module density and/or functionality.
Extensive research and development efforts are invested by the industry, on an ongoing basis, to create improved and/or innovative solutions to maximize overall system performance, density and reliability by enhancing one or more of the memory controller, overall memory system and/or subsystem design and/or structure. The increasing need for high-availability and minimal (if any) system down-time presents further challenges as related to overall system reliability due to customer expectations that new computer systems will markedly surpass existing systems in regard to mean-time-between-failure (MTBF), in addition to offering additional functions, increased performance, increased storage, lower operating costs, etc.
The use of enhanced error correction techniques has been a major factor in improving MTBF; however, the added cost of the memory devices required to store the necessary error correction code (ECC) check bits is excessive for some markets, and this situation is further aggravated by the increased minimum memory device burst lengths of emerging memory technologies (e.g., DDR3 and DDR4 SDRAMs). A memory system solution that ensures a high degree of data integrity at minimal overhead (e.g., cost, performance, etc.) would be highly desirable for systems such as high-end desktop computers/laptops, low-end servers and other computer and non-computer systems. Other types of systems include, but are not limited to, communications, I/O and printer systems.
Reliable storage and/or transmission systems are generally designed so that identification and correction of failures is assured with certainty, or at least with a very high probability. The ECC systems are further designed with the intent that the probability of error mis-correction is extremely small or zero for specific classes of error patterns. For example, a code with minimum distance 4 allows for the simultaneous correction of any one error and the detection of any two errors, whereas a code with a minimum distance 3 allows for single error correction with some mis-correction possible with double failures.
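The difference between minimum distance 3 and minimum distance 4 can be illustrated with a small sketch. It assumes a textbook Hamming(7,4) code and its extended (8,4) variant with one overall parity bit; these codes, and the helper names `syndrome`, `decode_d3` and `decode_d4`, are illustrative assumptions and are not the codes or circuits of the memory systems discussed here. Each decoder is given the set of bit positions in error and reports how that error pattern would be handled.

```python
from itertools import combinations

def syndrome(error_positions):
    """XOR of the 1-based positions in error.

    For a Hamming(7,4) code whose parity-check columns are the binary
    representations of 1..7, this equals the syndrome of the error pattern.
    """
    s = 0
    for p in error_positions:
        s ^= p
    return s

def decode_d3(error_positions):
    """Distance-3 decoder: every nonzero syndrome is treated as a single-bit error."""
    s = syndrome(error_positions)
    if s == 0:
        return "no_error"
    # The decoder flips bit s; that is right only if bit s alone was in error.
    return "corrected" if set(error_positions) == {s} else "miscorrected"

def decode_d4(error_positions):
    """Distance-4 decoder: extended code, overall parity bit at position 0 (assumed layout)."""
    s = syndrome(p for p in error_positions if p != 0)
    parity = len(error_positions) % 2  # overall parity check fails on an odd number of flips
    if s == 0 and parity == 0:
        return "no_error"
    if parity == 1:
        # Odd number of flips: assume a single error at position s (0 is the parity bit).
        return "corrected" if set(error_positions) == {s} else "miscorrected"
    # Nonzero syndrome with even overall parity: uncorrectable error, flagged.
    return "detected"

# Enumerate every double-bit error pattern for each code.
d3_double = [decode_d3(e) for e in combinations(range(1, 8), 2)]
d4_double = [decode_d4(e) for e in combinations(range(0, 8), 2)]

print(d3_double.count("miscorrected"), "of", len(d3_double))  # 21 of 21
print(d4_double.count("detected"), "of", len(d4_double))      # 28 of 28
```

In this particular distance-3 code every nonzero syndrome corresponds to a single-bit error, so all double errors are mis-corrected. A shortened distance-3 code leaves some syndromes unused, and double errors that produce one of those syndromes are detected instead, which is why only some mis-correction occurs in general.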
In some instances, the number of memory bits (e.g., redundancy) that can be invested to protect data to be stored or transmitted is limited. For example, designers of high-end desktops and low-end servers may decide that an ECC code having a minimum distance of 3, which does not guarantee the detection of all double errors, is the maximum affordable solution (e.g., the best cost/performance tradeoff). This maximum affordable solution generally results in a measurable data integrity risk due to the inability of the code to reliably detect two or more errors. It would be desirable to decrease the data integrity risk associated with the maximum affordable solution.