Future memory technologies require strong error-correcting code (ECC) management because the raw bit error rate (BER) rises as memory technology scales and is also high in new or immature memory technologies. Standard ECC dynamic random-access memory (DRAM) systems automatically correct a single data bit in error and are guaranteed to detect two data bits in error. This capability is often referred to as Single Error Correction/Double Error Detection (SEC/DED).
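The SEC/DED behavior described above can be illustrated with a minimal sketch. It assumes a toy extended Hamming (8,4) code rather than the wider codes used on real ECC DRAM, and the names `encode`, `decode`, and `DATA_POSITIONS` are hypothetical, chosen only for this illustration.

```python
from functools import reduce
from operator import xor

# Assumption: toy extended Hamming (8,4) layout. Position 0 holds the overall
# parity bit, positions 1, 2, 4 hold Hamming parity bits, the rest hold data.
DATA_POSITIONS = (3, 5, 6, 7)


def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    word = [0] * 8
    for pos, bit in zip(DATA_POSITIONS, data):
        word[pos] = bit
    # Parity bit p covers every other position whose index has bit p set.
    for p in (1, 2, 4):
        word[p] = reduce(xor, (word[i] for i in range(1, 8) if i & p and i != p))
    # Overall parity over positions 1..7 is what enables double-error detection.
    word[0] = reduce(xor, word[1:])
    return word


def decode(word):
    """Return (status, data): 'ok', 'corrected', or 'uncorrectable' (data is None)."""
    word = list(word)
    # The syndrome is the XOR of the indices of all set bits in positions 1..7.
    syndrome = reduce(xor, (i for i in range(1, 8) if word[i]), 0)
    overall = reduce(xor, word)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: treat as a single-bit error at index `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity with a nonzero syndrome: two bits are in error.
        return "uncorrectable", None
    return status, [word[i] for i in DATA_POSITIONS]
```

In this sketch, flipping any one bit of `encode([1, 0, 1, 1])` decodes as `"corrected"` with the original data, and flipping any two bits decodes as `"uncorrectable"`. Errors of three or more bits fall outside the SEC/DED guarantee and may be miscorrected.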
ECC memory requires that some bits be dedicated to actual data and other bits be dedicated to the ECC. DRAM devices are available in various data widths (the number of data bits per device). For example, dual in-line memory modules (DIMMs) used in servers may be built using multiple ×4 (4 data bits), ×8, or ×16 DRAM devices.
Many types of errors that occur in DRAM devices impact only one data bit, regardless of the width of the device. However, some error modes result in more than one data bit being in error, up to the entire data width of the device. Any of these multi-bit failure modes results in a fatal error for a SEC/DED memory system, because standard ECC can correct only a single bit. As DRAM devices become denser, the percentage of errors that result in multi-bit failures increases. Chipkill correct is the ability of the memory system to withstand a multi-bit failure within a DRAM device, and it is widely used as a commercial solution on high-end servers to reduce system-level BER.
FIG. 1 is a block diagram illustrating an example of a conventional Chipkill scheme based on a Reed-Solomon error correction code. Dual in-line memory modules (DIMMs) 100 are shown, each comprising 18 memory chips 102 (#0 through #17) that provide 4 bits each (×4 chips). To provide Chipkill-corrected memory, each data bit of one of the memory chips 102 is included in a separate “ECC word” that is used by an ECC algorithm to provide error detection and correction.
The Chipkill scheme may use 36 (18+18) 4-bit symbols from the two DIMMs to form a 144-bit ECC word 104 comprising 128 data bits and 16 ECC bits in lockstep mode (two memory channels operating as a single channel, so that each write and read operation moves a data word two channels wide). Such a Chipkill scheme achieves single-symbol correction (SSC), i.e., single-chip error correction, and double-symbol detection (DSD), i.e., double-chip error detection. However, because this scheme requires two-DIMM lockstep operation (×144 bus width), it halves rank-level and bank-level parallelism and doubles prefetching energy with a burst length of 8, compared with single-DIMM non-lockstep operation.
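The arithmetic behind the ECC word 104 and the lockstep cost can be checked with a short sketch. The helper names `ecc_word_geometry` and `symbols_hit` are hypothetical, and the figure of 64 data bits per DIMM is an assumption inferred from the 128 data bits spread across two DIMMs.

```python
def ecc_word_geometry(dimms, chips_per_dimm=18, bits_per_chip=4,
                      data_bits_per_dimm=64, burst_length=8):
    """Derive ECC word sizes for `dimms` DIMMs operated as one channel."""
    symbols = dimms * chips_per_dimm         # one 4-bit symbol per chip
    word_bits = symbols * bits_per_chip      # total bus width in bits
    data_bits = dimms * data_bits_per_dimm   # assumption: 64 data bits per DIMM
    return {
        "symbols": symbols,
        "word_bits": word_bits,
        "data_bits": data_bits,
        "ecc_bits": word_bits - data_bits,
        # Data bytes moved by one burst; this doubles in two-DIMM lockstep.
        "data_bytes_per_burst": data_bits * burst_length // 8,
    }


def symbols_hit(bad_bits, bits_per_symbol=4):
    """Return the set of symbol indices touched by the given bad bit indices."""
    return {bit // bits_per_symbol for bit in bad_bits}


lockstep = ecc_word_geometry(dimms=2)  # 36 symbols, 144-bit word, 16 ECC bits
single = ecc_word_geometry(dimms=1)    # 18 symbols, 72-bit word, 8 ECC bits
```

Under these assumptions, a complete failure of one ×4 chip, for example bits 20 through 23, touches exactly one symbol (`symbols_hit(range(20, 24))` contains only symbol 5), which is the case single-symbol correction is meant to cover. The lockstep configuration moves 128 data bytes per burst against 64 for a single DIMM, consistent with the doubled prefetch noted above.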