Memory system reliability is a serious and growing concern in modern servers and blades. Existing memory protection mechanisms require one or more of the following: activation of a large number of chips on every memory access, increased access granularity, and increased storage overhead. These requirements lead to increased dynamic random access memory (DRAM) access times, reduced system performance, and substantially higher energy consumption. Current commercial chipkill-level reliability mechanisms may be based on conventional error-correcting codes (ECC), such as Reed-Solomon (RS) codes, symbol-based codes, and the like. However, current ECC codes restrict memory system design to the use of ×4 DRAMs. Further, for a given capacity, dual in-line memory modules (DIMMs) with narrow I/O chips (i.e., ×4 DRAM chips) consume more energy than those with wider I/O chips (i.e., ×8, ×16, or ×32 chips).
This non-availability of efficient chipkill mechanisms is one reason for the lack of adoption of wide input/output (I/O) DRAMs despite the advantages they offer. In addition, current ECC codes are computed over large data words to increase coding efficiency. This ECC code handling results in large access granularities, activating a large number of chips or even ranks for every memory operation, and in increased energy consumption. Area, density, and cost constraints can lead to overfetch to some extent within a rank of chips, but imposing additional inefficiency in order to provide fault tolerance should be avoided. This handling may also reduce bank-level and rank-level parallelism, which diminishes the ability of DRAM to supply data to high-bandwidth I/O such as photonic channels. Finally, conventional ECC codes employ complex Galois field arithmetic that is inefficient in terms of both latency and circuit area.
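The relationship between codeword width, chip I/O width, and the number of chips activated per access can be illustrated with simple arithmetic. The following is a minimal sketch only; the function names and the three configurations (a 64-bit data word with 8 check bits, and a 128-bit data word with 16 check bits) are assumptions chosen for illustration and are not taken from any particular commercial mechanism.

```python
def chips_per_access(codeword_bits: int, chip_width_bits: int) -> int:
    """Chips that must be activated so that one full codeword is transferred per beat."""
    if codeword_bits % chip_width_bits != 0:
        raise ValueError("codeword width must be a multiple of the chip I/O width")
    return codeword_bits // chip_width_bits


def storage_overhead(data_bits: int, check_bits: int) -> float:
    """Check bits stored per data bit."""
    return check_bits / data_bits


# Hypothetical configurations: (data bits, check bits, chip I/O width).
configs = {
    "64+8 bits on x4 chips": (64, 8, 4),
    "64+8 bits on x8 chips": (64, 8, 8),
    "128+16 bits on x4 chips": (128, 16, 4),
}

for name, (data, check, width) in configs.items():
    chips = chips_per_access(data + check, width)
    overhead = storage_overhead(data, check)
    print(f"{name}: {chips} chips per access, {overhead:.1%} storage overhead")
```

Under these assumptions, widening the codeword from 72 to 144 bits on ×4 chips doubles the number of chips activated per access (18 to 36) while the storage overhead stays at 12.5%, whereas moving the 72-bit codeword to ×8 chips halves the chip count (18 to 9).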