A vital component of virtually all computer systems is a semiconductor or solid-state memory system. The memory system often holds both the programming instructions for a processor of the computer system and the data upon which those instructions are executed. In one example, the memory system may include one or more dual in-line memory modules (DIMMs), with each DIMM carrying multiple dynamic random access memory (DRAM) integrated circuits (ICs). Other memory technologies, such as static random access memories (SRAMs), and other memory organizational structures, such as single in-line memory modules (SIMMs), are also employed in a variety of computer systems. One or more processors may be coupled with the memory modules through a memory controller, which translates data requests from the processor into accesses to the data held in the memory modules. Many systems also provide one or more levels of cache memory residing between the memory modules and the memory controller to facilitate faster access to frequently requested data.
Computer systems have benefited from ongoing advances in both the speed and capacity of the memory devices, such as DRAMs, employed in today's memory systems. However, these advancements are often accompanied by increasing memory data error rates. More specifically, both “hard errors” (permanent defects in a memory device, such as one or more defective memory cells) and “soft errors” (data errors of a temporary nature, such as inversion of data held within one or more memory cells) tend to become more prevalent with each new technology generation. To combat these errors, memory controllers in commercial computer systems now commonly support an error detection and correction (EDC) scheme in which redundant EDC data is stored along with the customer, or “payload,” data. When these data are later read from the memory, the memory controller processes the EDC data and the payload data in an effort to detect and correct at least one data error. The number of errors that may be detected or corrected depends in part on the power of the EDC scheme utilized, as well as on the amount of EDC data employed relative to the amount of payload data being protected. Typically, the more EDC data utilized, the greater the number of errors that can be detected and corrected, but also the greater the memory capacity overhead incurred.
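As a concrete illustration of how redundant EDC data allows a controller to detect and correct errors, the following Python sketch implements an extended Hamming (8,4) single-error-correct, double-error-detect (SECDED) code. It is not the scheme of any particular memory controller; the function names `encode` and `decode` and the 4-bit data width are illustrative assumptions chosen to keep the example small.

```python
from __future__ import annotations

# Illustrative SECDED sketch (extended Hamming (8,4)); hypothetical helper names,
# not the EDC scheme of any specific memory controller.

def encode(nibble: int) -> list[int]:
    """Encode 4 payload bits into an 8-bit codeword.

    Index 0 holds an overall parity bit. Indices 1..7 follow classic
    Hamming(7,4) layout: positions 1, 2 and 4 are check bits, and
    positions 3, 5, 6 and 7 carry the payload bits.
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    word = [0] * 8
    word[3], word[5], word[6], word[7] = d
    word[1] = word[3] ^ word[5] ^ word[7]  # covers positions 1, 3, 5, 7
    word[2] = word[3] ^ word[6] ^ word[7]  # covers positions 2, 3, 6, 7
    word[4] = word[5] ^ word[6] ^ word[7]  # covers positions 4, 5, 6, 7
    word[0] = sum(word[1:]) % 2            # makes the total number of 1s even
    return word


def decode(word: list[int]) -> tuple[int | None, str]:
    """Return (payload, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    w = list(word)  # work on a copy so the caller's data is untouched
    syndrome = 0
    for pos in range(1, 8):
        if w[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(w) % 2     # 1 means an odd number of bits flipped

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Assume a single-bit error; syndrome 0 means the overall parity bit flipped.
        w[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"

    payload = w[3] | (w[5] << 1) | (w[6] << 2) | (w[7] << 3)
    return payload, status
```

For example, flipping one bit of `encode(0b1011)` and decoding yields `(11, 'corrected')`, while flipping any two bits yields `(None, 'uncorrectable')`. The example also reflects the overhead trade-off described above: this toy code spends 4 check bits on 4 payload bits, whereas the widely used (72,64) SECDED arrangement spends 8 check bits on 64 payload bits, a 12.5% overhead.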
More advanced memory controllers supplement their EDC scheme with a “chipkill” capability, in which the data within an entire memory device, such as a DRAM, may be ignored, or “erased,” and then recreated using the EDC data. This capability allows an entire device to fail while the data remains fully recoverable. Further, some memory systems may also provide one or more spare memory devices to be used as replacements for failing memory devices. However, as with EDC, the use of spare devices increases the cost and memory overhead associated with the memory system. Other systems may supply a spare DIMM for replacing an entire in-use DIMM that includes one or more memory defects affecting large portions of the DIMM. In yet another example, the memory controller itself may include a small amount of storage to replace one or more defective cache “lines” of data stored in the memory devices. In other implementations, computer system firmware may report a defect detected by the EDC scheme to an operating system (OS), which may then deallocate one or more constant-sized OS-level “pages” of memory containing the defect.
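Chipkill-style recovery relies on the fact that, once a failing device is known and its data treated as erased, the remaining EDC information suffices to recompute its contents. The Python sketch below illustrates only this erasure idea using a single XOR parity device; commercial chipkill schemes typically use stronger symbol-based codes such as Reed-Solomon codes, and the names `parity_device` and `rebuild_erased` are hypothetical.

```python
from __future__ import annotations

from functools import reduce

# Simplified stand-in for chipkill: one XOR parity device per group of data devices.
# This is an illustrative assumption, not a real memory controller's code.

def parity_device(devices: list[bytes]) -> bytes:
    """Return the byte-wise XOR of all given devices (the hypothetical check device)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*devices))


def rebuild_erased(devices: list[bytes | None], parity: bytes) -> list[bytes]:
    """Recreate the single device marked as erased (None) from the survivors and parity."""
    missing = [i for i, d in enumerate(devices) if d is None]
    if len(missing) != 1:
        raise ValueError("this XOR sketch can rebuild exactly one erased device")
    survivors = [d for d in devices if d is not None] + [parity]
    # XOR of every surviving device plus the parity equals the erased device's data.
    rebuilt = parity_device(survivors)
    restored = list(devices)
    restored[missing[0]] = rebuilt
    return restored
```

Because the caller must supply the position of the erased device, this parity scheme can rebuild a failed device but cannot by itself identify which device failed; that locating ability is one reason practical chipkill implementations use more powerful codes.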
Even with these advanced memory protection mechanisms, further advances in memory technology often bring attendant increases in hard and soft error rates, thus reducing device reliability. New memory device generations also often introduce new memory failure modes. For example, a memory defect that previously caused one or two memory cells to fail may instead affect four or eight memory cells. Thus, such advances in memory technology may have the unintended effect of reducing the effectiveness of the EDC and related schemes currently employed in computer memory systems.