A vital component of virtually all computer systems is a semiconductor or solid-state memory system. Such memory often holds both the programming instructions for a processor of the computer system and the data upon which those instructions operate. In one example, the memory system may include one or more dual in-line memory modules (DIMMs), with each DIMM carrying multiple dynamic random access memory (DRAM) integrated circuits (ICs). Other memory technologies, such as static random access memories (SRAMs), and other memory organizational structures, such as single in-line memory modules (SIMMs), are also employed in a variety of computer systems. In addition, one or more processors may be coupled with the memory modules through a memory controller, which translates data requests from the processor into accesses to the data held in the memory modules.
Computer systems have benefited from the ongoing advances made in both the speed and capacity of memory devices, such as DRAMs, employed in memory systems today. However, increasing memory data error rates often accompany these advancements. More specifically, both “hard errors” (permanent defects in a memory device, such as one or more defective memory cells) and “soft errors” (data errors of a temporary nature, such as inversion of data held within one or more memory cells) tend to become more prevalent with each new technology generation.
Some of these memory defects within individual memory devices are discovered during the manufacturing process by way of test equipment writing multiple data patterns to each of the device memory locations, reading the data back, and comparing the data read with the data written. If the test equipment detects a defective memory location, the device may be discarded. In other cases, the device may incorporate one or more spare memory locations configured to replace the defective memory locations by way of fusible links programmed via the tester so that memory requests for a defective location are instead redirected within the device to an associated spare location. By incorporating spare memory in this fashion, device mortality at the manufacturing site may be greatly reduced.
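The manufacturing flow described above can be illustrated with a minimal Python sketch. It is not a description of any particular tester or device: the `SimulatedDevice` class, the stuck-at-zero fault model, the four data patterns, and the `remap` table standing in for programmed fusible links are all assumptions made for illustration.

```python
class SimulatedDevice:
    """Hypothetical memory device with a few spare locations."""

    def __init__(self, size, spares, stuck_at_zero=()):
        self.size = size                          # addressable locations
        self.cells = [0] * (size + spares)        # normal cells, then spares
        self.free_spares = list(range(size, size + spares))
        self.remap = {}                           # stands in for fusible links
        self.stuck = set(stuck_at_zero)           # physical cells that always read 0

    def _physical(self, addr):
        # Requests for a repaired address are redirected to its spare.
        return self.remap.get(addr, addr)

    def write(self, addr, value):
        phys = self._physical(addr)
        self.cells[phys] = 0 if phys in self.stuck else value & 0xFF

    def read(self, addr):
        return self.cells[self._physical(addr)]


# Assumed test patterns: all zeros, all ones, and two checkerboards.
PATTERNS = (0x00, 0xFF, 0xAA, 0x55)


def find_defects(dev):
    """Write each pattern everywhere, read it back, and collect mismatches."""
    bad = set()
    for pattern in PATTERNS:
        for addr in range(dev.size):
            dev.write(addr, pattern)
        for addr in range(dev.size):
            if dev.read(addr) != pattern:
                bad.add(addr)
    return sorted(bad)


def repair(dev):
    """Map each defective address to a spare; False means discard the device."""
    for addr in find_defects(dev):
        if not dev.free_spares:
            return False
        dev.remap[addr] = dev.free_spares.pop(0)
    return not find_defects(dev)  # retest after repair
```

In this sketch, a device with two stuck locations and two spares passes after repair, while a device with more defects than spares is rejected, which mirrors the choice between repairing and discarding a part.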
However, after the memory device is placed in an operating computer system, both hard and soft errors may still be encountered during normal use of the device. To combat these errors, memory controllers in commercial computer systems now often support an error detection and correction (EDC) scheme in which redundant EDC data is stored along with the customer, or “payload,” data. When these data are then read from the memory, the memory controller processes the EDC data and the payload data in an effort to detect and correct at least one data error in the data. The number of errors that may be detected or corrected depends in part on the nature of the EDC scheme utilized, as well as the amount of EDC data employed compared to the amount of payload data being protected. Typically, the more EDC data utilized, the higher the number of errors that may be detected and corrected, but also the higher the amount of memory capacity overhead incurred.
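As one illustration of the trade-off between EDC data and payload data, the following Python sketch implements a small single-error-correcting, double-error-detecting code: a Hamming(7,4) code extended with an overall parity bit. The choice of code, the 4-bit payload, and the function names `encode` and `decode` are assumptions made for brevity; the text above does not prescribe a specific EDC scheme, and practical controllers protect much wider words.

```python
def encode(nibble):
    """Encode 4 payload bits into an 8-bit codeword (list of 0/1 values)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                    # index 0: overall parity; 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1             # makes total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos           # XOR of the positions of all set bits
    odd_parity = sum(code) & 1        # 1 means an odd number of bits flipped

    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Treated as a single-bit error: the syndrome is the flipped position
        # (0 means the overall parity bit itself was flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Here four EDC bits protect four payload bits, a 100% capacity overhead, in exchange for correcting any single-bit error and detecting any double-bit error. Codes over wider words reduce the relative overhead.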
More advanced memory controllers supplement their EDC scheme with a “chipkill” capability, in which the data within an entire memory device, such as a DRAM, may be ignored, or “erased,” and then recreated using the EDC data. Such functionality allows an entire device to fail while maintaining the capability to fully recover the data. Further, some memory systems may also provide one or more spare memory devices to be used as replacements for other failing memory devices. However, similar to the use of EDC, the use of spare devices also increases the cost and memory overhead associated with the memory system. Other systems may supply a spare DIMM for replacing an entire in-use DIMM. In yet another example, the memory controller itself may include a small amount of storage to replace one or more memory locations in the memory devices. In other implementations, computer system firmware may report a defect detected by the EDC scheme to an operating system (OS), which may then replace a constant-sized OS-level “page” of memory containing the defect with another memory page previously allocated to the OS.
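The erasure idea behind chipkill can be sketched in Python under simplifying assumptions. The sketch below uses a single XOR check symbol stored on one extra device and assumes the failed device has already been identified; production chipkill schemes generally use stronger symbol-based codes that can also locate the failing device. The names `write_stripe` and `read_stripe` are hypothetical.

```python
from functools import reduce
from operator import xor


def write_stripe(payload):
    """Return per-device contents: the payload symbols plus one XOR check symbol."""
    return list(payload) + [reduce(xor, payload, 0)]


def read_stripe(devices, erased):
    """Recover the payload while ignoring the device at index `erased`."""
    # XOR of all surviving symbols equals the symbol the erased device held.
    survivors = [v for i, v in enumerate(devices) if i != erased]
    restored = list(devices)
    restored[erased] = reduce(xor, survivors, 0)
    return restored[:-1]              # drop the check symbol


# Usage: four payload devices plus one check device; device 2 fails entirely.
stripe = write_stripe([0x3, 0xA, 0xF, 0x1])
stripe[2] = 0x0                       # contents of the failed device are garbage
recovered = read_stripe(stripe, erased=2)
```

Whatever value the failed device returns is ignored, so the payload is recreated solely from the surviving devices and the check symbol.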
Even with these advanced memory protection mechanisms, further memory technological advances often involve attendant increases in hard and soft error rates, thus reducing device reliability. Also, new memory device generations sometimes introduce previously unknown memory failure modes. For example, memory defects previously causing one or two memory cells to fail may instead affect four or eight memory cells. Thus, such advances in memory technology may have the unintended effect of reducing the effectiveness of the EDC and related schemes currently employed in computer memory systems.