The present invention relates generally to computer memory, and more specifically, to stale data detection in a marked channel for a scrub in a computer memory.
Memory device densities have continued to grow as computer systems have become more powerful. With the increase in density comes an increased probability of encountering a memory failure during normal system operations. Techniques to detect and correct bit errors have evolved into an elaborate science over the past several decades. One detection technique is the generation of odd or even parity, in which the bits of a data word are “exclusive or-ed” (XOR-ed) together to produce a parity bit. If there is a single error present in the data word during a read operation, it can be detected by regenerating parity from the data and then checking whether it matches the stored (originally generated) parity.
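The parity check described above can be illustrated with a minimal sketch. The function names `even_parity_bit` and `check_parity` and the sample word are assumptions made for illustration only; they do not come from any particular memory controller.

```python
from functools import reduce
from operator import xor

def even_parity_bit(bits):
    """XOR all bits of the word together; the result is 1 when the count of 1s is odd."""
    return reduce(xor, bits, 0)

def check_parity(bits, stored_parity):
    """Regenerate parity from the data and compare it with the stored parity bit."""
    return even_parity_bit(bits) == stored_parity

# Hypothetical 8-bit data word, written with its parity bit.
word = [1, 0, 1, 1, 0, 0, 1, 0]
stored = even_parity_bit(word)

# A single flipped bit on read changes the regenerated parity.
single_error = word.copy()
single_error[3] ^= 1

# Two flipped bits cancel out, so plain parity does not detect them.
double_error = word.copy()
double_error[3] ^= 1
double_error[5] ^= 1
```

In this sketch, `check_parity(single_error, stored)` is false while `check_parity(double_error, stored)` is true, which shows why a single parity bit detects only an odd number of bit errors.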
The parity technique may be extended to not only detect errors, but to also correct errors by appending an XOR field, i.e., an error correction code (ECC) field, to each data, or code, word. The ECC field is a combination of different bits in the word XOR-ed together so that some number of errors can be detected, pinpointed, and corrected. The number of errors that can be detected, pinpointed, and corrected is related to the length of the ECC field appended to the data word. ECC techniques have been used to improve availability of storage systems by correcting memory device (e.g., dynamic random access memory or “DRAM”) failures so that customers do not experience data loss or data integrity issues due to failure of a memory device.
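As a small illustration of how several XOR combinations can pinpoint and correct an error, the following sketch uses a textbook Hamming(7,4) code. This is an assumption chosen for brevity, not the ECC used in any RAIM system described here; the names `encode` and `decode` are hypothetical.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def decode(codeword):
    """Return (data bits, error position); position 0 means no error was detected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # the syndrome spells out the failing position
    if position:
        c[position - 1] ^= 1  # correct the single-bit error in place
    return [c[2], c[4], c[5], c[6]], position

# Hypothetical data word; flip codeword position 5 (index 4) to simulate a bit error.
sent = encode([1, 0, 1, 1])
received = sent.copy()
received[4] ^= 1
recovered, error_position = decode(received)
```

Here three check bits protect four data bits against any single-bit error. A longer ECC field can detect and correct more errors, at the cost of more storage overhead.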
Redundant array of independent memory (RAIM) systems have been developed to improve performance and/or to increase the availability of storage systems. RAIM distributes data across several independent memory modules that each contain one or more memory devices. Many different RAIM schemes have been developed, each with its own characteristics and trade-offs. Performance, availability, and utilization/efficiency (the percentage of the memory that actually holds customer data) vary across different RAIM schemes. Improvements in one attribute may result in reductions in another.
One method of improving performance and/or reliability in memory systems is to mark individual memory chips as potentially faulty. In addition, when an entire memory channel fails, the channel itself may be marked as faulty. Channel marking allows the RAIM system to ignore a single channel (e.g., one out of five) during the ECC decoding and correcting phase of a fetch to improve correctability of the data. Using software and/or hardware logic, the channel mark guards against detected catastrophic channel errors, such as bus errors that cause bad cyclic redundancy check (CRC) or clock problems.
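The idea of ignoring one marked channel out of five can be sketched as follows. The text does not specify the RAIM ECC, so this sketch substitutes a simple XOR parity channel (four data channels plus one parity channel) purely as an assumption. The names `store`, `fetch`, and `marked` are hypothetical.

```python
from functools import reduce
from operator import xor

NUM_CHANNELS = 5  # the "one out of five" case; channel 4 holds parity in this sketch

def store(data_words):
    """Spread four data words over channels 0..3 and put their XOR on channel 4."""
    if len(data_words) != NUM_CHANNELS - 1:
        raise ValueError("expected one data word per data channel")
    return list(data_words) + [reduce(xor, data_words, 0)]

def fetch(channels, marked=None):
    """Return the four data words, ignoring whatever the marked channel returned."""
    repaired = list(channels)
    if marked is not None:
        # Rebuild the marked channel's contents from the four unmarked channels.
        survivors = [value for i, value in enumerate(channels) if i != marked]
        repaired[marked] = reduce(xor, survivors, 0)
    return repaired[:NUM_CHANNELS - 1]

# Hypothetical stripe; channel 2 then returns garbage, e.g. after a bus failure.
original = [0x12, 0x34, 0x56, 0x78]
stripe = store(original)
stripe[2] = 0xFF
result = fetch(stripe, marked=2)
```

With the mark in place, `result` equals `original` even though channel 2 returned bad data. Without the mark, the fetch in this simplified scheme would return the corrupted word unchanged.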
The software and/or hardware logic also supports two DRAM chip marks which are applied on a per-rank basis to guard against bad chips. These DRAM marks are used to protect the fetch data against chip kills (i.e., chips that have severe defects). However, if there is an overabundance of DRAM errors in a rank, the DRAM marks may not be sufficient to repair the chip errors. This increases the possibility for uncorrectable errors if additional chips fail after the two chips of that rank are marked. In addition, certain calibration errors can cause a high rate of channel errors that could lead to uncorrectable errors. If this happens, any number of DRAMs may be affected, causing DRAM mark availability to be limited.
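The limit of two chip marks per rank can be illustrated with a small bookkeeping sketch. The class `RankMarks`, the method `mark_chip`, and the chip identifiers are hypothetical, and the sketch does not model how real hardware responds when the marks run out.

```python
MAX_CHIP_MARKS_PER_RANK = 2  # from the text: two DRAM chip marks per rank

class RankMarks:
    """Tracks which chips in one rank are currently covered by a chip mark."""

    def __init__(self):
        self.marked_chips = []

    def mark_chip(self, chip_id):
        """Try to mark a chip; return True if the chip is covered by a mark."""
        if chip_id in self.marked_chips:
            return True  # already marked, no new mark consumed
        if len(self.marked_chips) < MAX_CHIP_MARKS_PER_RANK:
            self.marked_chips.append(chip_id)
            return True
        return False  # both marks are in use; this chip failure is not covered

# Hypothetical rank in which three different chips fail over time.
rank = RankMarks()
first = rank.mark_chip(3)
second = rank.mark_chip(7)
third = rank.mark_chip(9)
```

In this sketch `first` and `second` are true and `third` is false, which corresponds to the case above where an additional chip fails after both marks of the rank are in use.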
Examples of RAIM systems may be found, for instance, in U.S. Patent Publication Number 2011/0320864, titled “Heterogenous Recovery in a Redundant Memory System”, filed on Jun. 24, 2010, the contents of which are hereby incorporated by reference in their entirety; in U.S. Patent Publication Number 2011/0320869, titled “Heterogenous Recovery in a Redundant Memory System”, filed on Jun. 24, 2010, the contents of which are hereby incorporated by reference in their entirety; and in U.S. Patent Publication Number 2012/0173936, titled “Channel Marking for Chip Mark Overflow and Calibration Errors”, filed on Dec. 29, 2010, the contents of which are hereby incorporated by reference in their entirety.