This invention relates generally to computer memory and, more particularly, to channel marking for chip mark overflow and calibration errors in a memory system.
Memory device densities have continued to grow as computer systems have become more powerful. With the increase in density comes an increased probability of encountering a memory failure during normal system operations. Techniques to detect and correct bit errors have evolved into an elaborate science over the past several decades. Perhaps the most basic detection technique is the generation of odd or even parity, where the bits of a data word are “exclusive or-ed” (XOR-ed) together to produce a parity bit. If there is a single error present in the data word during a read operation, it can be detected by regenerating parity from the data and then checking to see that it matches the stored (originally generated) parity.
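The parity check described above can be illustrated with a minimal sketch. The function names and the 8-bit example word are assumptions chosen for illustration only; they do not come from any particular memory system.

```python
def even_parity_bit(bits):
    """XOR all bits together; the result makes the total count of 1s even."""
    parity = 0
    for b in bits:
        parity ^= b
    return parity


def parity_mismatch(bits, stored_parity):
    """Regenerate parity on a read and compare it with the stored parity bit."""
    return even_parity_bit(bits) != stored_parity


# Hypothetical 8-bit data word, written together with its parity bit.
word = [1, 0, 1, 1, 0, 0, 1, 0]
stored = even_parity_bit(word)

# A single flipped bit changes the regenerated parity and is detected.
one_error = list(word)
one_error[3] ^= 1

# Two flipped bits cancel out, so plain parity does not detect them.
two_errors = list(one_error)
two_errors[5] ^= 1
```

Here `parity_mismatch(one_error, stored)` reports the error, while `parity_mismatch(two_errors, stored)` does not, which is why parity alone can detect a single error but can neither locate it nor catch every multi-bit error.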
Richard Hamming recognized that the parity technique could be extended to not only detect errors, but to also correct errors by appending an XOR field, an error correction code (ECC) field, to each data, or code, word. The ECC field is a combination of different bits in the word XOR-ed together so that some number of errors can be detected, pinpointed, and corrected. The number of errors that can be detected, pinpointed, and corrected is related to the length of the ECC field appended to the data word. ECC techniques have been used to improve availability of storage systems by correcting memory device (e.g., dynamic random access memory or “DRAM”) failures so that customers do not experience data loss or data integrity issues due to failure of a memory device.
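As an illustration of how an XOR-based ECC field can pinpoint and correct an error, the following is a minimal sketch of the classic Hamming(7,4) code, which appends three check bits to four data bits and corrects any single-bit error. It is a textbook example only, not the ECC used in any particular memory system, and the function names are assumptions.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out in positions 1..7.

    Positions 1, 2, and 4 hold check bits; positions 3, 5, 6, 7 hold data.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data bits, syndrome); a nonzero syndrome is the error position."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], syndrome


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)

# Flip the bit at position 5 to simulate a single-bit memory error.
damaged = list(codeword)
damaged[4] ^= 1
recovered, position = hamming74_correct(damaged)
```

The syndrome is itself a set of regenerated parity checks: each check bit covers a different subset of positions, so the pattern of failing checks identifies the one position that is wrong.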
Redundant array of independent memory (RAIM) systems have been developed to improve performance and/or to increase the availability of storage systems. RAIM distributes data across several independent memory modules (each memory module contains one or more memory devices). Many different RAIM schemes have been developed, each having different characteristics and different pros and cons. Performance, availability, and utilization/efficiency (the percentage of the memory that actually holds customer data) are perhaps the most important characteristics. The tradeoffs associated with various schemes have to be carefully considered because improvements in one attribute can often result in reductions in another.
One method of improving performance and/or reliability in memory systems is to “mark” individual memory chips as potentially faulty. In addition, when an entire memory channel fails, the channel itself can be marked as faulty. Channel marking is a way of ignoring a single channel (one out of five) during the ECC decoding and correcting phase of a fetch to improve correctability of the data. The intent of the channel mark is to use software and/or hardware logic to guard against detected catastrophic channel errors, such as bus errors that cause bad cyclic redundancy check (CRC) or clock problems.
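The effect of ignoring one marked channel out of five can be sketched as follows. This sketch assumes, purely for illustration, that four channels carry data and the fifth carries a simple XOR parity of the other four; an actual RAIM ECC may be considerably more elaborate, and the names `stripe` and `fetch` are hypothetical.

```python
from functools import reduce

NUM_CHANNELS = 5  # one out of five channels may be marked


def xor_all(values):
    """XOR a sequence of integers together."""
    return reduce(lambda a, b: a ^ b, values)


def stripe(data_blocks):
    """Spread four data blocks over five channels; the fifth holds XOR parity."""
    assert len(data_blocks) == NUM_CHANNELS - 1
    return list(data_blocks) + [xor_all(data_blocks)]


def fetch(channels, marked_channel=None):
    """Return the four data blocks, ignoring the marked channel if one is set.

    The marked channel's contents are never trusted; they are rebuilt from
    the XOR of the four remaining channels.
    """
    values = list(channels)
    if marked_channel is not None:
        others = [v for i, v in enumerate(values) if i != marked_channel]
        values[marked_channel] = xor_all(others)
    return values[:NUM_CHANNELS - 1]


data = [0x12, 0x34, 0x56, 0x78]
channels = stripe(data)

# Simulate a catastrophic failure on channel 2 (for example, a bus error).
channels[2] = 0xFF
```

With the mark in place, `fetch(channels, marked_channel=2)` returns the original data even though channel 2 holds garbage, whereas an unmarked `fetch(channels)` returns the corrupted block.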
The software and/or hardware logic also supports two DRAM chip marks which are applied on a per-rank basis to guard against bad chips. These DRAM marks are used to protect the fetch data against chip kills (those chips that have severe defects). However, if there is an overabundance of DRAM errors in a rank, the DRAM marks may not be sufficient to repair the chip errors. This increases the possibility of uncorrectable errors if additional chips fail after the two chips of that rank are marked.
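The per-rank limit of two chip marks, and the overflow that occurs when a third chip in the same rank fails, can be sketched with a small bookkeeping structure. The class and method names are hypothetical, and what the system does on overflow is deliberately left to the caller.

```python
MAX_CHIP_MARKS_PER_RANK = 2  # two DRAM chip marks are supported per rank


class RankMarks:
    """Tracks which chips in a single rank currently carry a chip mark."""

    def __init__(self):
        self.marked_chips = []

    def mark_chip(self, chip_id):
        """Try to mark a chip; return False if the rank has no marks left."""
        if chip_id in self.marked_chips:
            return True  # already marked, nothing more to spend
        if len(self.marked_chips) >= MAX_CHIP_MARKS_PER_RANK:
            return False  # chip mark overflow: both marks are in use
        self.marked_chips.append(chip_id)
        return True


rank = RankMarks()
first = rank.mark_chip(3)     # first failing chip consumes mark 1
second = rank.mark_chip(7)    # second failing chip consumes mark 2
overflow = rank.mark_chip(9)  # third failing chip cannot be marked
```

In this sketch `overflow` is `False`, which corresponds to the situation described above: any further chip failure in that rank is no longer covered by a chip mark.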
In addition, certain calibration errors can cause a high rate of channel errors, which could lead to uncorrectable errors. If this happens, any number of DRAMs may be affected, limiting DRAM mark availability.