The present invention relates generally to computer memory, and more particularly to providing improved error control in a memory system.
Computer systems often require a considerable amount of high speed random access memory (RAM) to hold information such as operating system software, programs, and other data while a computer is powered on and operational. This information is normally binary, composed of patterns of 1's and 0's known as bits of data. The bits of data are often grouped and organized at a higher level. A byte, for example, is typically composed of 8 bits; more generally these groups are called symbols and may consist of any number of bits.
Memory device densities have continued to grow as computer systems have become more powerful. Currently it is not uncommon to have the RAM content of a single computer be composed of hundreds of trillions of bits. Unfortunately, the failure of just a portion of a single RAM device can cause the entire computer system to fail. When memory errors occur, which may be “hard” (repeating) or “soft” (one-time or intermittent) failures, these failures may occur as single cell, multi-bit, full chip or full memory module failures and all or part of the system RAM may be unusable until it is repaired. Repair turn-around-times can be hours or even days, which can have a substantial impact to a business dependent on the computer systems.
The probability of encountering a RAM failure during normal operations has continued to increase as the amount of memory storage in contemporary computers continues to grow.
Techniques to detect and correct bit errors have evolved into an elaborate science over the past several decades. Perhaps the most basic detection technique is the generation of odd or even parity where the number of 1's or 0's in a data word are “exclusive or-ed” (XOR-ed) together to produce a parity bit. For example, a data word with an even number of 1's will have a parity bit of 0 and a data word with an odd number of 1's will have a parity bit of 1, with this parity bit data appended to the stored memory data. If there is a single error present in the data word during a read operation, it can be detected by regenerating parity from the data and then checking to see that it matches the stored (originally generated) parity.
More sophisticated codes allow for detection and correction of errors that can affect groups of bits rather than individual bits. Reed-Solomon codes are an example of a class of powerful and well understood codes that can be used for these types of applications.
These error detection and error correction techniques are commonly used to restore data to its original/correct form in noisy communication transmission media or for storage media where there is a finite probability of data errors due to the physical characteristics of the device. Memory devices generally store data as voltage levels representing a 1 or a 0 in RAM and are subject to both device failure and state changes due to high energy cosmic rays and alpha particles.
Contemporary memory devices are often sensitive to alpha particle hits and cosmic rays causing memory bits to flip. These particles do not damage the device but can create memory errors. These are known as soft errors, and most often affect just a single bit. Once identified, the bit failure can be corrected by simply rewriting the memory location. The frequency of soft errors has grown to the point that it has a noticeable impact on overall system reliability.
Memory error correction codes (also referred to as “error control codes” or “ECCs”) use a combination of parity checks in various bit positions of the data word to allow detection and correction of errors. Every time data words are written into memory, these parity checks are generated and stored with the data. Upon retrieval of the data, a decoder can use the parity bits thus generated together with the data message in order to determine whether there was an error and to proceed with error correction if feasible.
In some cases, error correction techniques are unable to identify errors (“no-correction”) or may identify an error but incorrectly attempt to correct the errors (“mis-correction”). In an example, difficulty may arise in lowering no-correct rates and mis-correct rates for error patterns which are not guaranteed to be correctable by the code given the minimum distance and/or the redundancy of the code.