The present invention relates generally to computer memory and more particularly to error correction in a memory system.
Computer systems often require a considerable amount of high speed random access memory (RAM) to hold information, such as data and programs, temporarily when a computer is powered and operational. This information is normally binary, composed of patterns of 1's and 0's known as bits of data. The bits of data are often grouped and organized at a higher level. A byte, for example, is typically composed of eight bits; more generally these groups or bytes are called symbols and may be made up of any number of bits or sub-symbols.
Memory device densities have continued to grow as computer systems have become more powerful. In some cases, the RAM content of a single computer can be composed of hundreds of trillions of bits. Unfortunately, the failure of just a portion of a single RAM device can cause system-wide issues. Memory errors may be “hard” (repeating) or “soft” (one-time or intermittent) failures, and they may occur as single cell, multi-bit, full chip or full memory module failures. When such errors occur, all or part of the system RAM may be unusable until it is repaired. Repair turn-around times can be hours or even days, which can have a substantial impact on a business dependent on the computer systems. In systems with an array of memory modules (servers, for example), failed memory modules may be isolated temporarily without taking the system down, in order to sustain system operation. However, this reduces the overall system memory and adversely affects performance.
The probability of encountering a RAM failure during normal operations has continued to increase as the amount of memory storage in contemporary computers continues to grow.
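By way of illustration only, the relationship between memory size and failure probability can be sketched numerically. The sketch below assumes that each bit fails independently with the same small probability over some interval, in which case the chance of at least one failure among N bits is 1 - (1 - p)^N. The independence assumption, the function name and the per-bit probability used here are hypothetical and are not taken from any measured device data.

```python
import math


def prob_at_least_one_failure(num_bits: int, per_bit_failure_prob: float) -> float:
    """Return 1 - (1 - p)^N, assuming independent, identical per-bit failures."""
    if num_bits < 0:
        raise ValueError("num_bits must be non-negative")
    if not 0.0 <= per_bit_failure_prob < 1.0:
        raise ValueError("per_bit_failure_prob must be in [0, 1)")
    # log1p and expm1 preserve precision when p is tiny and N is very large.
    return -math.expm1(num_bits * math.log1p(-per_bit_failure_prob))


if __name__ == "__main__":
    p = 1e-15  # hypothetical per-bit failure probability, for illustration only
    for n in (10**9, 10**12, 10**14):
        print(f"{n:.0e} bits -> {prob_at_least_one_failure(n, p):.6f}")
```

Under these assumptions the result rises monotonically with the number of bits, which is consistent with the trend described above.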