1. Field of the Invention
The invention relates to error detection and correction systems, and more particularly, to error detection and correction circuits for storage and retrieval of data in computer memories.
2. Description of the Related Art
Technology has advanced in recent years to provide faster and more powerful computers, and as the technology has progressed, programmers have developed applications to exploit the improved performance. As a result, modern applications perform a wide array of tasks, yet the programs are often quite user friendly due to advances in user interface systems and graphics-oriented applications.
Performance and user friendliness, however, have not been achieved without sacrifice. Versatile and friendly programs often require tremendous memory resources in the computer system to store the program and data. Consequently, computer systems generally include several megabytes of random access memory (RAM) in which the microprocessor stores programs and data, and from which it reads the appropriate portions as the program progresses.
To operate properly, of course, the data read from memory must be an accurate copy of the data originally written to it. An assortment of factors, such as faulty components or inadequate design parameters, may cause errors in the data used by the computer. As a memory system grows, more components are present and subject to failure, and the mean time between failures (MTBF) usually diminishes. Thus, in a large memory array, the potential frequency of errors becomes a significant hazard, and the errors are almost impossible to prevent.
To keep corrupted data from being used, manufacturers incorporate error detection and correction circuitry into computer memory systems. Numerous methods have been developed and implemented, but the simplest and best-known error detection code is the single-bit parity code. To implement a parity code, a single bit is appended to the data word stored in memory. For even parity systems, the value of the parity bit is assigned so that the total number of ones in the stored word, including the parity bit, is even. For odd parity, the parity bit is assigned so that the total number of ones is odd. When the stored word is read, if one of the bits is erroneous, the total number of ones in the word changes from even to odd or from odd to even, so the parity value calculated for the retrieved data no longer matches the stored parity bit. Thus, an error is detected by comparing the stored parity bit to a regenerated check bit calculated for the data word as it is retrieved from memory.
Although a single-bit parity code effectively detects single-bit errors, the scheme has limits. For example, if two bits are erroneous, the parity value for the data remains the same as the stored parity bit, because the total number of ones in the word stays odd or even, and the error goes undetected. In addition, even when an error is detected, the single-bit parity code cannot determine which bit is erroneous, and therefore cannot correct the error.
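The single-bit parity scheme and its two-bit blind spot can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 8-bit word width, and the sample word are hypothetical and are not taken from any particular memory system.

```python
def parity_bit(data: int, even: bool = True) -> int:
    """Return the parity bit that makes the total number of ones even (or odd)."""
    ones = bin(data).count("1")
    bit = ones % 2              # 1 if the data alone holds an odd number of ones
    return bit if even else bit ^ 1


def check(data: int, stored_parity: int, even: bool = True) -> bool:
    """True if the regenerated check bit matches the stored parity bit."""
    return parity_bit(data, even) == stored_parity


word = 0b10110010                   # hypothetical 8-bit data word (four ones)
p = parity_bit(word)                # even parity, so the stored bit is 0

one_bit_error = word ^ 0b00000100   # one bit flipped on read
two_bit_error = word ^ 0b00010100   # two bits flipped on read

print(check(word, p))               # True: no error
print(check(one_bit_error, p))      # False: single-bit error detected
print(check(two_bit_error, p))      # True: double-bit error goes unnoticed
```

Note that even when `check` returns False, nothing in the result identifies which bit was flipped, so the word cannot be repaired from the parity bit alone.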
To provide error correction and more effective error detection, various error correction codes were developed which not only determine that an error has occurred, but also indicate which bit is erroneous. The best-known error correction code is the Hamming code, which appends a series of check bits to the data word as it is stored. When the data word is read, the retrieved check bits are compared to regenerated check bits calculated for the retrieved data word. The results of the comparison indicate whether an error has occurred, and if so, which bit is erroneous. By inverting the erroneous bit, the error is corrected. In addition, a Hamming code that includes an overall parity bit detects two-bit errors which would escape detection under a single-bit parity system. Hamming codes can also be designed to provide for three-bit error detection and two-bit error correction, or any other number of bit errors, by appending more check bits. Thus, Hamming codes commonly provide greater error protection than simple single-bit parity checks.
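As a rough illustration of the comparison described above, the following Python sketch encodes an eight-bit data word with four Hamming check bits plus one overall parity bit, then corrects single-bit errors and flags two-bit errors. The bit layout (check bits at positions 1, 2, 4 and 8, overall parity at index 0) is the common textbook arrangement and is an assumption here, as are all function and variable names.

```python
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]   # code word positions that carry data bits
CHECK_POS = [1, 2, 4, 8]                 # power-of-two positions carry check bits


def encode(data):
    """Encode 8 data bits as a 13-bit code word (index 0 = overall parity)."""
    code = [0] * 13
    for i, pos in enumerate(DATA_POS):
        code[pos] = (data >> i) & 1
    for c in CHECK_POS:
        # Check bit c covers every other position whose index has bit c set.
        code[c] = sum(code[pos] for pos in range(1, 13) if pos & c and pos != c) % 2
    code[0] = sum(code[1:]) % 2          # overall parity over positions 1..12
    return code


def decode(code):
    """Return (status, data); data is None when the word cannot be repaired."""
    syndrome = 0
    for pos in range(1, 13):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions holding a one
    overall = sum(code) % 2              # 0 when the overall parity is consistent
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome <= 12:
        fixed[syndrome] ^= 1             # syndrome names the erroneous position
        status = "corrected"
    elif overall == 1:
        return "uncorrectable", None     # syndrome points outside the code word
    else:
        return "double error detected", None
    data = sum(fixed[pos] << i for i, pos in enumerate(DATA_POS))
    return status, data


code = encode(0xA5)                      # hypothetical data word
damaged = list(code)
damaged[6] ^= 1                          # flip one stored bit
print(decode(code))
print(decode(damaged))
```

The syndrome plays the role of the comparison between stored and regenerated check bits: a nonzero value gives the position of the bit to invert, and the overall parity bit distinguishes a correctable single-bit error from a two-bit error.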
Unfortunately, Hamming codes require several check bits to accomplish the error detection and correction. For example, an eight-bit data word requires five check bits to detect two-bit errors and correct one-bit errors. As the bus grows wider and the number of bits of transmitted data increases, the number of check bits required also increases. Because modern memory buses are often 64 or 128 bits wide, the associated Hamming code would be very long indeed, requiring considerable memory space just for the check bits. Consequently, using Hamming codes in large memory systems is expensive and consumes substantial memory resources.
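The growth in check bits can be estimated with the standard Hamming bound: a single-error-correcting code over m data bits needs the smallest r satisfying 2^r >= m + r + 1, and one further bit adds two-bit error detection. The sketch below applies that bound; the function name is hypothetical, and the figures it prints for widths other than eight bits follow from the bound rather than from the text above. Codes that correct or detect more bit errors need additional check bits beyond these counts.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed to correct one-bit and detect two-bit errors."""
    r = 0
    while 2 ** r < data_bits + r + 1:   # Hamming bound for single-error correction
        r += 1
    return r + 1                        # one extra bit for two-bit error detection


for m in (8, 16, 32, 64, 128):
    print(m, secded_check_bits(m))
```

For an eight-bit word this yields five check bits, matching the figure given above.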
A further problem is caused by modern RAM chips. In early memory systems, RAM chips were organized so that each chip provided one bit of data for each address. Current RAM chips, however, are frequently organized into sets of four bits of data for each address. If one of these RAM chips fails, the result is four potentially erroneous data bits. Unless the error correction code is designed for four-bit error detection or correction, a four-bit error may go completely undetected. Incorporating a four-bit error detection and correction code in a 64-bit or 128-bit memory system, however, would require numerous check bits and a substantial portion of the memory space. Consequently, to detect errors caused by a RAM chip failure while a program is in progress, designers have been forced to employ lengthy, memory-consuming check bit schemes, or simply hope that the erroneous data causes a system error or failure before any significant damage is done.
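To make the four-bit failure mode concrete, the following sketch corrupts the four bits supplied by one RAM chip in a 64-bit word and counts how many of the fifteen possible error patterns slip past a check. Single-bit parity is used as the check purely because it is the simplest code to demonstrate with; the word value, the chip index, and the names are hypothetical.

```python
def parity(word: int) -> int:
    """Even-parity check bit for the whole word."""
    return bin(word).count("1") % 2


word = 0x0123456789ABCDEF        # hypothetical 64-bit data word
stored = parity(word)            # parity bit written alongside the word

chip = 5                         # hypothetical failed chip supplying bits 20..23
missed = []
for pattern in range(1, 16):     # every nonzero error pattern within the nibble
    corrupted = word ^ (pattern << (4 * chip))
    if parity(corrupted) == stored:
        missed.append(pattern)   # error is present but the check still passes

print(len(missed), "of 15 error patterns go undetected")
```

Every pattern that flips an even number of bits within the nibble leaves the parity unchanged, which is why a code not designed around the chip's four-bit organization can miss a chip failure entirely.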
Once an error is finally detected, its source must be identified. Because the error may remain undetected until a system error or failure occurs, the location of the faulty RAM chip cannot ordinarily be determined without a hardware test of each memory module. In a large memory system, testing each individual module for operability is prohibitively expensive because of repair costs and computer system downtime, particularly if the error is intermittent.