1. Field of the Invention
This invention relates generally to Error Correcting Code (ECC) techniques and particularly to an ECC method in which data words are divided into multiple domains for ECC purposes.
2. Description of the Prior Art
The use of error correction and detection techniques when transmitting or storing binary data is of vital importance to ensure data integrity in digital data processing systems. In any digital system, noise in the channel between transmitter and receiver can introduce errors, such that individual bits may be inverted and an improper message received. Linear block codes have been devised to detect and correct errors to improve data integrity. Using these codes, the transmitted message consists of information bits, and some number of parity bits, or check bits. The check bits are calculated and generated at the transmitter. Check bits are transmitted with the actual information bits and are decoded by the receiver.
Syndrome bits are generated at the receiver by decoding the received information and check bits. Using the syndrome bits it is possible to determine whether one or more errors have occurred and, for some codes, the bit positions in the binary word at which the errors occurred. The number of errors which may be detected and/or corrected depends upon the code used.
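By way of illustration only, syndrome decoding can be sketched in Python using the well-known (7,4) Hamming code. This small code is chosen for brevity and is not taken from any of the references discussed here; the function names `encode`, `syndrome` and `correct` are hypothetical.

```python
def encode(data_bits):
    """Encode 4 data bits as a 7-bit Hamming codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # check bit at position 1, covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # check bit at position 2, covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # check bit at position 4, covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def syndrome(word):
    """XOR together the positions of all 1 bits; 0 means no error detected."""
    s = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            s ^= position
    return s


def correct(word):
    """Return a copy of the word with a single-bit error (if any) repaired."""
    fixed = list(word)
    s = syndrome(fixed)
    if s:
        fixed[s - 1] ^= 1  # a nonzero syndrome names the failing position
    return fixed
```

For a single inverted bit, the syndrome equals the position of that bit, which is why the receiver can repair the word without retransmission. Two inverted bits also produce a nonzero syndrome, but this distance-3 code would then "correct" the wrong position.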
Hamming codes have been determined to be the most efficient in terms of the least number of parity bits for a given number of information bits, and they are commonly used in data processing systems. Using a Hamming code with a Hamming distance of 3 (i.e., each word in the code, data bits and check bits together, differs from every other word in at least 3 bit positions), single bit errors can be corrected or, if no correction is attempted, double bit errors can be detected. If it is necessary to implement more than single bit error correction, then the Hamming distance of the code must be increased. The error correction capability of a code is given by the following formula:
Error Correction Capability = [(Dmin - 1)/2], where Dmin is the minimum Hamming distance and the brackets [ ] denote the integer part of (Dmin - 1)/2. From this equation, it can be seen that a code with a minimum Hamming distance of 4 still corrects only single bit errors; the additional unit of distance allows double bit errors to be detected at the same time.
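The relationship between minimum distance and error handling capability can be checked numerically. The following is a minimal Python sketch; the function names are hypothetical, and the two toy codes in the usage note are chosen only because their minimum distances are easy to verify by hand.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions in which two codewords (as integers) differ."""
    return bin(a ^ b).count("1")


def min_distance(codewords):
    """Minimum Hamming distance (Dmin) over all pairs of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def correction_capability(d_min):
    """Integer part of (Dmin - 1) / 2."""
    return (d_min - 1) // 2


def detection_with_correction(d_min, t):
    """Errors detectable while also correcting t errors: t + e + 1 <= Dmin."""
    return d_min - t - 1
```

For example, `min_distance([0b000, 0b111])` is 3 and `min_distance([0b0000, 0b1111])` is 4; both give a correction capability of 1, but only the distance-4 code leaves room to detect two errors while correcting one.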
Data processing and computer systems typically use a modified Hamming code of distance four for error correction and detection. To increase the Hamming distance of a code, and thereby increase the error correction capability, it is necessary to increase the number of check bits.
An error correction scheme employed in U.S. Pat. No. 4,817,091 to Katzman, et al., is a typical example of an error correction code. In this exemplary system, a 16-bit data field is protected by a 6-bit check field. The encoding scheme used is a modified Hamming code of distance four, wherein each data bit is protected by three check bits. That is, an error in a data bit causes three of the six check bits to change state. When the syndrome is computed by comparing the stored check bits against the check bits regenerated from the data, the syndrome has odd parity. This indicates that a single bit error has occurred, and the bit position of the error may be found easily by consulting the error code generation table. The check bits apply across the 16-bit word, and both data bits and check bits are located on the same physical memory array.
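A code of this general kind can be sketched in Python as follows. The sketch assumes that each of the 16 data bits is assigned a distinct pattern of three check bits; the particular assignment in `DATA_COLUMNS` is hypothetical and is not the code generation table of the cited patent, and the names `check_bits` and `classify` are likewise illustrative.

```python
from itertools import combinations

CHECK_BITS = 6
DATA_BITS = 16

# Hypothetical assignment: the first 16 of the 20 possible patterns that set
# exactly three of the six check bits, one pattern per data bit.
DATA_COLUMNS = [
    sum(1 << i for i in combo)
    for combo in combinations(range(CHECK_BITS), 3)
][:DATA_BITS]


def check_bits(data):
    """Compute the 6-bit check field for a 16-bit data word."""
    check = 0
    for bit in range(DATA_BITS):
        if (data >> bit) & 1:
            check ^= DATA_COLUMNS[bit]
    return check


def classify(data, stored_check):
    """Compare regenerated and stored check bits and interpret the syndrome."""
    s = check_bits(data) ^ stored_check
    if s == 0:
        return ("ok", None)
    weight = bin(s).count("1")
    if weight % 2 == 0:
        return ("double_error", None)  # even parity: detected, not correctable
    if weight == 1:
        return ("check_bit_error", s.bit_length() - 1)
    if s in DATA_COLUMNS:
        return ("data_bit_error", DATA_COLUMNS.index(s))  # table lookup
    return ("uncorrectable", None)  # odd syndrome matching no single bit
```

Because every single-bit error yields an odd-weight syndrome and every double-bit error yields an even-weight one, the parity of the syndrome alone separates the correctable case from the detect-only case.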
It is possible to perform a second level of ECC in a data processing system. U.S. Pat. No. 4,745,604 to Patel, et al., teaches a two-level ECC that is used for data stored on a disk drive. Data is divided into subblocks and each subblock is assigned a first level ECC. In addition, a second-level ECC is defined for the entire block, including the subblocks and the first-level ECC bits. This method requires extra time for computation of the second-level ECC, since each level is computed sequentially.
When designing a fault-tolerant memory system, it is desirable to consider the effects of word size, error correction capability, random access memory (RAM) failure modes and the Mean Time Between Failures (MTBF). An analysis of MTBFs for different word sizes shows an inverse relationship between word size and MTBF. Larger word sizes result in the storage or transmission of a greater number of bits, increasing the probability that an error will occur in at least one bit-position of the word. If the number of errors which are to be corrected is increased from one to two, more check bits are needed for the same number of information bits. This increase in the volume of data stored (or transferred) may actually decrease the MTBF.
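The effect of word size can be illustrated with a simplified model. The Python sketch below assumes independent bit errors with a per-bit probability `p`, which is a hypothetical parameter rather than a measured failure rate, and uses the per-word probability of an uncorrectable error as a rough proxy for MTBF. The check bit count assumes a single-error-correcting, double-error-detecting extended Hamming code.

```python
def secded_check_bits(k):
    """Smallest r with 2**(r - 1) >= k + r, for k information bits."""
    r = 2
    while 2 ** (r - 1) < k + r:
        r += 1
    return r


def p_uncorrectable(n, p):
    """Probability that an n-bit word holds two or more bit errors."""
    p_none = (1 - p) ** n
    p_one = n * p * (1 - p) ** (n - 1)
    return 1 - p_none - p_one


def word_failure_probability(k, p):
    """Uncorrectable-error probability for k data bits plus check bits."""
    return p_uncorrectable(k + secded_check_bits(k), p)
```

Under these assumptions, `word_failure_probability` grows as `k` goes from 16 to 32 to 64, which is consistent with the inverse relationship between word size and MTBF described above.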
Field studies of the failure modes of dynamic random access memories (DRAMs) show that failures of full integrated circuits (ICs) have a substantial influence on MTBF. Single bit failures occur more frequently than full IC failures, but the failure of an isolated bit is easily corrected by a single-bit error correcting code. The failure of a full IC typically results in a greater number of incorrect bits than ECC methods can correct. Because of the number of bits in a DRAM, the failure of an entire IC accounts for many system failures.
U.S. Pat. No. 4,747,080 to Yamada relates to a semiconductor memory having a self-correction function. The memory array has redundant data cells. Both horizontal and vertical error checking are performed. The individual memory cells are arranged in groups such that no two cells in a group have the same horizontal or the same vertical parity bit. When both the horizontal and vertical parity checks have been performed, an erroneous cell can be located at the intersection of the horizontal and vertical parity check values. When an erroneous cell is located, it is replaced by one of the spare cells. Although a system employing self-correcting memory arrays is protected against single bit failure, the failure of the entire IC cannot be easily corrected in this system.
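The principle of locating a failed cell at the intersection of two parity checks can be sketched in Python. The sketch assumes a simple rectangular arrangement with one parity bit per row and per column, which is a simplification and not the cell grouping of the cited patent; the function names are hypothetical.

```python
def parities(grid):
    """Return (row parities, column parities) for a grid of 0/1 cells."""
    rows = [sum(row) % 2 for row in grid]
    cols = [sum(col) % 2 for col in zip(*grid)]
    return rows, cols


def locate_error(grid, stored_rows, stored_cols):
    """Return (row, column) of a single bad cell, or None if all checks pass."""
    rows, cols = parities(grid)
    bad_rows = [i for i, (a, b) in enumerate(zip(rows, stored_rows)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(cols, stored_cols)) if a != b]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return (bad_rows[0], bad_cols[0])
    raise ValueError("error pattern is not a single-cell failure")
```

A single failed cell disturbs exactly one row check and one column check, so its location is unambiguous. When many cells fail at once, as in the failure of an entire IC, the checks no longer point to a single intersection.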