This invention relates to fault detection in computer systems and, more particularly, to a method and apparatus for detecting memory failures using error detecting/correcting codes.
Many techniques are known for detecting and/or correcting data errors which occur in computer systems. Such errors typically occur because of hardware failure or because of external influences such as electromagnetic radiation, subatomic particles, etc. The errors ordinarily appear as inverted data bits within the datum. That is, a bit which should be a "0" appears as a "1", and vice versa. A common error detection technique includes the use of a parity bit embedded within the datum. To determine the value of the parity bit, the number of "1"s in the datum (not including the parity bit) is counted, and it is determined whether the count is even or odd. In even parity systems, the total number of "1"s in the datum (including the parity bit) must always be even. Thus, if the number of "1"s in the datum, excluding the parity bit, is odd, then the parity bit is set to "1", whereas if that number is even, then the parity bit is set to "0". In odd parity systems, the parity bit is set to ensure that the total number of "1"s in the datum, including the parity bit, is odd. Of course, this simple parity bit scheme cannot detect an even number of errors, so a datum with 2, 4, 6, etc., errors will appear error-free. Furthermore, even when an error is detected, the location of the erroneous bit is unknown, so the error cannot be corrected.
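By way of illustration only, the even parity scheme described above may be sketched as follows. The function names and the 8-bit example datum are assumptions chosen for the sketch and do not appear in any cited reference.

```python
def even_parity_bit(data_bits):
    """Return the parity bit that makes the total number of 1s even."""
    return sum(data_bits) % 2


def check_even_parity(word):
    """Return True if the word (data bits plus parity bit) holds an even number of 1s."""
    return sum(word) % 2 == 0


# Hypothetical 8-bit datum containing four 1s.
data = [1, 0, 1, 1, 0, 0, 1, 0]
word = data + [even_parity_bit(data)]

# One inverted bit changes the count of 1s from even to odd.
one_error = word.copy()
one_error[2] ^= 1

# Two inverted bits restore an even count, so the check passes again.
two_errors = word.copy()
two_errors[2] ^= 1
two_errors[5] ^= 1

print(check_even_parity(word), check_even_parity(one_error), check_even_parity(two_errors))
```

The single inverted bit fails the check, whereas the pair of inverted bits passes it, which is the limitation noted above.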
More advanced systems are available which locate the position of single bit errors and which detect the existence of multiple bit errors. See, for example, R. W. Hamming, "Error Detecting and Error Correcting Codes," Bell System Tech. J. 29, 147 (1950); W. W. Peterson, Error Correcting Codes, MIT Press, Cambridge, Mass., 1961; and M. Y. Hsiao, "A Class of Optimal Minimum Odd-Weight-Column SEC-DED Codes," IBM J. Res. Develop., July 1970, p. 395. In a typical system employing one of these codes, the datum is encoded to produce a plurality of check bits (e.g., 8 bits) which are appended to the datum. The value of each check bit is determined from a unique subset of bits of the datum in the same manner described for the calculation of parity bits. When the datum is later read, the subsets are combined with their respective check bits to produce a plurality of bits termed a "syndrome." If the syndrome bits are all zero, the datum is presumed to be free of errors. However, if any of the syndrome bits is non-zero, then an error exists somewhere in the datum. The value of the syndrome bits may be used for locating and correcting single bit errors.
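By way of illustration only, the check bit and syndrome calculation may be sketched with a small single error correcting Hamming code having 4 data bits and 3 check bits. This scaled-down code, the bit positions, and the function names are assumptions made for the sketch; the systems described above use wider data and more check bits.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # 1-based positions holding data bits
CHECK_POSITIONS = (1, 2, 4)      # 1-based positions holding check bits


def syndrome(codeword):
    """XOR together the 1-based positions of all bits that are set."""
    s = 0
    for position, bit in enumerate(codeword, start=1):
        if bit:
            s ^= position
    return s


def encode(data_bits):
    """Place 4 data bits, then set each check bit to the parity of its subset."""
    codeword = [0] * 7
    for position, bit in zip(DATA_POSITIONS, data_bits):
        codeword[position - 1] = bit
    s = syndrome(codeword)
    for k, position in enumerate(CHECK_POSITIONS):
        codeword[position - 1] = (s >> k) & 1
    return codeword


def correct(codeword):
    """Return (corrected codeword, syndrome); a non-zero syndrome names the bad position."""
    s = syndrome(codeword)
    fixed = codeword.copy()
    if s:
        fixed[s - 1] ^= 1
    return fixed, s


stored = encode([1, 0, 1, 1])
read_back = stored.copy()
read_back[4] ^= 1                # invert the bit at position 5
repaired, s = correct(read_back)
print(s, repaired == stored)
```

In this sketch a syndrome of zero indicates that no error was found, and a non-zero syndrome equals the position of the single inverted bit.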
An example of such a system constructed for a 64-bit datum is shown in the Hsiao publication noted above. In this system, which employs what are referred to as odd-weight-column, single error correcting and double error detecting (SEC-DED) codes, different subsets of the 64 data bits are used to generate 8 check bits. The system can detect single and double bit errors, and the check bits may be used to correct single bit errors. Unfortunately, the system disclosed can mistake three-bit errors for single bit errors, and some four-bit errors go undetected.
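By way of illustration only, these limitations may be reproduced with a scaled-down odd-weight-column code having 4 data bits and 4 check bits. The column values below are hypothetical and are not the matrix of the Hsiao publication; they are assumed solely so that the example remains small. Because the code is linear, the syndrome depends only on which bits are in error, so the sketch operates on error positions directly.

```python
# Hypothetical parity-check columns, one per bit of the stored word,
# each written as a 4-bit integer. Every column has odd weight.
COLUMNS = [
    0b0111, 0b1011, 0b1101, 0b1110,   # positions 0-3: data bits (weight 3)
    0b0001, 0b0010, 0b0100, 0b1000,   # positions 4-7: check bits (weight 1)
]


def syndrome_of(error_positions):
    """XOR the columns of all bits in error to obtain the syndrome."""
    s = 0
    for position in error_positions:
        s ^= COLUMNS[position]
    return s


def classify(s):
    """Interpret a syndrome the way a SEC-DED decoder would."""
    if s == 0:
        return "no error"
    if s in COLUMNS:
        return "single-bit error at position %d" % COLUMNS.index(s)
    return "uncorrectable error"


single = classify(syndrome_of([2]))
double = classify(syndrome_of([0, 5]))
triple = classify(syndrome_of([0, 1, 2]))
quadruple = classify(syndrome_of([0, 4, 5, 6]))
print(single, double, triple, quadruple, sep="\n")
```

In this sketch the three-bit error yields the syndrome of the check bit at position 4 and would therefore be miscorrected, while the chosen four-bit error yields a zero syndrome and passes unnoticed.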
When a memory system is constructed using 1-bit-wide memory modules, the most likely errors to occur are single bit errors, since each module contributes only one bit to the datum and therefore to the check bit calculation. This simplifies error correction. However, with the advent of four-bit-wide memory modules (e.g., DRAMs), a complete failure of a memory module appears as a 1-, 2-, 3- or 4-bit error in which the bits in error are all in the same four-bit nibble of the datum. Thus, memory system integrity would be greatly improved if there were a way to detect 3- and 4-bit errors within any nibble of the datum, while still detecting all single and double bit errors and correcting single bit errors.
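By way of illustration only, the error patterns which a failed four-bit-wide module can produce may be enumerated as follows. The bit numbering, in which nibble i occupies bits 4i through 4i+3 of the datum, and the function name are assumptions made for the sketch.

```python
from collections import Counter

NIBBLE_WIDTH = 4


def nibble_error_masks(nibble_index):
    """Return every non-zero error mask confined to one four-bit nibble."""
    shift = nibble_index * NIBBLE_WIDTH
    return [pattern << shift for pattern in range(1, 1 << NIBBLE_WIDTH)]


# Hypothetical failure of the module supplying bits 12 through 15.
masks = nibble_error_masks(3)

# Group the patterns by the number of bits in error.
by_weight = Counter(bin(mask).count("1") for mask in masks)
print(len(masks), dict(by_weight))
```

Of the 15 possible patterns in a nibble, 4 are single bit errors, 6 are double bit errors, 4 are three-bit errors and 1 is a four-bit error, so 5 of the 15 fall outside what a SEC-DED code is assured of detecting.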