This invention relates generally to error correction. In particular, it relates to the correction of a number of hard errors beyond the unextended capability of the error correction code being used.
Error correcting codes (ECC) have been routinely used for fault tolerance in computer memory subsystems. The most commonly used codes are the single error correcting (SEC)-double error detecting (DED) codes capable of correcting all single errors and detecting all double errors in a code word. These SEC-DED codes are most effective in protecting memory data when the memory array chips are configured in one-bit-per-chip with respect to the ECC words.
As the size of computer memories has increased while the individual memory cells have become further miniaturized, the occurrence of bit errors in data stored in a memory has become unacceptable. No longer can an occasional error be allowed to cause a program to stop operating or require replacement of a memory chip. These bit errors are of two general types, soft errors and hard errors. A soft error is a seemingly random inversion of stored data. This inversion is caused by occasional bursts of electrical noise and, in some cases, by atomic particles, the so-called alpha particle upset. The soft error problem has increased as the individual cell sizes have been reduced so that noise levels represent relatively low amounts of power.
A hard error, in contrast, represents a permanent electrical failure of the memory chip, often restricted to particular memory locations but also sometimes associated with peripheral circuitry of the memory chip so that the entire chip can be affected. Naturally, designers of memory chips have strived to reduce the occurrence of both hard and soft errors in their chips. However, neither type of error has been completely eliminated and, indeed, it is not believed that they can be eliminated. Reliability beyond a certain point can be bought only at the expense of reduced performance or increased cost.
An alternative to the above solution for both hard and soft errors has been the implementation of ECC in large computer memories. The fundamentals of error detecting and correcting are described by R. W. Hamming in a technical article titled "Error detecting and error correcting codes" appearing in the Bell System Technical Journal, Volume 29, No. 2, 1950 at pages 147-160. In one of the most popular Hamming codes, an 8-bit data word is encoded to a 13-bit word according to a selected Hamming code. The described code is classified as SEC-DED. However, since a SEC-DED Hamming code can correct only a single random error (either soft or hard) occurring in any byte, more elaborate error correcting codes have been developed and implemented.
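The 8-bit to 13-bit SEC-DED encoding mentioned above can be illustrated with a minimal sketch. The bit layout used here (check bits at positions 1, 2, 4 and 8, an overall parity bit at index 0) is an assumption chosen for illustration and is not taken from any cited reference; the names `encode` and `decode` are likewise hypothetical.

```python
# Illustrative (13,8) extended Hamming code: 8 data bits, 4 Hamming check bits,
# 1 overall parity bit. The layout is an assumption made for this sketch.
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]   # positions that carry data bits
PARITY_POS = [1, 2, 4, 8]                # positions that carry check bits


def encode(data):
    """Encode an 8-bit integer as a list of 13 bits (index 0 = overall parity)."""
    word = [0] * 13
    for i, pos in enumerate(DATA_POS):
        word[pos] = (data >> i) & 1
    for p in PARITY_POS:
        # Even parity over every position whose index has bit p set.
        word[p] = sum(word[k] for k in range(1, 13) if k & p) % 2
    word[0] = sum(word[1:]) % 2          # overall parity enables double detection
    return word


def decode(word):
    """Return (status, data); status is 'ok', 'corrected', 'double' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for k in range(1, 13):
        if word[k]:
            syndrome ^= k                # XOR of the positions of all set bits
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome <= 12:
        word[syndrome] ^= 1              # single error; syndrome 0 means index 0
        status = "corrected"
    elif overall == 1:
        status = "uncorrectable"         # syndrome points outside the word
    else:
        status = "double"                # non-zero syndrome with even overall parity
    data = sum(word[pos] << i for i, pos in enumerate(DATA_POS))
    return status, data
```

Flipping any one of the 13 bits of `encode(0xA7)` yields `("corrected", 0xA7)` from `decode`, while flipping two bits yields a status of `"double"` with no correction attempted.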
In particular, a better procedure has been developed for dealing with hard errors in multi-bit packages. These errors are referred to as package errors and the error correction codes designed specifically for package errors will be called package codes. The codes rely upon the fact that multiple hard errors do not randomly occur across the entire field of the data word. Instead, multiple hard errors are confined to a sub-field of the data word, affecting up to all the bits associated with the memory package and defined by the outputs of the package. In the context of a 4M×4 memory chip, consisting of a 16 megabit memory chip adapted to have 4 data ports simultaneously accessible, such a code cannot correct any four errors occurring in 32 bits. However, the code can correct four-bit errors that occur in any one of eight 4-bit sub-fields.
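The distinction between four arbitrary errors in 32 bits and four errors confined to one 4-bit sub-field can be made concrete with a short sketch. The function names and the convention that sub-field i occupies bits 4i through 4i+3 are assumptions for illustration only.

```python
def packages_in_error(error_mask, n_bits=32, b=4):
    """Return the indices of the b-bit sub-fields that contain at least one error bit."""
    field = (1 << b) - 1
    return [i for i in range(n_bits // b) if (error_mask >> (i * b)) & field]


def is_single_package_error(error_mask, n_bits=32, b=4):
    """True when every error bit lies inside the same b-bit sub-field."""
    return len(packages_in_error(error_mask, n_bits, b)) == 1
```

For example, the error mask `0x000000F0` (four errors, all in sub-field 1) is a single package error, whereas `0x00001111` (four errors spread over sub-fields 0 to 3) is not and falls outside what a package code of this kind can correct.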
Nonetheless, even such codes are not completely satisfactory. If the code is a SPC-DPD (single package correct/double package detect) code, then the code can correct any errors that occur in only one package and can detect, but not correct, errors occurring in two packages. Thus if one package has suffered a hard failure, the occurrence of any additional errors, either soft or hard, in the remaining packages means that the error condition can be detected but the errors cannot be corrected. The existence of one hard failure is the effective limit of correction provided by a SPC-DPD code.
U.S. Pat. No. 4,661,955 ('955) discloses an extended error correcting device and method for SPC-DPD codes that is capable of correcting both a single soft error in one package and hard errors in another package. In the disclosed device and method, if the initial pass of the data through the error correction code indicates an uncorrected error, the data is complemented, written back to the memory, and then reread. The reread data is recomplemented and again passed through the error correction code. The complementing, storing, retrieving, recomplementing, and ECC of the data is known as a "complement/recomplement" (comp/recomp) or an "invert and retry" procedure. If an uncorrected error persists after the comp/recomp, then a bit-by-bit comparison is performed between the originally read data and the retrieved complemented data to isolate the hard failure in the memory. The bits in the sub-field associated with the hard failure are sequentially changed and each changed data word is passed through the error correction code. Wrong combinations are detected by the error correction code until the sub-field associated with the hard failure matches the originally stored data, at which point the error correction code can correct the remaining errors in the remaining sub-fields. However, the successive changes of the bits in the sub-field associated with the hard failure involve a long process of iterations. Moreover, this system has the disadvantage of requiring a long process of bit-by-bit comparisons between the originally read data and the retrieved complemented data, numerous compare circuits and latches, and a correcting sequence of non-fixed length, since the originally stored data in a sub-field associated with the hard failure can be any of the 16 different combinations.
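The complement/recomplement step and the bit-by-bit comparison can be sketched as follows. This is a simplified model under stated assumptions, not the circuit of the '955 patent: the `StuckMemory` class, its stuck-at fault model, and the function `invert_and_retry` are hypothetical, and the error correction pass itself is omitted.

```python
class StuckMemory:
    """Hypothetical one-word memory in which some bits are stuck at fixed values."""

    def __init__(self, width, stuck_mask, stuck_value):
        self.mask = (1 << width) - 1
        self.stuck_mask = stuck_mask & self.mask
        self.stuck_value = stuck_value & self.stuck_mask
        self.cell = 0

    def write(self, value):
        self.cell = value & self.mask

    def read(self):
        # Healthy bits return what was written; stuck bits return their fixed value.
        return (self.cell & ~self.stuck_mask & self.mask) | self.stuck_value


def invert_and_retry(mem):
    """Return (recomplemented data, mask of bits that behaved as hard failures)."""
    first = mem.read()                    # original read, possibly corrupted
    mem.write(~first & mem.mask)          # store the complement
    second = mem.read()                   # reread the complemented data
    recomplemented = ~second & mem.mask   # recomplement before the next ECC pass
    # A healthy bit differs between the two reads; a stuck bit reads the same twice.
    hard_fail_mask = ~(first ^ second) & mem.mask
    return recomplemented, hard_fail_mask
```

With a 32-bit word `0x12345678` written to a memory whose bits 4 to 7 are stuck at `0x5`, the first read returns `0x12345658`, and `invert_and_retry` returns the recomplemented word `0x123456A8` together with the hard-failure mask `0x000000F0`. Every stuck bit that was wrong in the first read is right in the recomplemented word and vice versa, which is why the comparison isolates the failing sub-field without revealing its originally stored contents.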
U.S. Pat. No. 4,961,193 ('193), like the aforementioned '955, describes an extended error correcting device and method for SPC-DPD codes that is capable of correcting both a single soft error in one package and hard errors in another package. However, unlike the aforementioned '955, the device and method described in '193 does not use a bit-by-bit method. In the '193 device and method, if the initial pass of the data through the error correction code indicates an uncorrected error, the syndrome of the data is stored and a complement/recomplement procedure is performed. If an uncorrected error persists after the comp/recomp procedure, the syndrome of the data is added to the syndrome of the complemented data. This sum is checked to see if it is a double package error (DPE) and, therefore, uncorrectable. If this sum is not a DPE, it is then matched to values in a table. Given the sum, the table provides the package to correct and the bits in error within the package. Using this data, the hard errors can be corrected. While this method avoids the use of a bit-by-bit process, it requires the use of a comp/recomp procedure and cannot be used with other processes known in the art, such as the read/write pattern test or the reference of data collected from past history.
In an embodiment of the present invention, a digital n-bit error correction-coded word includes a plurality of b-bit packages. The n-bit word is received from a data source having a faulty element, and an error correction code is performed on the n-bit word to correct a number of errors in the word. In addition to correcting the number of errors, the error correction code generates a syndrome for the word, and detects a number of errors in excess of the errors that it can correct. The position of the package in which the detected but uncorrected errors are located is then determined. Using the position of the package and the syndrome, an error pattern is determined. The errors in the n-bit word are then corrected using this pattern.
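The step of deriving an error pattern from the package position and the syndrome can be sketched on a toy code. Everything below is an assumption made for illustration: the 8-bit word with 2-bit packages, the parity-check columns in `H_COLS` (chosen only so that the columns within each package are linearly independent, not to form a real SPC-DPD code), and the function names. The sketch also handles only errors inside the identified package and assumes the package position is already known.

```python
B = 2  # bits per package in this toy example

# Hypothetical parity-check columns (4 check bits), one per code bit, B per package.
H_COLS = [0b0001, 0b0010,   # package 0
          0b0100, 0b1000,   # package 1
          0b0101, 0b1010,   # package 2
          0b0111, 0b1110]   # package 3


def syndrome(word_bits):
    """XOR together the parity-check columns of every bit that is set."""
    s = 0
    for bit, col in zip(word_bits, H_COLS):
        if bit:
            s ^= col
    return s


def pattern_in_package(synd, package):
    """Return the B-bit error pattern inside `package` that yields `synd`, else None."""
    cols = H_COLS[package * B:(package + 1) * B]
    for pattern in range(1, 1 << B):
        s = 0
        for j, col in enumerate(cols):
            if (pattern >> j) & 1:
                s ^= col
        if s == synd:
            return pattern
    return None


def correct(word_bits, package):
    """Flip the bits of `package` indicated by the syndrome-derived error pattern."""
    synd = syndrome(word_bits)
    if synd == 0:
        return list(word_bits)
    pattern = pattern_in_package(synd, package)
    if pattern is None:
        raise ValueError("syndrome is not explained by errors in this package")
    fixed = list(word_bits)
    for j in range(B):
        if (pattern >> j) & 1:
            fixed[package * B + j] ^= 1
    return fixed
```

In this toy code the word `[1, 0, 1, 0, 1, 0, 0, 0]` has a zero syndrome. If both bits of package 3 are inverted, the syndrome becomes `0b1001`; given package position 3, `pattern_in_package` returns the pattern `0b11` and `correct` restores the original word. Because the search is limited to the 2^b patterns of one package, the sequence has a fixed, short length.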