As computer memories have grown in size and individual memory cells have been further miniaturized, bit errors in stored data have come to occur at an unacceptable rate. An occasional error can no longer be allowed to stop a program or to require replacement of a memory chip. These bit errors are of two general types, soft errors and hard errors. A soft error is a seemingly random inversion of stored data. This inversion is caused by occasional bursts of electrical noise and, in some cases, by atomic particles, the so-called alpha particle upset. The soft error problem has increased as individual cell sizes have been reduced, so that even relatively low-power noise can disturb a cell. A hard error, in contrast, represents a permanent electrical failure of the memory chip, often restricted to particular memory locations but sometimes associated with peripheral circuitry of the chip so that the entire chip is affected. Naturally, designers of memory chips have striven to reduce the occurrence of both hard and soft errors in their chips. However, neither type of error has been completely eliminated and, indeed, it is not believed that they can be eliminated. Reliability beyond a certain point can be bought only at the expense of reduced performance or increased cost.
An alternative approach to both hard and soft errors has been the implementation of error correction codes (ECC) in large computer memories. The fundamentals of error detecting and correcting are described by R. W. Hamming in a technical article entitled "Error Detecting and Error Correcting Codes" appearing in the Bell System Technical Journal, Volume 29, No. 2, 1950 at pages 147-160. In one of the most popular Hamming codes, an 8-bit data word is encoded to a 13-bit word according to a selected Hamming code. A decoder can process the 13-bit word, correct any 1-bit error in the 13 bits, and detect any 2-bit error. The described code is thus classified as SEC/DED (single error correct/double error detect). The use of such codes has been particularly efficient for memory chips having single-bit outputs. For instance, if a relatively simple computer were to have 16K (16,384) bytes of data where each byte contains 8 data bits, then an efficient error-protected design would use thirteen 16K×1 memory chips, with the extra five 16K chips providing Hamming SEC/DED protection. The Hamming code can correct only a single random error occurring in any byte, but it can also correct for any one failed 16K memory chip, since any one memory chip contributes only 1 bit to each error-protected word.
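The 8-bit to 13-bit SEC/DED encoding described above can be illustrated with a minimal sketch. The bit layout is an assumption made for illustration (a 12-bit Hamming code with check bits at positions 1, 2, 4 and 8, plus an overall parity bit at index 0); the source does not specify which Hamming code is meant, and the names `encode` and `decode` are hypothetical.

```python
# Illustrative layout (an assumption): index 0 holds an overall parity bit,
# indices 1-12 form a Hamming code with check bits at positions 1, 2, 4, 8.
DATA_POSITIONS = [3, 5, 6, 7, 9, 10, 11, 12]
CHECK_POSITIONS = [1, 2, 4, 8]


def encode(data):
    """Encode an 8-bit integer as a list of 13 bits."""
    code = [0] * 13
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (data >> i) & 1
    for check in CHECK_POSITIONS:
        # Each check bit gives even parity over the positions whose index
        # has that bit set.
        code[check] = sum(
            code[j] for j in range(1, 13) if j & check and j != check
        ) % 2
    code[0] = sum(code[1:]) % 2  # overall parity enables double-error detection
    return code


def decode(code):
    """Return (status, data); data is None when the word cannot be corrected."""
    syndrome = 0
    for j in range(1, 13):
        if code[j]:
            syndrome ^= j  # XOR of the indices of all set bits
    overall_odd = sum(code) % 2 == 1
    word = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome <= 12:
        # Odd overall parity: a single error at index `syndrome`
        # (index 0 is the overall parity bit itself).
        word[syndrome] ^= 1
        status = "corrected"
    elif overall_odd:
        return "uncorrectable", None  # three or more errors
    else:
        return "double", None  # even parity but nonzero syndrome
    data = sum(word[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return status, data


word = encode(0xA7)
word[6] ^= 1
print(decode(word))   # a single flipped bit is corrected
word[11] ^= 1
print(decode(word))   # a second flipped bit is detected but not corrected
```

In the thirteen-chip arrangement, each list index would correspond to one 16K×1 chip, so a chip that fails completely appears as at most one bad bit per word.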
Of course, the described 13-bit Hamming code can correct only one error, whether it be a hard error or a soft error. As a result, if one memory chip has suffered a hard failure in all its locations, then the remaining chips are not protected against an occasional soft error, which could be detected but not corrected. For this and other reasons, more elaborate error correcting codes have been developed and implemented. As a general rule, the more errors that can be corrected in a word, the more extra bits are required for the check code.
Hamming codes and similar codes are thus useful for correcting hard errors when each bit of a data word is stored in a different memory chip. However, the trend has been toward memory chips of ever increasing density. Single memory chips having 1 megabit of capacity, or even 4 megabits, will soon become commercially available. However, many systems do not require the 1 megabyte or 4 megabytes of storage that would result if these larger chips were used to store only a single bit of every data word. As a result, it is anticipated that many systems will use the larger memory chips to store multiple bits of the same data word. For instance, a 1-megabit memory chip can be easily adapted to have 4 data ports, each simultaneously accessible. This chip would then be properly designated as a 256K×4 memory. Eight of these 1-megabit chips would then provide 256K of storage for 32-bit words. Error protection will likely need to be provided for such a large memory based upon such a dense memory chip. Soft errors, of themselves, do not present much of a problem because of their random occurrence in one or a few memory locations. Hard errors, however, present a much more difficult problem for multi-bit memory chips. The problem is that a hard error is often not restricted to a single-bit output port but affects all the bits associated with the memory package. In the 256K×4 package described above, this means that a hard error is likely to produce four simultaneous errors in the same word. Error correction codes are available for handling this large number of errors and, indeed, error correction codes can be developed for almost any number of errors in a word. However, such codes require a large number of extra bits to perform such large-scale correcting.
Recently, a better procedure has been developed for dealing with hard errors in multi-bit packages. These errors will be referred to as package errors and the error correction codes designed specifically for package errors will be called package codes. These codes rely upon the fact that the multiple hard errors do not randomly occur across the entire field of the data word. Instead, the multiple errors are confined to a sub-field of the data word, defined by the outputs of the package. In the context of the previously described example, such a code cannot correct any four errors occurring in the 32-bit word. However, the code can correct four-bit errors that occur in any one of the eight 4-bit sub-fields.
Nonetheless, even such codes are not completely satisfactory. If the code is an SPC/DPD (single package correct/double package detect) code, then the code can correct any errors that occur in only one package and can detect, but not correct, errors occurring in two packages. Thus, if one package has suffered a hard failure, the occurrence of any additional errors, either hard or soft, in the remaining packages means that the error condition can be detected but the errors cannot be corrected. The existence of one hard failure is therefore the effective limit of correction provided by an SPC/DPD code.
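The distinction between bit errors and package errors can be made concrete with a short sketch. It does not implement a package code. It only classifies a 32-bit error pattern by how many of the eight 4-bit sub-fields it touches and reports what an SPC/DPD code is stated to guarantee in each case. The function names and the assignment of bits 4p to 4p+3 to package p are assumptions for illustration.

```python
def packages_hit(error_pattern, package_bits=4, packages=8):
    """Return the indices of the packages that contain at least one bad bit."""
    mask = (1 << package_bits) - 1
    return [
        p for p in range(packages)
        if (error_pattern >> (p * package_bits)) & mask
    ]


def spc_dpd_outcome(error_pattern):
    """Map an error pattern to the guarantee stated for an SPC/DPD code."""
    count = len(packages_hit(error_pattern))
    if count == 0:
        return "no error"
    if count == 1:
        return "correctable"        # any 1 to 4 bad bits inside one package
    if count == 2:
        return "detectable only"    # e.g. a hard failure plus one soft error
    return "beyond code guarantee"


print(spc_dpd_outcome(0x0000F000))  # four errors confined to one package
print(spc_dpd_outcome(0x0000F001))  # failed package plus a soft error elsewhere
print(spc_dpd_outcome(0x00001111))  # four errors spread over four packages
```

The second call is the situation the paragraph above describes: once one package has failed completely, a single additional error in any other package moves the word from correctable to detectable only.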
Woo in U.S. Pat. No. 3,449,718 has proposed a novel method of correcting errors, particularly applicable to magnetic tapes. Magnetic tapes have parallel tracks with the parallel locations defining a byte. One of the tracks or parallel bits is dedicated to a check bit that records the parity of the remaining bits of the byte. The bytes are further arranged on the tape in fairly long blocks. At the end of a block, there are several additional recorded bytes that provide a block check. That is, the bits of the block check must be consistent with the previous data and check bits. The parity check bits can only detect, but not correct, a single error. In Woo's apparatus the assumption is made that all errors within a block occur in a single track, an assumption that is likely to hold for magnetic tape recording. Woo first reads the tape with the assumption that the errors are occurring in the first track. He reads the tape, byte by byte, and checks for parity errors. If a parity error is indicated within a particular byte, the bit in the first track is inverted. The data, with possible inversions, is then subjected to comparison with the block check bits. If the initial assumption was correct, the data has been corrected and the block check bits indicate correct data. However, if the initial assumption was incorrect, then the assumption is changed to one that all errors are occurring in the second track. The data is then reread from the tape with parity errors causing inversion in the second track. The process is repeated until agreement with the block check bits is obtained or until the tape has been read with each track in turn assumed to contain the errors. An example of package codes for detecting package errors is disclosed by Kaneda et al. in a technical article entitled "Single Byte Error Correcting--Double Byte Error Detecting Codes for Memory Systems" appearing in IEEE Transactions on Computers, vol. C-31, No. 7, July 1982 at pp. 596-602.
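The track-by-track retry attributed to Woo above can be sketched as follows. Several details are assumptions made for illustration and are not taken from the Woo patent: nine tracks with the last one carrying even parity, a CRC-32 from Python's `zlib` module standing in for the block check bytes, and an in-memory list of rows standing in for rereading the tape. All function names are hypothetical.

```python
import zlib

TRACKS = 9  # assumed layout: tracks 0-7 hold data, track 8 holds even parity


def make_row(byte):
    """Spread one byte across the data tracks and append its parity bit."""
    bits = [(byte >> i) & 1 for i in range(8)]
    return bits + [sum(bits) % 2]


def block_check(rows):
    # Stand-in block check (an assumption): CRC-32 over every bit of every row.
    return zlib.crc32(bytes(bit for row in rows for bit in row))


def correct_block(rows, check):
    """Try each track in turn as the assumed location of all errors."""
    for track in range(TRACKS):
        trial = [list(row) for row in rows]   # models rereading the block
        for row in trial:
            if sum(row) % 2:                  # parity error in this byte
                row[track] ^= 1               # invert the assumed track
        if block_check(trial) == check:
            return track, trial               # assumption confirmed
    return None, None                         # no single track explains it


original = [make_row(b) for b in b"HAMMING!"]
check = block_check(original)

damaged = [list(row) for row in original]
for r in (1, 6):
    damaged[r][5] ^= 1                        # two errors, both in track 5

track, repaired = correct_block(damaged, check)
print(track, repaired == original)
```

In this sketch an undamaged block is accepted on the first trial, since no byte shows a parity error and nothing is inverted.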
Further examples of package codes and their hardware implementation are disclosed by Bossen in U.S. Pat. No. 3,629,824 and by Chen, one of the present inventors, in U.S. Pat. No. 4,464,753. Closely related to Chen's patent is U.S. Pat. No. 4,319,357 to Bossen, which discloses the use of an SEC/DED code to correct double errors, one of which is a hard failure. The storage location of the erroneous word is checked for a stuck bit. A new syndrome is calculated and compared with the syndrome of the erroneous word to locate the transitory error.
Hsiao et al., in a technical article entitled "Double-Error Correction" appearing in the IBM Technical Disclosure Bulletin, Vol. 14, No. 4, September 1971 at p. 1342, disclose a method of using a single error correcting (SEC) code to correct a double error by consecutively inverting bits until only a single error remains, which can be corrected by the code. However, this technique provides no localization of the errors. Furthermore, the inversion of a correct bit produces three errors, which the code must unambiguously detect as an error.
Scheuneman et al. in U.S. Pat. No. 4,139,148 disclose a method of correcting double errors with a single error correcting (SEC) code. Whenever a single error is detected, the syndrome bits used to correct the error are stored. If, subsequently, the same word produces a detection of a double error, the previously stored syndrome is used to correct one of the errors. Bossen et al. in a technical article entitled "A System Solution to the Memory Soft Error Problem" appearing in IBM Journal of Research and Development, vol. 24, No. 3, May 1980 at pp. 390-397 discuss the problem of simultaneous hard and soft errors. One solution, applicable to a subclass of errors, is to complement the read data, store the complement in the original memory location, reread the complemented data and recomplement it. In some but not all combinations of hard failures and data, this complement-and-retry technique overcomes these hard failures.
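The complement-and-retry sequence just described can be illustrated with a small simulation. The `StuckCell` class is a hypothetical model of one 8-bit memory location in which some bits are stuck at fixed values; it is not taken from any of the cited references. The sketch shows one case where the technique recovers the data and one where it does not.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


class StuckCell:
    """One memory word in which the bits in stuck_mask ignore writes."""

    def __init__(self, stuck_mask, stuck_values):
        self.stuck_mask = stuck_mask
        self.stuck_values = stuck_values & stuck_mask
        self.value = self.stuck_values

    def write(self, word):
        # Healthy bits take the written value; stuck bits keep their own.
        self.value = (word & ~self.stuck_mask & MASK) | self.stuck_values

    def read(self):
        return self.value


def complement_and_retry(cell):
    first = cell.read()            # read the (possibly wrong) data
    cell.write(~first & MASK)      # store its complement in the same location
    second = cell.read()           # reread the complemented data
    return ~second & MASK          # recomplement


data = 0b10110010

# Bit 0 is stuck at 1 while the data holds 0 there: a hard error on read.
bad = StuckCell(stuck_mask=0b00000001, stuck_values=0b00000001)
bad.write(data)
print(bad.read() == data, complement_and_retry(bad) == data)

# Bit 1 is stuck at 1 and the data also holds 1 there: no error on read,
# but the retry now returns the wrong value for that bit.
benign = StuckCell(stuck_mask=0b00000010, stuck_values=0b00000010)
benign.write(data)
print(benign.read() == data, complement_and_retry(benign) == data)
```

The second case is one reason the technique covers only some combinations of hard failures and data: a stuck bit that happened to agree with the stored data is turned into an error by the retry.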
Carter in U.S. Pat. No. 3,949,208 also describes an extended error correction technique utilizing complement and retry.