The present invention relates to the correction of errors in stored data using a single error correcting, double error detecting (SEC/DED) code and, more particularly, to correcting a double error consisting of one fixed error and one transitory error using an SEC/DED code.
Errors in data stored in memories can be said to have two major causes. One cause is the intrinsic physical failure of memory components, which usually produces a permanent or "hard" error. The other cause is non-destructive environmental phenomena, which cause errors that are for the most part transient, intermittent or "soft" errors. In present day monolithic memory chips, soft errors due to external or environmental interference are more prevalent than hard errors resulting from component failures. Laboratory evidence shows that the alpha-particle induced failure rate alone is one to two orders of magnitude higher than the memory chip's basic intrinsic failure rate. Furthermore, alpha-induced failure rates usually are higher at higher chip densities, indicating that soft errors will be of even greater significance in the future.
Error correcting codes have traditionally been used to improve reliability in the storage area of computers. In the semiconductor memory area, the most widely used codes are single error correcting, double error detecting (SEC/DED) codes. This class of codes is most effective in memory systems organized on a one bit per card, chip or module basis. With the one bit per card, chip or module organization, multiple errors caused by the failure of a card, chip or module will only cause a single bit error in any encoded data word. The mapping of component failures into single bit errors has resulted in the use of delayed system maintenance procedures that allow component failures to accumulate to some threshold level before the components are repaired or replaced. This is done under the assumption that the errors caused by the component failures will be corrected by the SEC/DED error correction system. However, with the high levels of soft errors found in modern semiconductor memories, these system maintenance strategies cause a high uncorrectable error (UE) rate as a result of double errors caused by the lining up of a hard error from a component failure with a soft error resulting from alpha particle bombardment. To reduce the number of uncorrectable errors, maintenance strategies could be altered so that fewer component failures are allowed to accumulate. This, of course, causes higher replacement rates for storage components, and thus higher parts and service costs. Alternatively, double error correcting (DEC) codes could be substituted for the SEC/DED codes to correct multi-bit errors resulting from the lining up of a hard error with a soft error. However, this would require greater redundancy in the encoded data words and more complex encoding and decoding circuits. Therefore, both of these solutions exact a heavy overhead burden and should not be resorted to unless absolutely necessary.
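The lining up of a hard error with a soft error can be illustrated with a minimal sketch. The sketch assumes a toy extended Hamming (8,4) code standing in for a production SEC/DED code; the function names `encode` and `decode`, the bit layout and the chosen error positions are hypothetical and are not taken from any particular memory system.

```python
# Toy SEC/DED code: extended Hamming (8,4), assumed here for illustration only.
# Position 0 holds the overall parity bit; positions 1, 2 and 4 hold check bits;
# positions 3, 5, 6 and 7 hold the four data bits.

def encode(data4):
    """Encode 4 data bits into an 8-bit SEC/DED word."""
    word = [0] * 8
    for pos, bit in zip((3, 5, 6, 7), data4):
        word[pos] = bit
    for p in (1, 2, 4):
        # Check bit p covers every position whose index has bit p set.
        word[p] = sum(word[i] for i in range(1, 8) if i & p) % 2
    word[0] = sum(word) % 2  # makes the parity of the whole word even
    return word

def decode(word):
    """Return (status, word), correcting a single error if one is present."""
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        return "ok", list(word)
    if overall == 1:
        # Odd overall parity: treated as a single error at position `syndrome`
        # (syndrome 0 means the overall parity bit itself is wrong).
        fixed = list(word)
        fixed[syndrome] ^= 1
        return "corrected", fixed
    # Even overall parity with a nonzero syndrome: double error, detected only.
    return "uncorrectable", list(word)

word = encode([1, 0, 1, 1])

hard_only = list(word)
hard_only[5] ^= 1                 # hard error from a failed component

hard_and_soft = list(hard_only)
hard_and_soft[2] ^= 1             # soft error lining up in the same word

print(decode(hard_only)[0])       # corrected
print(decode(hard_and_soft)[0])   # uncorrectable
```

In this sketch the hard error alone is corrected on every read, which is the assumption behind delayed maintenance, while the same word becomes an uncorrectable error as soon as one soft error is added to it.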
In the past, single error correcting, double error detecting codes have been used to correct more than one error. For instance, U.S. Pat. No. 3,656,107 suggests that bits in the data word with the double error be changed one after another until the syndrome indicates that the double error is eliminated. Then a new syndrome is generated to locate the single remaining error. In addition, U.S. Pat. No. 4,139,148 describes correcting double bit errors using an SEC/DED code by saving previously occurring single error syndromes and using them to correct one error bit when a double bit error occurs in the same word in storage. Furthermore, an error correction scheme has been suggested which involves saving the data word and syndrome when the data word contains too many errors for the error correcting code to correct. The memory location where the data word was stored is then checked using ancillary correction means to locate stuck bits in the data word location. A syndrome is thereafter generated for a data word with stuck bits in the positions located by the ancillary process. If this syndrome matches the stored syndrome, the word is corrected by inverting the data bits in the stored data word at the stuck bit positions. If it does not, the suggestion is that the actual bits in error among the faulty bits can be determined by "iterative partitioning" of the faulty bits and determining the error syndrome for each portion.
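The syndrome matching step of the last scheme described above can be sketched as follows. This is a loose illustration only: it assumes a toy extended Hamming (8,4) code, it takes the stuck bit positions as a given list in place of the ancillary correction means, and the names `syndrome`, `error_syndrome` and `correct_with_stuck_bits` are hypothetical. The "iterative partitioning" fallback is not implemented.

```python
# Toy extended Hamming (8,4) layout, assumed for illustration: position 0 is
# the overall parity bit and positions 1..7 form a Hamming (7,4) word.

def syndrome(word):
    """Return (position syndrome, overall parity) of an 8-bit word."""
    s = 0
    for i in range(1, 8):
        if word[i]:
            s ^= i
    return s, sum(word) % 2

def error_syndrome(positions):
    """Syndrome that errors at exactly these positions would produce."""
    pattern = [0] * 8
    for p in positions:
        pattern[p] = 1
    return syndrome(pattern)

def correct_with_stuck_bits(stored_word, stuck_positions):
    """Invert the stuck bits if their syndrome matches the saved syndrome."""
    saved = syndrome(stored_word)          # syndrome saved with the data word
    if saved == error_syndrome(stuck_positions):
        fixed = list(stored_word)
        for p in stuck_positions:
            fixed[p] ^= 1
        return fixed
    # No match: only some of the stuck bits are in error. The prior art
    # suggests "iterative partitioning" here, which this sketch omits.
    return None

CODEWORD = [0, 0, 1, 1, 0, 0, 1, 1]        # a valid word of the toy code

stored = list(CODEWORD)
stored[2] ^= 1                             # stuck bit in error
stored[5] ^= 1                             # second stuck bit in error

print(correct_with_stuck_bits(stored, [2, 5]) == CODEWORD)   # True
```

The match works because the code is linear: the syndrome of the stored word equals the syndrome of its error pattern alone, so it can be compared directly against the syndrome computed from the stuck bit positions.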