The complement/recomplement (C/R) type of error correction was disclosed in U.S. Pat. No. 3,949,208 entitled "Apparatus for Detecting and Correcting Errors in an Encoded Memory Word" to M. C. Carter and assigned to the same assignee as the subject application. The C/R technique has been used to augment the error correction capability of Hamming type ECCs (error correction codes) for data units stored in the memory of a computer system. The C/R technique has been used to correct one or more hard (H) errors in a data unit, leaving the ECC to correct any soft (S) error in the data unit.
A hard error is an error caused by a permanent fault in a circuit, such as a broken wire, which causes a bit position in memory to be stuck permanently in a given state, either the 1 or the 0 state. A soft error is usually caused by an alpha particle changing the 0 or 1 state of a circuit, wherein the soft error condition will not exist the next time other data is stored in that circuit. Thus a hard error remains permanently in the hardware while a soft error exists only in the single recording of a data unit. The C/R method corrects only the permanently stuck state of hard errors. The C/R method may be used with computer storage built with semiconductor dynamic random access memory (DRAM) chips.
The C/R method is initiated only after the ECC in a data unit finds an excessive error. Then the C/R process reads and complements (inverts) each read bit value in the data unit. Then the C/R process stores the inverted data unit back into the same bit locations in memory. When stored in their original locations, only the erroneous bits in hard error locations revert to their prior stuck states. All non-erroneous bits, and any erroneous bits with soft errors, will be inverted in relation to the stuck bits with hard errors which will not be inverted because of their stuck condition. A second fetch of the stored inverted data unit again inverts the read bits to correct all hard errors; and the ECC is then used only to correct any soft errors up to the maximum capability of the ECC. After this second inversion and at the end of the C/R process, the data unit is again stored in memory in its original location in its original erroneous form.
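The fetch, complement, store, fetch, recomplement sequence described above can be illustrated with a minimal sketch. The `StuckMemory` class, the 8-bit width, and the chosen stuck positions are assumptions made for illustration only and do not come from the referenced patent. The example uses stuck bits whose stuck state disagrees with the stored data, which is the case the text describes.

```python
class StuckMemory:
    """Hypothetical model of one memory location with permanently stuck bits."""

    def __init__(self, width, stuck):
        self.width = width
        self.stuck = dict(stuck)  # bit position -> stuck value (0 or 1)
        self.cells = 0
        self.store(0)

    def store(self, value):
        value &= (1 << self.width) - 1
        # A stuck position ignores the written value and keeps its stuck state.
        for pos, bit in self.stuck.items():
            if bit:
                value |= 1 << pos
            else:
                value &= ~(1 << pos)
        self.cells = value

    def fetch(self):
        return self.cells


def complement_recomplement(mem):
    """Return the C/R readout of the data unit held in mem."""
    mask = (1 << mem.width) - 1
    first = mem.fetch()            # first fetch: data unit with errors
    mem.store(~first & mask)       # store the complement in the same location
    second = mem.fetch()           # stuck bits have reverted to their stuck state
    mem.store(first)               # memory is left in its original erroneous form
    return ~second & mask          # recomplement: hard-error bits now read correctly


DATA = 0b10110010
mem = StuckMemory(width=8, stuck={1: 0, 6: 1})  # both stuck states disagree with DATA
mem.store(DATA)
print(f"{mem.fetch():08b}")                  # 11110000 (two hard errors)
print(f"{complement_recomplement(mem):08b}") # 10110010 (equals DATA)
```

If a soft error is injected by flipping a non-stuck bit of `mem.cells` after the store, the same routine returns the data with that one bit still wrong, which is the bit left for the ECC to correct.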
ECCs (error correcting codes) have been commonly used in DRAM (dynamic random access memory) storage by large computer systems, i.e. main storage (MS) and extended storage (ES). The most commonly used ECC has been the SEC/DED (single error correction/double error detection) type, which can detect, but cannot correct, double-bit errors in any data unit (DU) when the DU is stored or transferred. If a second bit error (an excessive error) is detected in a DU when using such SEC/DED type of ECC, the second erroneous bit cannot be corrected by the ECC. However, the second erroneous bit (the excessive error in a system using SEC/DED) often can be corrected by the C/R method for the transmission of the data, in which the C/R method can correct any number of hard errors (H), but only a single soft error (S) per DU can be corrected, by the ECC. Accordingly, the combination of the C/R method and ECC can correct during transmission any number of hard and soft errors in a data unit up to the error detection capability of the ECC.
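The SEC/DED behavior described above can be illustrated with a small extended Hamming code. This is a sketch under stated assumptions: the (8,4) code size, the bit layout, and the names `encode` and `decode` are chosen here for brevity and are not the code used in any particular memory system.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SEC/DED codeword (list of bits).

    Index 0 holds the overall parity bit; indexes 1..7 are Hamming positions,
    with check bits at 1, 2, 4 and data bits at 3, 5, 6, 7.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1          # makes the overall parity even
    return code


def decode(code):
    """Return (status, data); status is 'ok', 'corrected', or 'excessive'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of all 1 bits
    if sum(code) & 1:
        # Odd overall parity: assumed to be a single error at index `syndrome`
        # (syndrome 0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        # Even overall parity but nonzero syndrome: a double error,
        # which is detected but cannot be corrected.
        return "excessive", None
    else:
        status = "ok"
    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, data
```

With this layout, flipping any one bit of `encode(0b1011)` decodes as `("corrected", 0b1011)`, while flipping any two bits decodes as `("excessive", None)`, which corresponds to the excessive error condition that triggers the C/R method.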
It is the transient characteristic of soft errors that prevents the C/R method from correcting any soft errors in a data unit. It is the ECC which corrects the soft errors. Hence, the combined C/R and ECC (SEC/DED) methods are limited to correcting one soft error in the transmission of a data unit, and the occurrence of two soft errors (the S-S case) is uncorrectable.
Both the C/R and ECC methods are limited to correcting stored errors only during the transmission of a data unit. The hard or soft errors existing in the data unit in memory remain in the memory from which the transmission occurred. The C/R method can only correct the complemented (inverted) readout of a memory data unit which is stored with hard errors.
After the successful completion of the C/R process, the stored data unit remains with the same erroneous bits in memory, but the requestor receives a corrected data unit if the number of soft errors does not exceed the ECC capability. The C/R method provides complete error correction if there are no soft error bits. The C/R method also enables complete error correction, after it corrects all hard errors, if the soft error bits can be corrected by the ECC. No error correction is obtained if the number of soft errors exceeds the capability of the ECC. For example, two soft errors in a data unit (the S-S error case) are not correctable by the C/R method if the maximum ECC capability is one erroneous bit per data unit.
The C/R method is much slower than the ECC correction process alone, because the C/R method requires two additional fetches and two additional stores in memory. Accordingly, the C/R method is not invoked unless an excessive error is detected. For example, when using an SEC/DED type of ECC, at most two error bits per data unit can be detected. If only one error is detected (no excessive error exists), it is corrected by the ECC without initiating the C/R method.
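The invocation policy and its access cost can be sketched as follows. Everything in this example is an illustrative assumption: the `Memory` class with its access counters, the (8,4) extended Hamming code standing in for SEC/DED, and the `read` routine are hypothetical and are not taken from any actual system.

```python
class Memory:
    """Hypothetical 8-bit location with stuck bits and access counters."""

    def __init__(self, stuck=None):
        self.stuck = dict(stuck or {})   # position -> stuck value
        self.bits = [0] * 8
        self.fetches = self.stores = 0

    def store(self, bits):
        self.stores += 1
        self.bits = [self.stuck.get(i, b) for i, b in enumerate(bits)]

    def fetch(self):
        self.fetches += 1
        return list(self.bits)

    def upset(self, pos):
        self.bits[pos] ^= 1              # soft error: one transient bit flip


def encode(nibble):
    code = [0] * 8                       # index 0: overall parity; 1, 2, 4: check bits
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1
    return code


def decode(code):
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if sum(code) & 1:
        code[syndrome] ^= 1              # single error: correct it
    elif syndrome:
        return "excessive", None         # double error: detect only
    return "ok", code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)


def complement_recomplement(mem):
    first = mem.fetch()                  # additional fetch 1
    mem.store([b ^ 1 for b in first])    # additional store 1 (complement)
    second = mem.fetch()                 # additional fetch 2
    mem.store(first)                     # additional store 2 (original erroneous form)
    return [b ^ 1 for b in second]


def read(mem):
    """Return (data, path); C/R is invoked only on an excessive error."""
    status, data = decode(mem.fetch())
    if status != "excessive":
        return data, "ecc"
    status, data = decode(complement_recomplement(mem))
    if status == "excessive":
        return None, "uncorrectable"
    return data, "c/r+ecc"
```

In this sketch a data unit with one stuck bit is delivered by the `"ecc"` path after a single fetch, a data unit with two hard errors or with one hard and one soft error is delivered by the `"c/r+ecc"` path after three fetches and two stores, and a data unit with two soft errors is reported as `"uncorrectable"`.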
Although the C/R method can correct any number of permanent (hard) errors in a data unit, it may be limited by the maximum error detection capability of the ECC, since ECC error detection is used to control the initiation of the C/R method.
The C/R error correction technique has been effectively used in commercial computer systems having SEC/DED (single error correction/double error detection) ECC stored in memory to correct two errors in a data unit; the ECC alone has a maximum ability to correct a single error in a data unit. If the data unit has a hard error bit and a soft error bit (herein called the H-S case), the ECC correction is applied to the single soft error bit after the C/R operation has corrected the hard error bit.
Currently used large computer systems desiring the best type of maintenance store a record of the occurrence of all excessive errors, whether corrected or not. This is because excessive errors are not corrected in memory, even though excessive errors may be corrected for a requestor by using the C/R method to keep a task executing that would otherwise have to quit. Hence excessive error correction by the C/R method is considered outside the normal error correction ability of the system. A C/R corrected data unit is vulnerable to crashing the system if another soft error occurs in it.
Other pertinent art was published by Bossen and Hsiao in the IBM Research Journal, May 1980, page 390, entitled "A System Solution to Memory Soft Errors".
Stringent error reporting and accounting have been used to ensure closely coordinated system maintenance in prior large computer systems. They report excessive errors to a service processor (SP) in the system which maintains records of all significant error conditions occurring in the system to determine, for example, when to switch a CPU off line to perform maintenance.
Previously, an interruption was required to both the requesting processor and to memory operation before the C/R method could be invoked. C/R invocation occurred in response to the occurrence of error detection, which caused both the processor clock and the memory access to be stopped until the processor had recovered, and an interruption signal to be sent to the system service processor (SP). Then the SP interrupted its current program and performed a recovery action on the stopped processor, which was usually to retry the instruction which stopped executing when the processor's clock was stopped. After the recovery action was completed, the SP restarted the processor and the memory resumed access for normal operation. Then the processor issued a re-fetch request that invoked the C/R method for the data unit having the excessive error. If the excessive error existed after operation of the C/R method, another processor stoppage occurred, etc. The next time the processor was restarted, it could record instruction processor damage to the task.
This prior operation of the C/R method was very slow because of the clock-stopping interruption to the requestor, the stopping of memory access, and the SP intervention required before the C/R method could be invoked. This greatly reduced the efficiency of the system, involving milliseconds instead of the normal CPU microsecond speed for each operation of the C/R method. System performance was further severely degraded because a machine check interruption caused a loss of all data in the CPU's cache and a loss of all translations in the CPU's TLB (translation lookaside buffer), which added to the reduction in CPU performance due to the need to refetch all data lost in the cache and to retranslate all addresses lost in the TLB. The program task was ABENDed (abnormally ended) by a machine check interruption if the data was not corrected.
C/R error correction does not correct any hard error or any soft error in the memory itself, even though the C/R method may correct the hard errors and the ECC method may correct a soft error in the group only during its transmission to the requestor. However, an ECC corrected soft error could be corrected in MS by storing the corrected DU+ECC group in its original location, which is sometimes called "scrubbing" the data.
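The scrubbing operation mentioned above can be sketched in a few lines. The sketch assumes a plain single-error-correcting Hamming (7,4) code with codeword bits held at integer bit positions 1 through 7, and a dictionary standing in for main storage; the names `syndrome` and `scrub` are hypothetical.

```python
def syndrome(word):
    """XOR of the positions (1..7) of all 1 bits; zero for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (word >> pos) & 1:
            s ^= pos
    return s


def scrub(memory, address):
    """Correct a single-bit error and store the corrected group back in place.

    Returns True if a correction was written, False if the word was clean.
    """
    word = memory[address]
    s = syndrome(word)
    if s == 0:
        return False
    memory[address] = word ^ (1 << s)    # store the corrected group in its original location
    return True


VALID = 0b10101010                       # bits at positions 1, 3, 5, 7: syndrome 0
memory = {0x100: VALID ^ (1 << 6)}       # soft error injected at position 6
scrub(memory, 0x100)                     # memory[0x100] is VALID again
```

Only a soft error is removed this way. If the erroneous bit position were stuck, the store would not change it, which is why the text limits scrubbing to ECC-corrected soft errors.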