Error correction codes were invented to ensure data integrity in a computer system. Error detection, in its simplest form, is performed by providing at least one extra verification bit, known as a parity bit, at the end of a block of n binary digits or bits, so that the complete block contains an even (even parity) or an odd (odd parity) number of 1 bits. At the receiving end, the complete block is checked and, if the proper number of 1's (even or odd) is not found, an error is detected.
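The parity scheme described above can be illustrated with a minimal sketch. The function names (`parity_bit`, `append_parity`, `check_parity`) and the sample data are illustrative assumptions, not taken from any particular system.

```python
def parity_bit(bits, even=True):
    """Return the bit that makes the total number of 1s even (or odd)."""
    ones = sum(bits) % 2
    return ones if even else ones ^ 1


def append_parity(bits, even=True):
    """Append the parity bit to the end of the data block."""
    return bits + [parity_bit(bits, even)]


def check_parity(block, even=True):
    """Return True if the complete block (data + parity) has the expected parity."""
    ones = sum(block) % 2
    return ones == 0 if even else ones == 1


data = [1, 0, 1, 1, 0, 0, 1, 0]      # sample block containing four 1 bits
block = append_parity(data)           # even parity: the appended bit is 0

corrupted = block.copy()
corrupted[2] ^= 1                     # a single bit flips in transit

print(check_parity(block))            # True
print(check_parity(corrupted))        # False: the error is detected
```

A single parity bit only detects an odd number of flipped bits; it cannot say which bit failed, and two flipped bits cancel each other out.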
There are various ways of performing error detection and correction by utilizing an Error Correction Code (hereinafter EC code). In more sophisticated computer systems, several parity check bits are dedicated to error detection and correction, with each parity bit being assigned to a certain block of data in order to pinpoint the exact location of potential problem areas with more accuracy.
One of the more popular EC codes is the Hamming Code. Hamming Code adds several check or verification bits to the end of each block of n (four to eight) data bits. In this way a parity byte can include, for example, 8 bits of Hamming code. When recalculated by the receiving device, these check bits can be used to determine whether each block of n data bits was received correctly, and they can, in some circumstances, be used to correct erroneous bits. Hamming code allows for single bit failure detection and correction as well as two bit failure detection. Some three bit failures will also be detected.
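As an illustration of how recalculated check bits locate a failing bit, the following sketch uses the textbook Hamming(7,4) code. It assumes the conventional layout in which check bits occupy positions 1, 2 and 4 of the codeword rather than being appended at the end; the function names are illustrative only.

```python
def hamming74_encode(d):
    """Encode 4 data bits into 7 bits laid out as: p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_syndrome(c):
    """Recalculate the checks; the result is the 1-based failing position, or 0."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    return s1 | (s2 << 1) | (s4 << 2)


def hamming74_decode(c):
    """Correct at most one flipped bit and return (data bits, syndrome)."""
    c = list(c)
    s = hamming74_syndrome(c)
    if s:
        c[s - 1] ^= 1          # flip the bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], s


codeword = hamming74_encode([1, 0, 1, 1])   # [0, 1, 1, 0, 0, 1, 1]
received = list(codeword)
received[4] ^= 1                            # single bit failure at position 5
print(hamming74_decode(received))           # ([1, 0, 1, 1], 5)
```

The syndrome is zero when all checks agree and otherwise names the position of the single failing bit, which is what makes correction possible.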
For example, in a word of thirty-two bits, composed of four eight-bit bytes having a parity check bit for each byte, single and double bit failures in the data bits as well as in the parity bits will be detected, with correction for single bit failures.
Most system designs today implement a form of the Hamming code in hardware (the storage and/or memory controller) to detect and correct single bit failures and to detect two bit failures.
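The combination of single bit correction and two bit detection is commonly obtained by adding one overall parity bit to a Hamming code. The sketch below assumes a small extended Hamming(8,4) code for readability; real controllers use wider words, and the function names and status strings are illustrative assumptions.

```python
def secded_encode(d):
    """Hamming(7,4) codeword (p1 p2 d1 p4 d2 d3 d4) plus an overall parity bit."""
    d1, d2, d3, d4 = d
    code = [d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d1, d2 ^ d3 ^ d4, d2, d3, d4]
    overall = 0
    for b in code:
        overall ^= b
    return code + [overall]


def secded_decode(word):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    c = list(word[:7])
    s = ((c[0] ^ c[2] ^ c[4] ^ c[6])
         | ((c[1] ^ c[2] ^ c[5] ^ c[6]) << 1)
         | ((c[3] ^ c[4] ^ c[5] ^ c[6]) << 2))
    overall = 0
    for b in word:
        overall ^= b                      # 0 while the total parity is still even
    if s == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        status = "corrected"              # odd number of flips: treated as one
        if s:
            c[s - 1] ^= 1                 # s == 0 means only the overall bit flipped
    else:
        return None, "uncorrectable"      # s != 0 with even parity: two bits failed
    return [c[2], c[4], c[5], c[6]], status


word = secded_encode([1, 0, 1, 1])

one_bad = list(word)
one_bad[5] ^= 1
print(secded_decode(one_bad))             # ([1, 0, 1, 1], 'corrected')

two_bad = list(word)
two_bad[1] ^= 1
two_bad[6] ^= 1
print(secded_decode(two_bad))             # (None, 'uncorrectable')
```

With two failing bits the syndrome is non-zero but no longer identifies either bit, so the decoder can only report the failure.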
Error types can be classified into hard or soft categories. A hard array error is caused by a broken component that retains its broken state when the operation is retried. A soft array error occurs when the broken state can be changed by rewriting the data to the array.
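The distinction can be sketched with a toy model of an array word. Here a hard error is modelled as a stuck-at cell and a soft error as a one-time flip of a stored bit; the `FaultyArray` class and `classify_bit` helper are hypothetical, and the sketch assumes the expected data is already known (for example, from a successful correction).

```python
class FaultyArray:
    """Toy memory word with optional stuck-at cells (hard faults)."""

    def __init__(self, width=8, stuck=None):
        self.stuck = dict(stuck or {})     # bit index -> value the cell is stuck at
        self.cells = [0] * width

    def write(self, bits):
        self.cells = list(bits)

    def read(self):
        # A stuck cell always returns its stuck value, whatever was written.
        return [self.stuck.get(i, b) for i, b in enumerate(self.cells)]

    def upset(self, i):
        """Flip one stored bit, modelling a transient (soft) disturbance."""
        self.cells[i] ^= 1


def classify_bit(array, i, expected):
    """Rewrite the expected data and reread: 'hard' if bit i still fails."""
    array.write(expected)
    return "hard" if array.read()[i] != expected[i] else "soft"


data = [1, 0, 1, 1, 0, 0, 1, 0]
arr = FaultyArray(stuck={3: 0})            # cell 3 is broken and stuck at 0
arr.write(data)
arr.upset(6)                               # cell 6 suffers a transient flip

observed = arr.read()
failing = [i for i in range(len(data)) if observed[i] != data[i]]
kinds = {i: classify_bit(arr, i, data) for i in failing}
print(kinds)                               # {3: 'hard', 6: 'soft'}
```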
Another type of error occurs during a cache castout to store. This error arises when a failure is experienced during the store, so that at the end of the store it is uncertain whether the data contained in the cache is viable. To detect this condition, Hamming Code utilizes a special error correction byte known as a Special Uncorrectable Error (SUE) byte. An SUE stores a special encoded syndrome that is recognized by the Hamming Code as an uncorrectable error. When an error has been detected in storing a block of master data, the integrity of the data is protected under this method by storing the special syndrome, which denotes an uncorrectable error, at the address of the store. In this manner, any further access to or fetching of the defective data at that address is prevented.
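The effect of an SUE can be sketched as follows. This is only a toy model under stated assumptions: the reserved pattern `SUE_SYNDROME`, the `Memory` class, the `fail` flag and the 7-bit stand-in checksum are all hypothetical and do not reflect any actual encoding.

```python
SUE_SYNDROME = 0xFF     # hypothetical reserved check-byte value, not a real encoding


class UncorrectableError(Exception):
    """Raised when a fetch hits a location marked as uncorrectable."""


class Memory:
    def __init__(self):
        self.lines = {}                     # address -> (data, check byte)

    @staticmethod
    def check_byte(data):
        # Stand-in for a real ECC check byte: a 7-bit sum, so it never equals 0xFF.
        return sum(data) % 128

    def store(self, addr, data, fail=False):
        if fail:
            # The store went wrong, so the contents are of unknown validity:
            # mark the address with the special syndrome instead of good data.
            self.lines[addr] = (None, SUE_SYNDROME)
            return False
        self.lines[addr] = (bytes(data), self.check_byte(data))
        return True

    def fetch(self, addr):
        data, check = self.lines[addr]
        if check == SUE_SYNDROME:
            raise UncorrectableError(f"SUE at address {addr:#x}")
        return data


mem = Memory()
mem.store(0x10, b"abcd")
mem.store(0x20, b"wxyz", fail=True)         # simulated failure during the store

print(mem.fetch(0x10))                      # b'abcd'
try:
    mem.fetch(0x20)
except UncorrectableError as exc:
    print("fetch refused:", exc)
```

The point of the marker is that a later fetch cannot silently consume data whose validity is unknown.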
An SUE may also be used when a memory controller fails to store data after notifying the requestor that the data is good. One example is loss of data in the logic path of the memory controller. The memory controller will force the SUE syndrome in such cases.
When dealing with a two bit failure, the failure may consist of two hard bit failures, two soft bit failures, or a combination of one hard and one soft bit failure. Once an error is detected, the Hamming Code is used to determine whether the error is an H-H, H-S/S-H, S-S or an SUE, each error condition having a special binary combination.
Once the error is detected, there are methods available for correcting particular types of errors. For example, a hard-hard error can be corrected using a complement/recomplement algorithm known in the art. This algorithm writes the binary complement of the data to the array while retaining the original failing data state. Thus, when the data is written back to its original location, the bits in the two bit hard case retain their solid error state.
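A simplified model of the complement/recomplement idea is sketched below. It assumes hard errors behave as stuck-at cells and that both stuck cells currently hold the wrong value; the `StuckArray` class and the 8-bit word width are illustrative assumptions, and a production algorithm would run together with the EC code rather than on its own.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


class StuckArray:
    """Toy memory word whose stuck-at cells (hard faults) ignore writes."""

    def __init__(self, stuck_mask, stuck_value):
        self.stuck_mask = stuck_mask
        self.stuck_value = stuck_value & stuck_mask
        self.word = 0

    def write(self, value):
        self.word = value & MASK

    def read(self):
        return (self.word & ~self.stuck_mask & MASK) | self.stuck_value


def complement_recomplement(array):
    first = array.read()                  # wrong at the failing stuck cells
    array.write(~first & MASK)            # write the complement of what was read
    second = array.read()                 # healthy cells flipped, stuck cells did not
    recovered = ~second & MASK            # recomplement the second read
    stuck = ~(first ^ second) & MASK      # cells identical in both reads are stuck
    array.write(recovered)                # write back; stuck cells keep their state
    return recovered, stuck


data = 0b10110010
arr = StuckArray(stuck_mask=0b00010010, stuck_value=0)   # bits 1 and 4 stuck at 0
arr.write(data)
print(bin(arr.read()))                    # 0b10100000: two hard bit failures

recovered, stuck = complement_recomplement(arr)
print(bin(recovered), bin(stuck))         # 0b10110010 0b10010
```

In this model a stuck cell that happened to hold the correct value would come back inverted after the recomplement step, which is one reason the technique is paired with an error correcting code in practice.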
When the failure is a combination of a hard error and a soft error, a single error correcting code algorithm is utilized. Under this algorithm, a solid error state is preserved due to the existence of the hard error.
However, under both algorithms no solid error state is established when there is a soft-soft (S-S) error, since both bits take on a new state, and the error cannot be corrected. Similarly, an SUE cannot be corrected, owing to its uncorrectable nature.
Therefore, when the detected error contains a hard error in at least one of the bits, correction can be performed easily. There are algorithms that can correct two failing hard bits (H-H), or one failing hard bit and one failing soft bit (H-S or S-H). The problem occurs when two soft state (S-S) bit failures are detected. Although detection of two bit soft state failures is possible, there is no known algorithm or code today for correcting a two bit soft error. It has been an ongoing challenge to provide an algorithm for this type of error correction.
Hence, those concerned with the development of error correction algorithms have recognized the need for a soft-soft or SUE correction code. The present invention fulfills this need and others.