1. Field of the Invention
This invention relates, in general, to error correction codes for computers and data communications, and in particular to a special coding for encoding special uncorrectable errors for computer failure isolation.
2. Description of the Related Art
Error correction codes (ECCs) have long been used in computers as well as data communications. Typically, such codes are constructed by appending $r = n - k$ check symbols to $k$ message symbols to form an $n$-symbol code word, using a linear matrix transformation of the form $C = MG$, where $C = (c_0, c_1, \ldots, c_{n-1})$ is a $1 \times n$ row vector representing the $n$-symbol code word, $M = (m_0, m_1, \ldots, m_{k-1})$ is a $1 \times k$ row vector representing the $k$-symbol message or data word, and $G$ is a $k \times n$ matrix known as a generator matrix. (Alternatively, if $C$ and $M$ are assumed to be column vectors, the transformation becomes $C = G^t M$, where $G^t$ is the transpose of $G$.) Although the symbols need not be bits, they are usually bits, and bits will be referred to in the discussion that follows. The code word $C$ is either written to a storage medium or transmitted over a communication channel. Both the storage medium and the communication channel in the narrow sense may be regarded as “communication channels” in the broad sense.
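As a concrete illustration of the transformation C = MG, the following minimal sketch encodes a 4-bit message with a systematic (7, 4) Hamming code. The particular generator matrix and the function name `encode` are assumptions chosen for illustration only; they are not taken from the related art discussed here.

```python
# Illustrative systematic generator matrix G = [I_4 | P] for a (7, 4) Hamming code.
# This specific matrix is an assumption made for the example.
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def encode(message, generator):
    """Return the code word C = M G over GF(2) for a 1 x k message row vector."""
    n = len(generator[0])
    return [
        sum(m * row[j] for m, row in zip(message, generator)) % 2
        for j in range(n)
    ]


message = [1, 0, 1, 1]
codeword = encode(message, G)
print(codeword)  # first k = 4 bits are the message, last r = 3 bits are check bits
```

Because G is in systematic form, the message bits appear unchanged at the front of the code word and the r = n − k check bits are appended after them.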
On the decoding side, an $n$-bit word $R$ is either retrieved from a storage medium or received over a communication channel. This word $R$ is the sum of the originally generated code word $C$ and an $n$-bit error word $E$ (which may be zero) representing any errors that may have occurred. To determine whether the received word accurately represents the original code word $C$, the received word $R$ is used to generate an $r$-bit syndrome vector using a matrix transformation of the form $S = RH^t$, where $S$ is the syndrome vector and $H^t$ is the transpose of an $r \times n$ matrix $H$ known as a parity check matrix. (Alternatively, if $S$ is assumed to be a column vector, the transformation becomes $S = HR^t$, where $R^t$ is the transpose of $R$.)
The parity check matrix $H$ is selected so that its row vectors lie in the null space of those of the generator matrix $G$ (i.e., $GH^t = 0$), so that for an original code word $C$, $CH^t = 0$.
Since $R = C + E$, $S = (C + E)H^t = CH^t + EH^t = 0 + EH^t = EH^t$.
In other words, the syndrome vector S is independent of the original code word C and a function only of the error word E. The decoder uses the syndrome vector S to reconstruct the error word (following maximum likelihood criteria), which is subtracted from the received word R to regenerate the code word C.
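The syndrome computation and its independence from the code word can be sketched as follows. The parity check matrix below belongs to an illustrative (7, 4) Hamming code, and the helper names `syndrome` and `correct_single_error` are hypothetical. The decoder shown handles only single-bit errors, which is the maximum likelihood choice for this particular small code, and is not the decoding scheme of any patent cited in this section.

```python
# Illustrative parity check matrix H = [P^t | I_3] of a (7, 4) Hamming code
# (an assumption for the example). Every code word C satisfies C H^t = 0.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]


def syndrome(received, parity_check):
    """Return S = R H^t over GF(2); S has one bit per row of H (r bits)."""
    return [sum(r * h for r, h in zip(received, row)) % 2 for row in parity_check]


def correct_single_error(received, parity_check):
    """Flip the one bit whose column of H equals the syndrome, if any."""
    s = syndrome(received, parity_check)
    if not any(s):
        return list(received)  # zero syndrome: accept the word as received
    for j in range(len(received)):
        if [row[j] for row in parity_check] == s:
            corrected = list(received)
            corrected[j] ^= 1  # over GF(2), subtracting E is the same as XOR
            return corrected
    raise ValueError("syndrome matches no single-bit error")


codeword = [1, 0, 1, 1, 0, 1, 0]          # a valid code word of this code
error = [0, 0, 0, 0, 0, 1, 0]             # single-bit error word E
received = [c ^ e for c, e in zip(codeword, error)]  # R = C + E

print(syndrome(codeword, H))              # zero syndrome for a code word
print(syndrome(received, H) == syndrome(error, H))   # S depends only on E
print(correct_single_error(received, H) == codeword)
```

The syndrome of the received word equals the syndrome of the error word alone, which is why the decoder can locate the error without knowing the original code word.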
A code is the set of code words C generated from a given set of data words M. (Even if two generator matrices differ, their code spaces may be the same.) Codes are commonly classified by the number of symbols in their data word M and code word C. Thus, an (n, k) code has a code word of n symbols generated from a data word of k symbols.
The ability of a code to detect and correct errors depends on the so-called Hamming distance between different code words of the code. In general, the Hamming distance between two code words is the number of symbols in which the two code words differ. If the minimum Hamming distance of a code is t+1, then the code can detect up to t errors, since if the code word has t or fewer errors, it will not have changed into any other code word. Similarly, if the minimum Hamming distance of a code is 2t+1, the code can correct up to t errors, since a received word having t or fewer errors will be within a Hamming distance of t symbols of one and only one code word, and thus can be unambiguously decoded as that code word.
Furthermore, if the minimum Hamming distance of a code is 2t+2, the code can correct up to t errors and also can detect t+1 errors, since a received word having t+1 errors will not be within a Hamming distance of t symbols from any code word and thus will be detected as having uncorrectable errors (UEs). From the foregoing, it will be apparent that to correct 2 or fewer errors and simultaneously detect 3 errors, a code must have a minimum Hamming distance of 6 symbols. Such codes are commonly referred to as double error correcting and triple error detecting (DEC-TED) codes. To give another example, codes with a minimum Hamming distance of 4 symbols can correct a single error and detect up to 2 errors, and are known as single error correcting and double error detecting (SEC-DED) codes.
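The relationship between minimum Hamming distance and error handling capability can be checked numerically. The sketch below, which assumes an illustrative (7, 4) Hamming generator matrix and hypothetical helper names, computes the minimum distance of a small code by exhaustive comparison and then applies the rules stated above to distances 4 and 6.

```python
from itertools import combinations, product

# Illustrative (7, 4) Hamming generator matrix (an assumption for the example).
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def encode(message):
    """Return C = M G over GF(2)."""
    return [sum(m * row[j] for m, row in zip(message, G)) % 2 for j in range(7)]


def hamming_distance(a, b):
    """Number of positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(codewords):
    """Smallest Hamming distance over all pairs of distinct code words."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def capabilities(d_min):
    """Return (t, s): correct up to t errors while every pattern of up to
    s errors is either corrected or flagged, for minimum distance d_min."""
    t = (d_min - 1) // 2
    return t, d_min - 1 - t


codewords = [encode(m) for m in product([0, 1], repeat=4)]
d_min = minimum_distance(codewords)

print(d_min)            # minimum distance of the illustrative code
print(capabilities(4))  # SEC-DED: correct 1, detect up to 2
print(capabilities(6))  # DEC-TED: correct 2, detect up to 3
```

The exhaustive pairwise search is practical only for very small codes. For a linear code the same value can be obtained as the minimum weight of a non-zero code word.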
Single error correcting and double error detecting (SEC-DED) codes have been widely used to protect computer memory subsystems from failures. As certain critical data such as storage protection keys in computers requires a higher level of error protection, SEC-DED codes may not be adequate. In this case, a double error correcting, triple error detecting (DEC-TED) code may be desired.
Error correction codes capable of correcting double errors and detecting triple errors can be constructed based on the well-known BCH (Bose-Chaudhuri-Hocquenghem) theory (see W. Peterson and E. J. Weldon Jr., Error-Correcting Codes, 1972, MIT Press). A primitive BCH DEC-TED code of length $n = 2^m - 1$ with $2m + 1$ check bits is obtained with a parity check matrix, each column vector $k$ of which consists of $1$, $\alpha^k$ and $\alpha^{3k}$, where $\alpha$ is a primitive element of the finite field of $2^m$ elements. Olderdissen describes a rapid decoding of the primitive BCH DEC-TED codes in U.S. Pat. No. 4,556,977. On the other hand, a non-primitive BCH DEC-TED code of length $n = 2^m + 1$ with $2m + 1$ check bits can be obtained with a parity check matrix, each column vector $k$ of which consists of $1$ and $\beta^k$, where $\beta$ is a primitive root of $x^n - 1$ in the finite field of $2^{2m}$ elements. A non-primitive BCH DEC-TED code provides two more data bits than a primitive BCH DEC-TED code with the same number of check bits. One drawback of the Olderdissen decoding scheme is that it is not applicable to non-primitive BCH DEC-TED codes. In U.S. Pat. No. 4,117,458, Burghard and Coletti describe a decoding scheme based on a brute force table-look-up approach for a non-BCH code of length 17 ($= 2^4 + 1$) with 8 data bits and 9 check bits. In addition, their error detection scheme is limited to triple errors. The decoding table does not detect multiple errors beyond three that are theoretically detectable.
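The primitive BCH DEC-TED construction can be illustrated for m = 4. The sketch below is an illustration under stated assumptions, not the decoder of any cited patent. It builds each 9-bit parity check column from 1, α^k and α^{3k}, then exhaustively enumerates all 15-bit words to find the code words and their minimum weight. The field representation (primitive polynomial x^4 + x + 1) and all function names are hypothetical choices made for the example.

```python
def gf16_powers():
    """Return [alpha^0, ..., alpha^14] as 4-bit integers in GF(16),
    assuming the primitive polynomial x^4 + x + 1 (an illustrative choice)."""
    powers = []
    x = 1
    for _ in range(15):
        powers.append(x)
        x <<= 1                 # multiply by alpha
        if x & 0b10000:
            x ^= 0b10011        # reduce modulo x^4 + x + 1
    return powers


ALPHA = gf16_powers()


def column(k):
    """Pack column k of H, i.e. (1, alpha^k, alpha^(3k)), into a 9-bit integer."""
    return (1 << 8) | (ALPHA[k % 15] << 4) | ALPHA[(3 * k) % 15]


COLUMNS = [column(k) for k in range(15)]   # n = 2^4 - 1 = 15 columns


def syndrome(word):
    """XOR together the columns of H selected by the 1 bits of a 15-bit word."""
    s = 0
    for k in range(15):
        if (word >> k) & 1:
            s ^= COLUMNS[k]
    return s


# A word is a code word exactly when its syndrome is zero.
codewords = [w for w in range(1 << 15) if syndrome(w) == 0]
min_weight = min(bin(w).count("1") for w in codewords if w != 0)

print(len(codewords))   # 2^k code words, with k = 15 - 9 = 6 data bits
print(min_weight)       # minimum distance of the linear code
```

With n = 15 and 2m + 1 = 9 check bits, the enumeration finds 2^6 code words and a minimum weight of 6, which is the distance required for double error correction with triple error detection.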
Recent ECC design for computer applications requires the ability to detect memory address errors as well as the ability to isolate component failures with invalid data indicators (see, for example, U.S. Pat. No. 6,457,154 for memory address error detection and U.S. Pat. No. 6,519,736 for failure isolation of computer components with invalid data indicators). For memory address error detection, extra data bits are required for encoding the parity of a memory address. A special data invalid indicator, also known as a special UE (SPUE) indicator, is generated when the data sent out of a particular computer component to the memory is known to be bad. As the special UEs come from different computer components, it is desirable to be able to identify the source that generates a particular special UE when the data associated with the special UE is fetched from the memory. To meet this requirement, extra data bits are also required for the encoding of the special UEs. In the prior art, a plurality of data bits are reserved for multiple special UEs, which is an inefficient use of ECC data bits, especially when the number of available ECC data bits is limited.