1. Field of the Invention
This invention is related to error correction code (ECC) systems.
2. Description of the Related Art
Error codes are commonly used in electronic systems to detect and correct data errors, such as transmission errors or storage errors. For example, error codes may be used to detect and correct errors in data transmitted via any transmission medium (e.g. conductors and/or transmitting devices between chips in an electronic system, a network connection, a telephone line, a radio transmitter, other wireless transmission, etc.). Error codes may additionally be used to detect and correct errors associated with data stored in the memory of computer systems. One common use of error codes is to detect and correct errors in data transmitted on a data bus of a computer system. In such systems, error correction bits, or check bits, may be generated for the data prior to its transfer or storage. When the data is received or retrieved, the check bits may be used to detect and correct errors within the data.
Component failures are a common source of error in electrical systems. Faulty components may include faulty memory chips or faulty data paths provided between devices of a system. Faulty data paths can result from, for example, faulty pins, faulty data traces, or faulty wires. Additionally, memory modules, which may contain multiple memory chips, may fail. Circuitry which drives the data paths may also fail.
Another source of error in electrical systems may be so-called “soft” or “transient” errors. Transient memory errors may be caused by the occurrence of an event, rather than a defect in the memory circuitry itself. Transient memory errors may occur due to, for example, random alpha particles or cosmic rays striking the memory circuit. Transient communication errors may occur due to noise on the data paths, inaccurate sampling of the data due to clock drift, etc. On the other hand, “hard” or “persistent” errors may occur due to component failure.
Generally, various error detection code (EDC) and error correction code (ECC) schemes are used to detect and correct memory and/or communication errors. For example, single error correct/double error detect (SEC/DED) schemes have been popular in the past. However, both hard and soft errors in a memory chip may cause multibit errors in the output of that chip. SEC/DED schemes may often not detect such errors, reducing reliability. Accordingly, “Chip-Correct” schemes have been introduced (also referred to as Chipkill ECC memory™, a trademark of International Business Machines Corporation). Generally, Chip-Correct schemes are designed to detect multi-bit errors occurring in a single memory chip, and to correct those errors.
One Chip-Correct ECC scheme uses Reed-Solomon (RS) codes to define the check bits. An RS code treats the data to be protected as symbols having b bits, where b is an integer greater than one. For example, b may be the number of bits of the data that are stored in an individual memory chip. Generally, RS codes may be designed to detect and correct errors in one or more symbols of the protected data. FIG. 1 is a diagram illustrating the equations used for a typical RS code to correct one symbol error (e.g. one or more bit errors in one memory chip). The RS code is based on Galois Field (GF) arithmetic. Generally, a Galois Field is a finite field of numbers having the property that arithmetic operations on field elements (numbers in the field) have a result in the field (i.e. another element of the field). An element of a field will be noted herein as “e_i”, except for 0, which will be noted as “0”. Addition may be defined in a Galois Field of size 2^b to be the bitwise exclusive OR (XOR) of the elements, and multiplication of two elements e_i and e_j may be defined as e_((i+j) mod (2^b − 1)).
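The field arithmetic described above may be sketched as follows. This is a minimal illustration only: the symbol width b=4, the primitive polynomial x^4+x+1, and the names EXP, LOG, gf_add and gf_mul are assumptions chosen for the example and are not taken from FIG. 1. Element e_i is assumed to be the i-th power of a primitive element, which is one construction consistent with the multiplication rule given above.

```python
B = 4                  # bits per symbol (assumed for illustration)
PRIM_POLY = 0b10011    # x^4 + x + 1, an assumed primitive polynomial
ORDER = (1 << B) - 1   # number of non-zero elements, 2^b - 1

# EXP[i] holds the bit pattern of element e_i; LOG is the inverse map.
EXP = []
value = 1
for _ in range(ORDER):
    EXP.append(value)
    value <<= 1
    if value >> B:     # overflowed b bits: reduce by the polynomial
        value ^= PRIM_POLY
LOG = {v: i for i, v in enumerate(EXP)}


def gf_add(x, y):
    """Addition in GF(2^b) is bitwise XOR."""
    return x ^ y


def gf_mul(x, y):
    """e_i times e_j is e_((i+j) mod (2^b - 1)); 0 times anything is 0."""
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % ORDER]
```

With these assumed tables, gf_mul(EXP[3], EXP[14]) yields EXP[2], since (3+14) mod 15 is 2, and gf_add of any element with itself yields 0.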
The first equation shown in FIG. 1 (labeled the magnitude equation) calculates the syndrome s_0 as the sum (in GF(2^b)) of a set of symbols d_0 through d_(n−1). That is, each symbol d_0 to d_(n−1) is an element of GF(2^b). If there are no errors, the sum is zero. The second equation shown in FIG. 1 (labeled the locator equation) multiplies (in GF(2^b)) each symbol d_0 to d_(n−1) by a distinct, non-zero element of GF(2^b) (e_0 to e_(n−1) in FIG. 1). The sum of the products is s_1, and is also equal to zero for the error-free case. On the other hand, an error of magnitude e_j may occur in the k-th memory. That is, e_j may identify the bits that are in error within the symbol d_k. If such an error occurs, the output of the k-th memory is changed by e_j, or (in GF(2^b)) the output may be the original data + e_j. Thus, s_0 = e_j if such an error occurs, detecting the error (because s_0 is not zero) and providing the magnitude of the error. In the locator equation, each symbol is multiplied by a distinct, non-zero element of GF(2^b). Accordingly, an error of magnitude e_j in the k-th memory results in s_1 = e_k × s_0 (in GF(2^b)). Thus, k may be determined, locating the error. The error may then be corrected based on the magnitude of the error.
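The syndrome computation and single-symbol correction described above may be sketched as follows. This is an illustration under assumptions, not the circuit of FIG. 1: it assumes b=4, the primitive polynomial x^4+x+1, and that symbol d_k is multiplied by e_k in the locator equation. The function names and the handling of inconsistent syndromes (raising an exception) are hypothetical additions for the example.

```python
B = 4                  # bits per symbol (assumed)
PRIM_POLY = 0b10011    # x^4 + x + 1 (assumed primitive polynomial)
ORDER = (1 << B) - 1

EXP = []               # EXP[i] is the bit pattern of e_i
value = 1
for _ in range(ORDER):
    EXP.append(value)
    value <<= 1
    if value >> B:
        value ^= PRIM_POLY
LOG = {v: i for i, v in enumerate(EXP)}


def gf_mul(x, y):
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % ORDER]


def syndromes(word):
    """Return (s_0, s_1): the magnitude and locator sums over all symbols."""
    s0 = 0
    s1 = 0
    for k, d in enumerate(word):
        s0 ^= d                      # magnitude equation
        s1 ^= gf_mul(EXP[k], d)      # locator equation, coefficient e_k
    return s0, s1


def correct_single_symbol(word):
    """Return (corrected word, index of repaired symbol or None)."""
    s0, s1 = syndromes(word)
    if s0 == 0 and s1 == 0:
        return list(word), None      # error-free case
    if s0 == 0 or s1 == 0:
        raise ValueError("syndromes inconsistent with a single symbol error")
    # s_1 = e_k * s_0, so k is the difference of the two logarithms.
    k = (LOG[s1] - LOG[s0]) % ORDER
    if k >= len(word):
        raise ValueError("located symbol is outside the word")
    fixed = list(word)
    fixed[k] ^= s0                   # s_0 is the error magnitude
    return fixed, k


valid = [2, 1, 2, 1, 0, 0]           # chosen so that both syndromes are zero
received = list(valid)
received[3] ^= 0b1011                # multi-bit error within one symbol
repaired, location = correct_single_symbol(received)
```

In this example the corrupted symbol is located at index 3 and the repaired word equals the original word.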
Two check symbols, each having b bits, are included along with the data in the symbols d_0 to d_(n−1). For example, d_(n−1) and d_(n−2) may be the check symbols. Symbol d_(n−1) may be generated when the data is written to memory to ensure that s_0 equals zero (e.g. the sum, in GF(2^b), of the other symbols). Symbol d_(n−2) may be generated when the data is written to memory to ensure that s_1 equals zero. Accordingly, the RS codes require 2×b check bits (i.e. 2 check symbols of b bits each). Unfortunately, adding the memory to store the 2×b check bits may be cost-prohibitive in some cases. However, returning to SEC/DED codes (which may use fewer check bits) may not provide the desired level of reliability.
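One way the two check symbols might be computed is sketched below. Because both check symbols appear in both equations, the sketch solves the magnitude and locator equations together for d_(n−2) and then sets d_(n−1) to the sum of the other symbols. The parameters (b=4, primitive polynomial x^4+x+1, coefficient e_k for symbol d_k) and the function names are assumptions made for illustration only.

```python
B = 4                  # bits per symbol (assumed)
PRIM_POLY = 0b10011    # x^4 + x + 1 (assumed primitive polynomial)
ORDER = (1 << B) - 1

EXP = []               # EXP[i] is the bit pattern of e_i
value = 1
for _ in range(ORDER):
    EXP.append(value)
    value <<= 1
    if value >> B:
        value ^= PRIM_POLY
LOG = {v: i for i, v in enumerate(EXP)}


def gf_mul(x, y):
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % ORDER]


def gf_div(x, y):
    if y == 0:
        raise ZeroDivisionError("division by zero in GF(2^b)")
    if x == 0:
        return 0
    return EXP[(LOG[x] - LOG[y]) % ORDER]


def syndromes(word):
    s0 = 0
    s1 = 0
    for k, d in enumerate(word):
        s0 ^= d
        s1 ^= gf_mul(EXP[k], d)
    return s0, s1


def add_check_symbols(data):
    """Append d_(n-2) and d_(n-1) so that s_0 and s_1 are both zero."""
    n = len(data) + 2
    if n > ORDER:
        raise ValueError("not enough distinct non-zero coefficients")
    a, bsum = syndromes(data)        # partial sums over the data symbols
    e_lo, e_hi = EXP[n - 2], EXP[n - 1]
    # From s_0 = 0: d_(n-1) = a + d_(n-2). Substituting into s_1 = 0 gives
    # d_(n-2) * (e_(n-2) + e_(n-1)) = bsum + e_(n-1) * a.
    d_lo = gf_div(bsum ^ gf_mul(e_hi, a), e_lo ^ e_hi)
    d_hi = a ^ d_lo                  # sum of all other symbols
    return list(data) + [d_lo, d_hi]


codeword = add_check_symbols([7, 0, 12, 5])
```

The divisor e_(n−2) + e_(n−1) is never zero because the two coefficients are distinct. Here 4 data symbols of 4 bits each are protected by 2×4 = 8 check bits.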
Note that the multiplications in GF(2^b) of e_i by a symbol (e.g. the multiplications illustrated in the s_1 equation) may be the equivalent, in the bit domain, of a matrix multiplication of a b×b matrix whose columns are e_(i+b−1), e_(i+b−2), . . . to e_i and the symbol represented as a b×1 matrix. Each column of the b×b matrix is constructed with the top bit as the most significant bit of the element forming that column. Thus, the b×b matrix corresponding to multiplication by e_0 has the columns e_(b−1), e_(b−2), . . . to e_0 (referred to as the base matrix). Matrices for multiplying by e_k are obtained by multiplying the columns in the base matrix by e_k.
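The bit-domain equivalence described above may be checked with a short sketch. It assumes b=4, the primitive polynomial x^4+x+1, that the symbol vector is written with its most significant bit at the top, and that column indices wrap modulo 2^b − 1 in keeping with the multiplication rule for field elements. The function names are hypothetical.

```python
B = 4                  # bits per symbol (assumed)
PRIM_POLY = 0b10011    # x^4 + x + 1 (assumed primitive polynomial)
ORDER = (1 << B) - 1

EXP = []               # EXP[i] is the bit pattern of e_i
value = 1
for _ in range(ORDER):
    EXP.append(value)
    value <<= 1
    if value >> B:
        value ^= PRIM_POLY
LOG = {v: i for i, v in enumerate(EXP)}


def gf_mul(x, y):
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % ORDER]


def mul_matrix(i):
    """b x b bit matrix with columns e_(i+b-1), ..., e_i (row 0 = MSB)."""
    columns = [EXP[(i + j) % ORDER] for j in range(B - 1, -1, -1)]
    return [[(col >> (B - 1 - r)) & 1 for col in columns] for r in range(B)]


def mat_vec(matrix, symbol):
    """Multiply the bit matrix by the symbol's bit vector over GF(2)."""
    bits = [(symbol >> (B - 1 - c)) & 1 for c in range(B)]   # MSB first
    out = 0
    for row in matrix:
        bit = 0
        for m, v in zip(row, bits):
            bit ^= m & v             # AND is multiply, XOR is add in GF(2)
        out = (out << 1) | bit
    return out


# The matrix form agrees with field multiplication for every e_i and symbol.
agrees = all(
    mat_vec(mul_matrix(i), s) == gf_mul(EXP[i], s)
    for i in range(ORDER)
    for s in range(1 << B)
)
```

Under these assumptions the base matrix, mul_matrix(0), is the identity matrix, since its columns e_(b−1) to e_0 each have a single bit set.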