1. Field of the Invention
This invention relates to error detection and correction in electronic systems and, more particularly, to systems that employ error correction codes to facilitate detection and correction of bit errors.
2. Description of the Related Art
Error codes are commonly used in electronic systems to detect and correct data errors, such as transmission errors or storage errors. For example, error codes may be used to detect and correct errors in data transmitted via a telephone line, a radio transmitter, or a compact disc laser. Error codes may additionally be used to detect and correct errors associated with data stored in the memory of computer systems. One common use of error codes is to detect and correct errors of data transmitted on a data bus of a computer system. In such systems, error correction bits, or check bits, may be generated for the data prior to its transfer or storage. When the data is received or retrieved, the check bits may be used to detect and correct errors within the data.
Component failures are a common source of error in electrical systems. Faulty components may include faulty memory chips or faulty data paths provided between devices of a system. Faulty data paths can result from, for example, faulty pins, faulty data traces, or faulty wires.
Hamming codes are a commonly used type of error code. The check bits in a Hamming code are parity bits for portions of the data bits. Each check bit provides the parity for a unique subset of the data bits. If an error occurs (i.e. one or more of the data bits unintentionally change state), one or more of the check bits upon regeneration will also change state (assuming the error is within the class of errors covered by the code). By determining the specific bits of the regenerated check bits that changed state, the location of the error within the data may be determined. For example, if one data bit changes state, this data bit will cause one or more of the regenerated check bits to change state. Because each data bit contributes to a unique group of check bits, the check bits that are modified will identify the data bit that changed state. The error may be corrected by inverting the bit identified as being erroneous.
One common use of Hamming codes is to correct single bit errors within a group of data. Generally speaking, the number of check bits must be large enough such that 2^k − 1 is greater than or equal to n + k, where k is the number of check bits and n is the number of data bits. Accordingly, seven check bits are typically required to implement a single error correcting Hamming code for 64 data bits. A single error correcting Hamming code is capable of detecting and correcting a single error.
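The check-bit bound stated above can be verified with a short calculation. The following is a minimal sketch; the function name min_check_bits is hypothetical and is used only for illustration.

```python
def min_check_bits(n: int) -> int:
    """Return the smallest k such that 2**k - 1 >= n + k."""
    k = 1
    while 2**k - 1 < n + k:
        k += 1
    return k


# 4 data bits need 3 check bits; 64 data bits need 7 check bits.
for n in (4, 64):
    print(n, min_check_bits(n))
```

For n = 64, six check bits are insufficient because 2^6 − 1 = 63 is less than 64 + 6 = 70, whereas seven check bits give 2^7 − 1 = 127, which is at least 71.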
FIGS. 1-3 illustrate an example of a system employing a single-error correction (SEC) Hamming code. In this example, four data bits (D4, D3, D2, and D1) are protected using three check bits (P1, P2, and P3). The parity generator 10 (FIG. 1) is used to encode the data block that contains the data bits and the check bits. The encoding process is performed prior to storing or communicating the data. FIG. 2 shows an assignment of data bits to calculate the check bits. In this example, the check bit P1 is generated by an XOR (exclusive OR) of the binary values in D4, D3, and D1. Similarly, the check bit P2 is generated by an XOR of the binary values in D4, D2, and D1, and the check bit P3 is generated by an XOR of the binary values in D3, D2 and D1. FIG. 3 shows the bit positions and the corresponding content of these positions within the encoded data block. The data block, which includes the data bits and the generated check bits, may then be stored in a memory chip or communicated over a data communication path.
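The encoding step of this example may be sketched as follows. The parity assignments are those stated above. The ordering of bits within the encoded block (P1, P2, D4, P3, D3, D2, D1 in positions 1 through 7) is an assumption inferred from those parity assignments, since FIG. 3 is not reproduced here; the function name encode is hypothetical.

```python
def encode(d4: int, d3: int, d2: int, d1: int) -> list[int]:
    """Encode four data bits into a 7-bit block with three check bits."""
    p1 = d4 ^ d3 ^ d1  # check bit P1 covers D4, D3, D1
    p2 = d4 ^ d2 ^ d1  # check bit P2 covers D4, D2, D1
    p3 = d3 ^ d2 ^ d1  # check bit P3 covers D3, D2, D1
    # Assumed layout, positions 1..7: P1 P2 D4 P3 D3 D2 D1
    return [p1, p2, d4, p3, d3, d2, d1]


print(encode(1, 0, 1, 1))  # [0, 1, 1, 0, 0, 1, 1]
```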
At the point of receipt, the data block is retrieved and decoded. The decoding process involves performing a validity check on the received word, and executing an error correction technique if an error was detected. To check whether an error occurred in the storage (or transmission) of the data block, the check bits P1, P2, and P3 are effectively regenerated using the received data, and each regenerated check bit is XORed with the corresponding received check bit to generate a corresponding syndrome bit. FIG. 4 is a table depicting a manner in which these syndrome bits may be generated. More particularly, syndrome bit S1 may be generated by XORing the received binary values in P1, D4, D3, and D1. If none of the received data bits (D4, D3, D1) is erroneous, the value of the received check bit P1 is effectively XORed with itself, and the syndrome bit S1 will be 0 (assuming the original check bit P1 is not erroneous). If one of the data bits (D4, D3, D1) or the check bit P1 is erroneous, the syndrome bit S1 will be 1 (asserted), thus indicating an error. Syndrome bits S2 and S3 may be generated similarly. Taken collectively, the syndrome bits S1, S2 and S3 may be used to identify the location of an erroneous bit. For example, the binary value of the syndrome bits in the order [S3, S2, S1] indicates the position of the erroneous bit within the 7 bit data block as depicted in FIG. 3. If the syndrome code is all zeros (i.e. “000”), the data has no single bit error. Upon identification of the erroneous bit position, the error is corrected by inverting the binary value in that position, i.e. from 0 to 1 or from 1 to 0.
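The decoding process described above may be sketched as follows. The block layout (P1, P2, D4, P3, D3, D2, D1 in positions 1 through 7) is an assumption inferred from the stated parity assignments, and the function names syndrome and correct are hypothetical.

```python
def syndrome(word: list[int]) -> int:
    """Return the syndrome [S3, S2, S1] of a 7-bit block as an integer."""
    p1, p2, d4, p3, d3, d2, d1 = word  # assumed layout, positions 1..7
    s1 = p1 ^ d4 ^ d3 ^ d1
    s2 = p2 ^ d4 ^ d2 ^ d1
    s3 = p3 ^ d3 ^ d2 ^ d1
    return (s3 << 2) | (s2 << 1) | s1


def correct(word: list[int]) -> tuple[list[int], int]:
    """Invert the bit named by the syndrome; position 0 means no error found."""
    position = syndrome(word)
    fixed = list(word)
    if position:
        fixed[position - 1] ^= 1
    return fixed, position


sent = [0, 1, 1, 0, 0, 1, 1]   # valid block for D4..D1 = 1, 0, 1, 1
received = list(sent)
received[4] ^= 1               # single-bit error at position 5 (D3)
fixed, position = correct(received)
print(position, fixed == sent)
```

As with any single error correcting code, a block containing two erroneous bits would yield a nonzero syndrome that names a third, incorrect position, so this sketch only applies when at most one bit is in error.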
It is a common practice to store data in, or communicate data through, multiple components. For example, a data block may be stored in a plurality of memory chips, or it may be communicated through a plurality of wires. An error may be introduced if one of the components is faulty. A Hamming code such as that described above may be used to address error correction in such systems.
For example, consider the case of storing D bits of data that are protected by C check bits using M memory chips. The data block therefore contains D+C bits. If the data block is to be evenly divided among the M memory chips, each memory chip will store X of the data and/or check bits of the data block, where X=(D+C)/M. The standard approach to providing error correction for chip failures is to divide the D+C data and check bits into X logical groups each including M bits, and to assign 1 bit from each chip to each of the groups. The check bits in each group form a SEC (single-error correcting) code such as a Hamming code. When any chip fails, it introduces at most one error into each group, and these errors are corrected independently using the SEC codes. If a Hamming code is used in each group, a total of C=X*L check bits are required, where L is the smallest integer such that 2^L > M. This standard approach is inefficient because each group is able to independently identify which bit (if any) within the group is in error. However, if the only failures considered are memory chip failures, the failures in different groups are highly correlated.
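The grouping and the resulting check-bit count of the standard approach may be illustrated as follows. This is a minimal sketch; the values M = 12 and X = 4 are hypothetical and are chosen only to make the arithmetic concrete, and the function names are likewise hypothetical.

```python
def sec_check_bits_per_group(m: int) -> int:
    """Return the smallest integer L such that 2**L > m."""
    L = 1
    while 2**L <= m:
        L += 1
    return L


def groups_from_chips(chips: list[list[str]]) -> list[list[str]]:
    """Group j takes bit j from every chip, so each chip gives one bit per group."""
    return [list(column) for column in zip(*chips)]


M, X = 12, 4                      # hypothetical: 12 chips, 4 bits per chip
L = sec_check_bits_per_group(M)   # check bits needed in each group
C = X * L                         # total check bits
D = X * M - C                     # data bits left over
chips = [[f"chip{c}.bit{j}" for j in range(X)] for c in range(M)]
groups = groups_from_chips(chips)
print(L, C, D)
print(groups[0][:3])
```

Because each group holds exactly one bit from each chip, the failure of a single chip can disturb at most one bit in any group, which is the property the per-group SEC codes rely upon.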
In some systems, in addition to correcting single-bit errors due to component failures, it may also be desirable to detect any double-bit errors that may occur. The standard approach is to evenly divide the data block among the memory chips in the manner as described above, and to generate check bits for each group which form an SEC-DED (single-error correcting, double-error detecting) code such as an extended Hamming code. When any chip fails, it introduces at most one error into each group, and these errors are corrected independently using the SEC-DED codes. When two arbitrary bits are in error, they are either corrected (if they lie in different groups) or are detected (if they lie in the same group). If an extended Hamming code is used in each group, a total of C=X*L check bits are required, where L is the smallest integer such that 2^(L−1) > M. Similar to the foregoing discussion, however, the use of extended Hamming codes in such systems is inefficient.
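The check-bit requirement for the extended Hamming approach may be computed directly from the condition stated above. The following is a minimal sketch; the function names and the example values of X and M are hypothetical.

```python
def secded_check_bits_per_group(m: int) -> int:
    """Return the smallest integer L such that 2**(L - 1) > m."""
    L = 1
    while 2 ** (L - 1) <= m:
        L += 1
    return L


def secded_total_check_bits(x: int, m: int) -> int:
    """Total check bits C = X * L over X groups of M bits each."""
    return x * secded_check_bits_per_group(m)


print(secded_check_bits_per_group(72))   # 8 check bits for a 72-bit group
print(secded_total_check_bits(4, 12))    # 20 check bits for 4 groups of 12 bits
```

The total grows linearly with the number of groups X, since every group carries its own full set of L check bits.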
It would be desirable to provide a system and method which allow for the reliable storage or transmission of data in environments wherein component failures are possible. In particular, it would be desirable to provide a system and method which allow for the detection of arbitrary double-bit errors while performing correction of errors due to component failures where the number of required check bits may be reduced.
The problems outlined above may in large part be solved by a system and method for detecting and correcting errors in a data block in accordance with the present invention. In one embodiment, a system includes a check bits generation unit which receives and encodes data to be protected. The check bits generation unit effectively partitions the data into a plurality of logical groups. The check bits generation unit generates a parity bit for each of the logical groups, and additionally generates a pair of global error correction codes, referred to generally as a first global error correction code and a second global error correction code. In one implementation, data at corresponding bit positions within the logical groups are conveyed through a common component, such as the same wire, or are stored in the same component, such as the same memory chip. Additionally, data bits at different bit positions within a given logical group are not conveyed through, or are not stored within, a common component.
In one particular embodiment, the data is divided into a total of X logical groups. The first global error correction code (also referred to in this embodiment as an “untwisted” global error correction code) is equivalent to the result of generating an individual error correction code for each logical group and XORing the collection of individual error correction codes together. The second global error correction code (also referred to in this embodiment as the “twisted” global error correction code) is equivalent to the result of (or may be derived by) shifting (either linearly or cyclically) the error correction code for a given ith group by i bit positions, wherein i=0 to X−1, and by XORing corresponding columns of the resulting shifted error correction codes together. The data and the check bits (collectively formed by the parity bit for each logical group and the first and second global error correction codes) are then conveyed through a communication channel or are stored in memory.
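The generation of the parity bits and of the two global error correction codes may be sketched as follows. Several points are assumptions made only for illustration: the per-group code group_ecc (the XOR of the 1-based positions holding a 1) stands in for whatever individual error correction code an implementation would use, the shift is taken to be linear rather than cyclic, and all function names are hypothetical.

```python
def group_ecc(bits: list[int]) -> int:
    """Hypothetical per-group code: XOR of the 1-based positions holding a 1."""
    code = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            code ^= position
    return code


def generate_check_bits(groups: list[list[int]]) -> tuple[list[int], int, int]:
    """Return (per-group parity bits, untwisted code, twisted code)."""
    parity = [sum(group) % 2 for group in groups]
    untwisted = 0
    twisted = 0
    for i, group in enumerate(groups):
        code = group_ecc(group)
        untwisted ^= code        # XOR of the individual codes
        twisted ^= code << i     # code of group i shifted linearly by i positions
    return parity, untwisted, twisted


groups = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
print(generate_check_bits(groups))  # ([1, 0, 0], 4, 8)
```

In this example the individual codes are 6, 1, and 3, so the untwisted code is 6 XOR 1 XOR 3 = 4 and the twisted code is 6 XOR (1 << 1) XOR (3 << 2) = 8.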
The system further includes an error correction unit which is coupled to receive the plurality of data bits and the check bits following storage or transmission. The error correction unit is configured to generate a parity error bit for each of the logical groups of data based on the received data bits and the original parity bits. The parity error bits indicate whether a change in parity for each logical group has occurred.
The error correction unit is further configured to generate a regenerated first global error correction code in the same manner in which the original first global error code is derived, using the received data. Thus, in one embodiment, the regenerated first global error correction code is equivalent to the result of generating an individual error correction code for each logical group (of the received data), and XORing them together. A first global syndrome code is then generated by XORing the original first global error correction code with the regenerated first global error correction code.
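The generation of the parity error bits and of the first global syndrome code may be sketched as follows. The per-group code group_ecc (the XOR of the 1-based positions holding a 1) is an assumption that stands in for the individual error correction code of an actual implementation, and all function names and data values are hypothetical.

```python
def group_ecc(bits: list[int]) -> int:
    """Hypothetical per-group code: XOR of the 1-based positions holding a 1."""
    code = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            code ^= position
    return code


def first_global_code(groups: list[list[int]]) -> int:
    """XOR of the individual codes of all logical groups."""
    code = 0
    for group in groups:
        code ^= group_ecc(group)
    return code


def check(received, stored_parity, stored_first):
    """Return (parity error bits, first global syndrome code)."""
    parity_errors = [(sum(g) % 2) ^ p for g, p in zip(received, stored_parity)]
    first_syndrome = stored_first ^ first_global_code(received)
    return parity_errors, first_syndrome


original = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
stored_parity = [sum(g) % 2 for g in original]
stored_first = first_global_code(original)

# Position 3 fails in group 1 only: one parity error bit, syndrome names position 3.
one_error = [[1, 0, 1, 1], [0, 1, 0, 0], [1, 1, 0, 0]]
print(check(one_error, stored_parity, stored_first))

# Position 3 fails in groups 0 and 2: two parity error bits, syndrome cancels to 0.
same_column = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 1, 0]]
print(check(same_column, stored_parity, stored_first))
```

The second case shows why the first global syndrome code alone cannot locate errors when an even number of groups is affected at the same bit position: the identical per-group contributions cancel under XOR.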
Subsequent operations of the error correction unit are dependent upon whether an odd number of the parity error bits is asserted or an even number of the parity error bits is asserted. In one particular embodiment, in response to detecting that an odd number of the parity error bits are asserted, the error correction unit uses a binary value of the first syndrome code to determine the bit position of any errors within any of the logical groups. Using this information, the error correction unit may correct the errors by inverting the values at the positions indicated as being erroneous.
In an alternative operation, in response to detecting that an even number of the parity error bits are asserted, the error correction unit determines whether the first syndrome code has an all-zeros value. If not, the error correction unit generates an error signal indicating that an uncorrectable error in the data exists. This condition will occur whenever any uncorrectable double-bit error in the data block is present (i.e., when two bits in different positions in any of the logical groups have errors). On the other hand, if the first global syndrome code has an all-zeros value, the error correction unit determines whether any of the parity error bits are asserted. If not, the data is determined to be correct as received. If any of the parity error bits are asserted, the error correction unit generates a regenerated second global error correction code in the same manner in which the original second global error correction code is derived. Thus, in one embodiment, the regenerated second global error correction code is equivalent to the result of (or may be derived by) shifting the regenerated error correction code for a given ith group by i bit positions, wherein i=0 to X−1, and by XORing corresponding columns of the resulting shifted error correction codes together. A second global syndrome code is then generated by XORing the original second global error correction code with the regenerated second global error correction code. The error correction unit then uses the binary value of a row syndrome code derived from the second global syndrome code to determine the position of an error in any of the logical groups for which an error is indicated (pursuant to the corresponding parity error bits), and corrects the errors, if present. In alternative embodiments, the second global syndrome code (rather than the first syndrome code) is also used to determine the position of correctable errors when an odd number of the parity error bits is set.
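The selection among the operations described above, based upon the parity error bits and the first global syndrome code, may be sketched as follows. The function name and the returned labels are hypothetical, and the sketch covers only the decision itself, not the derivation of the row syndrome code or the correction of the data.

```python
def select_operation(parity_errors: list[int], first_syndrome: int) -> str:
    """Choose the decoder action from the parity error bits and first syndrome."""
    asserted = sum(parity_errors)
    if asserted % 2 == 1:
        # Odd number asserted: the first syndrome code gives the bit position.
        return "correct_using_first_syndrome"
    if first_syndrome != 0:
        # Even number asserted with a nonzero first syndrome code.
        return "uncorrectable_error"
    if asserted == 0:
        # No parity error bits asserted and an all-zeros first syndrome code.
        return "no_error"
    # Even, nonzero number asserted with an all-zeros first syndrome code:
    # the position is derived from the second global syndrome code.
    return "correct_using_second_syndrome"


print(select_operation([0, 1, 0], 3))
print(select_operation([1, 0, 1], 5))
print(select_operation([0, 0, 0], 0))
print(select_operation([1, 0, 1], 0))
```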
In general, the first global syndrome code is generated such that it is indicative of situations in which two bit errors in different bit positions within the logical groups are present in the received data. In such situations, the first global error correction code is different from the regenerated first global error correction code. In the embodiment described above, if the first syndrome code is not an all-zeros value when an even number of parity error bits is set, a double bit error is indicated. In other embodiments, other predetermined values of the first syndrome code may indicate a double bit error.
In addition, in general the second global syndrome code is generated such that, with knowledge of the specific logical groups that have a single bit error, a value indicative of the location of the error in such groups may be derived from the global syndrome code. The overall number of bits forming the global syndrome code and the parity bits for each logical group is smaller than the overall number of bits needed for the error correction codes individually associated with the logical groups.
The system accommodates the detection of arbitrary double-bit errors in the data block while performing correction of errors due to component failures. Advantageously, the overall number of required check bits (the parity bits for the logical groups and the bits forming the first and second global error correction codes) may be smaller than the overall number of bits needed to implement a system using extended Hamming codes.