The pervasive use of computers in modern day society places stringent requirements on their reliability. One area where this is especially important is in the storage, manipulation and transmission of financial and other commercially significant information. For example, it is vital for the integrity of the banking system, which maintains most of its records on-line, that spurious errors do not creep into these records.
It is not realistic to build computing systems in which, prima facie, mistakes never occur. For example, memory devices and other components are normally exposed to random hits from cosmic rays, or natural background radioactivity. Such events can cause a physical change in the relevant device that will result in a change in its logical binary state, possibly leading to an erroneous data value being stored or otherwise processed. Another potential vulnerability is to lightning, which may cause a power surge, or some other form of electromagnetic interference. Likewise, data communications can be corrupted by extraneous noise.
In order to guard against such circumstances, it is routine for systems to incorporate error detection and/or correction schemes. The underlying concept here is to add redundancy to the data, in order to allow internal consistency to be verified. A break in this internal consistency will then reveal the presence of an error. In some schemes, the presence of an error can be detected, but not rectified, while other schemes allow for the automatic correction of certain errors. In accordance with standard terminology, we will refer to Error Correcting Codes (ECCs) to cover systems that perform error detection and/or error correction (i.e. an ECC may only detect errors, not necessarily correct them).
As a further point of terminology, the use of ECCs is particularly common in relation to the transmission and storage of data, but can also be utilised during other forms of data manipulation as well. We will use the term (data) processing herein to encompass all such forms of data manipulation, storage and transmission where ECCs may be employed.
A simple example of an ECC is the well-known parity bit. Thus if our data word (or block) has N bits, then the number of 1s in the word is counted, and the parity bit is set so that in total (i.e. including the parity bit) there is an odd number of 1s. This leads to an augmented data (code) word of N+1 bits, comprising the original N-bit data word, plus the additional parity bit. If a single bit of the augmented N+1 bits is then corrupted somehow, so as to have the opposite polarity, this error can be detected, since the parity bit will no longer be correct. This is true irrespective of whether it is the original data word or the parity bit itself that is corrupted. (It is also known to use a parity scheme with even parity; i.e. the parity bit is selected so that the number of 1s in the augmented data word is even).
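By way of illustration, the odd-parity scheme just described can be sketched in a few lines of Python (the function names here are chosen purely for illustration):

```python
def add_odd_parity(bits):
    """Append a parity bit so the augmented word has an odd number of 1s."""
    parity = 1 if sum(bits) % 2 == 0 else 0
    return bits + [parity]

def check_odd_parity(codeword):
    """Return True if the codeword still has odd parity,
    i.e. no single-bit error is detected."""
    return sum(codeword) % 2 == 1
```

Flipping any single bit of the augmented word, whether a data bit or the parity bit itself, makes the parity check fail, exactly as described above.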
If we define the distance between two code words as the number of bit positions in which they differ, then use of a parity bit ensures that there is a minimum distance of two, i.e. all code words differ by at least two bits. This is because if we start with two data words that are identical apart from a single bit position, then they must also differ in their parity bit as well, hence the minimum distance of two. A consequence of this is that the use of a solitary parity bit allows single bit errors to be detected, since the change of a single bit cannot lead to another valid code word, given that there is a minimum distance of two. On the other hand, if there is a double bit error (i.e. two bits change value), then this cannot normally be detected, since this will transform one valid code word into another potentially valid code word.
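The minimum distance of two can be confirmed exhaustively for a small word size; the following Python sketch enumerates all sixteen odd-parity code words derived from a 4-bit data word and computes the smallest pairwise distance:

```python
from itertools import product

def hamming_distance(a, b):
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

# Build every odd-parity code word for a 4-bit data word, then verify
# exhaustively that any two distinct code words differ in at least two bits.
codewords = []
for word in product([0, 1], repeat=4):
    bits = list(word)
    parity = 1 if sum(bits) % 2 == 0 else 0
    codewords.append(bits + [parity])

min_dist = min(hamming_distance(a, b)
               for a in codewords for b in codewords if a != b)
```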
In addition, having a minimum distance of two implies that a single bit error, although detected, cannot generally be rectified. This is because the corrupted code word is now equidistant from (at least) two valid code words; hence there is no unique correction value. Looking at this another way, the use of a single parity bit does not allow us to identify the location of a detected error. (It will be appreciated that with binary data, once the location of a single bit error has been determined, there is only one possibility for the correction, i.e. to flip the relevant bit).
It is known that in order to be able to correct single bit errors, the number of parity bits can be increased. One such scheme involves 3 parity bits for a 4-bit word (b3b2b1b0). If one parity bit corresponds to b3b2b1, one parity bit to b3b1b0, and one parity bit to b2b1b0, a single bit failure anywhere in the augmented code word (of 7=4+3 bits) can be uniquely identified in terms of its location. This is feasible because there are 8 (=2³) possible outcomes of the three parity tests, and only seven possible locations of the error. The bit value at the identified location can then be reversed in order to restore the original data value. Another way of looking at this code is that it can be shown that there is a minimum distance of 3 between valid code words in this scheme. Consequently, any single bit error can always be corrected back to the closest code word. Alternatively, the code can be used to ensure that any two-bit error is detected, since such a two-bit error cannot lead to a valid code word (however, with this approach, the ability to correct single bit errors is sacrificed).
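The three-parity-bit scheme just described can be sketched in Python as follows; the syndrome table maps each non-zero outcome of the three parity tests to the unique bit position it identifies (positions 0 to 6 index b3, b2, b1, b0 and the three parity bits in that order, an ordering adopted here purely for illustration):

```python
def encode(b3, b2, b1, b0):
    """Encode a 4-bit word with the three parity groups described above
    (each parity bit makes its group even)."""
    p1 = b3 ^ b2 ^ b1
    p2 = b3 ^ b1 ^ b0
    p3 = b2 ^ b1 ^ b0
    return [b3, b2, b1, b0, p1, p2, p3]

# The 7 possible non-zero syndromes, one per correctable bit position.
SYNDROME_TO_POSITION = {
    (1, 1, 0): 0,  # b3
    (1, 0, 1): 1,  # b2
    (1, 1, 1): 2,  # b1
    (0, 1, 1): 3,  # b0
    (1, 0, 0): 4,  # p1
    (0, 1, 0): 5,  # p2
    (0, 0, 1): 6,  # p3
}

def correct(cw):
    """Recompute the three parity tests; a non-zero syndrome locates the
    single-bit error, which is then flipped back."""
    b3, b2, b1, b0, p1, p2, p3 = cw
    s = (b3 ^ b2 ^ b1 ^ p1, b3 ^ b1 ^ b0 ^ p2, b2 ^ b1 ^ b0 ^ p3)
    if s != (0, 0, 0):
        cw = cw.copy()
        cw[SYNDROME_TO_POSITION[s]] ^= 1
    return cw
```

Since the seven syndrome patterns in the table are all distinct, every single-bit error location is uniquely identified, as the text above explains.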
Of course, the ability to detect an error, and especially to correct an error, comes at the cost of increased redundancy. Thus in the three-parity-bit scheme above, only 4/7 of the code word is real data, with the rest of the code word being occupied by the parity protection. This then requires a corresponding increase in bandwidth to transmit the same amount of underlying data (or additional capacity for storage, and so on). It will be appreciated that in contrast, a single parity bit scheme has an efficiency of N/(N+1), where N (the length of the original data word before parity) can nominally be selected to give an arbitrarily high efficiency. Note however, that if N is made too large, the risk of an undetectable two-bit error increases, thereby undermining the whole effectiveness of the parity scheme.
The use of one or more parity bits can be generalised into the set of linear block binary codes, where a data word having k bits is encoded or mapped into a code word having n bits (n>k). This is known as an (n, k) code. A range of useful mappings has been mathematically derived on the basis of vector algebra and group theory. These include cyclic redundancy codes (CRCs), which provide a set of cyclically related code words. For a given set of data, the CRC is determined using a generator polynomial having certain mathematical properties. The correctness of the processed data can subsequently be confirmed by dividing the processed data by the same generator polynomial to calculate the (so-called) syndromes. A zero value for a syndrome indicates no errors, while a non-zero value implies a particular error or errors, depending on the specific non-zero value obtained. One advantage of CRCs is that the encoding/decoding can be performed by relatively simple digital electronics, hence their attraction in computing. Further details about CRC codes are widely available in the literature, see for example “Data Communications, Computer Networks and Open Systems” by Fred Halsall, 1995, Addison Wesley (ISBN 0-201-42293-X) (see especially section 3.4).
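The division-by-generator-polynomial procedure can be illustrated with a toy bitwise implementation in Python; the short generator x³+x+1 is chosen here purely as an example, whereas real CRC standards specify particular, longer polynomials:

```python
def crc_remainder(data_bits, generator):
    """Remainder of data_bits (list of 0/1, MSB first) divided by the
    generator polynomial (list of 0/1, MSB first), over GF(2)."""
    r = len(generator) - 1
    work = data_bits + [0] * r            # append r zero bits
    for i in range(len(data_bits)):
        if work[i] == 1:                  # XOR in the generator (GF(2) subtraction)
            for j, g in enumerate(generator):
                work[i + j] ^= g
    return work[-r:]                      # remainder = last r bits

def crc_encode(data_bits, generator):
    """Systematic encoding: the remainder is simply appended to the data."""
    return data_bits + crc_remainder(data_bits, generator)

def crc_syndrome_zero(codeword, generator):
    """Divide the received codeword by the same generator; a zero
    remainder (syndrome) indicates no detected error."""
    r = len(generator) - 1
    work = codeword.copy()
    for i in range(len(codeword) - r):
        if work[i] == 1:
            for j, g in enumerate(generator):
                work[i + j] ^= g
    return all(b == 0 for b in work[-r:])
```

Because encoding and checking reduce to shifts and XORs, the same procedure maps directly onto simple shift-register hardware, which is the attraction in computing noted above.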
One important known set of CRCs is the Bose-Chaudhuri-Hocquenghem (BCH) family of codes, which employs a particular type of generator polynomial. These codes are especially useful for correcting multiple errors. Related to the BCH codes are the Reed-Solomon (RS) codes, which can be used with non-binary data.
Note that a linear block code is regarded as systematic if the code word is formed by simply appending the ECC bits to the original data word (as with a parity bit). The advantage of this is clearly that the data can be quickly accessed, without having to perform any formal decoding; rather this would only be required to perform error correction/detection. In contrast, for a non-systematic scheme, no such simple decomposition of a code word into the data word and ECC bits is possible; rather the original data word can only be recovered by a full decoding operation. Note that many generator polynomials have both a systematic and non-systematic form.
In linear block codes, each code word is independent of all other code words, and so they can be individually decoded. (In fact, sometimes such codes are employed at a hierarchical level, such as one parity word per line of data, and then another parity word covering the whole page of data, including the per line parity words). However, in another important known form of coding, the value of a code word depends not only on the current input data word, but also on the previous data input(s). This type of coding, which maintains history or state information, is generally referred to as convolutional coding, and can be implemented in a straightforward manner by the use of electronic feedback circuits. Convolutional coding systems, modelled for example as Markov processes, are particularly used in data transmissions, where the data to be encoded naturally forms a sequence. An advantage of convolutional systems is that they offer particularly good robustness against noise, although the decoding techniques, which are typically based on maximum likelihood, can be rather complex. Further details about convolutional coding (and also linear block coding) can be found in “Error Correcting Coding Theory” by Man Young Rhee, McGraw-Hill, 1989 (ISBN 0-07-052061-5).
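As a minimal sketch of the shift-register structure mentioned above, the following Python function implements a rate-1/2 convolutional encoder with constraint length 3; the well-known (7,5) octal generator pair is used here purely as an example:

```python
def conv_encode(bits):
    """Rate-1/2 convolutional encoder, constraint length 3, with
    generator polynomials 111 and 101 in binary ((7,5) in octal).
    The state (the last two input bits) is held in a two-stage
    shift register, so each output pair depends on the current
    input bit AND the preceding inputs."""
    s1 = s2 = 0                       # shift-register state
    out = []
    for b in bits:
        out.append(b ^ s1 ^ s2)       # generator 111
        out.append(b ^ s2)            # generator 101
        s1, s2 = b, s1                # shift the register
    return out
```

Each input bit produces two output bits, and the dependence on s1 and s2 is exactly the history or state information that distinguishes convolutional codes from linear block codes.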
It will be appreciated therefore that there is a very wide range of known ECCs, and that the choice of a particular ECC for any given set of circumstances will depend upon (and usually be a trade-off between) a number of factors. Thus relevant issues in selecting an ECC include the loss of capacity due to increased redundancy, the amount of computational complexity and time that is available for encoding/decoding, the susceptibility of the system to errors in the first place, the most likely type of errors (random bits, bursts of consecutive bits, etc.), the nature of the processing (clearly convolutional codes are only available for sequential data processing), and the relevant importance of error detection vis-à-vis error correction (for example, in a communications network, it is often possible to request a re-transmission of erroneous data, rather than having to try to correct it).
In addition, there may be further influences on the choice of ECC, beyond pure error detection/correction. For example, it may be desirable for the output signal to have approximately equal numbers of zeros and ones in order to prevent a dc bias, or to avoid a long run of either zeros or ones, in order to minimise the risk of losing synchronisation. Similarly, an ECC can be designed to ensure that each code word has a consistent number of runs of ones in order to provide a form of self-clocking (this is often true in particular of bar codes).
A complex computer system may in fact employ a number of different ECCs in various parts of the system. The advantage of this is that it allows the ECC used in each location to be optimised according to the particular circumstances, as discussed above. However, this then requires the system to perform the necessary conversions between the different ECCs. For example, data transmitted over a bus to a network interface card may utilise one form of ECC, while another form of ECC is then employed for transmissions out over the network itself. Indeed, the interface card manufacturer may have no option but to perform such an ECC conversion in these circumstances, in that the network and bus protocols may be defined by separate standards, each of which requires its own particular ECC.
FIG. 1 schematically illustrates a system for performing ECC conversion, in which incoming data encoded in accordance with a first ECC (ECC1) is received and decoded by unit 210. This unit then outputs the data to encode unit 230, which recodes the data for onward transmission as code words (i.e. data plus a second ECC, namely ECC2). Of course, decode unit 210 will also process the incoming data to retrieve the ECC portions, in order to verify that the incoming data has been correctly received. It then sends the result of this ECC decoding to a control unit 220. The result typically indicates whether: (a) there was no error in the received data; (b) an error was detected in the received data, but the ECC has allowed the error to be successfully corrected; or (c) an error was detected using the ECC, but this error cannot be corrected. One way of implementing this is to have two binary lines from the decode unit 210 to the control unit 220, a first indicating the presence/absence of a correctable error, a second indicating the presence/absence of an uncorrectable error.
Once the control unit has received the error signal from decode unit 210, it can then generate an appropriate signal to send to encode unit 230. For example, if there is no error in the received data, or if the error has been corrected, then an enable signal can be supplied to the encode unit. On the other hand, no such enable signal (or a disable signal) is supplied if the decode unit 210 detected an uncorrectable error.
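The decode, control and encode arrangement of FIG. 1 can be sketched in Python as follows; decode_ecc1 and encode_ecc2 are hypothetical callables standing in for the hardware units 210 and 230, and the three-way status mirrors the signals described above:

```python
from enum import Enum

class DecodeStatus(Enum):
    NO_ERROR = 0        # received data verified correct
    CORRECTED = 1       # error detected and corrected by ECC1
    UNCORRECTABLE = 2   # error detected but beyond correction

def convert(codeword, decode_ecc1, encode_ecc2):
    """Sketch of the FIG. 1 pipeline: decode under ECC1, gate on the
    decode status (the control unit's role), then re-encode under ECC2.
    decode_ecc1 returns (data, DecodeStatus); encode_ecc2 takes data and
    returns the new codeword. Both are placeholders for hardware units."""
    data, status = decode_ecc1(codeword)
    if status is DecodeStatus.UNCORRECTABLE:
        return None     # encode unit is not enabled
    return encode_ecc2(data)
```

This mirrors the behaviour described above: the encode step proceeds when the data is correct or has been corrected, and is suppressed when an uncorrectable error is flagged.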
The control unit 220 may be designed to report or further investigate the presence of the error in the received data, even if this has been corrected. In addition, control unit 220 may in some implementations be omitted altogether, with decode unit 210 outputting the results of its ECC check directly to encode unit 230, where this can be used as the basis for an enable or disable signal. It will be appreciated of course that the control signals, however achieved, must be maintained in synchronism with the transfer of the corresponding data between decode unit 210 and encode unit 230.
Given that ECC operations are involved in a very wide range of system activities, it is important that they do not become a bottleneck. Thus decode unit 210 and encode unit 230 are generally implemented in hardware in order to ensure a sufficiently high processing speed and throughput. This ties in with the fact that ECC activities are normally performed at a very low level, as defined for example in multi-layer network communications models.
Note also that decode unit 210 and encode unit 230 are nearly always provided as separate circuits that perform discrete decode and then encode operations. Thus it is not generally realistic to try to perform an overall conversion directly from one ECC into another. One reason for this is the wide range of possible ECCs, so that it would be very difficult to cater for every single potential conversion requirement. Perhaps more importantly, most ECCs have been especially selected and designed so that encoding and/or decoding are highly efficient operations that can be easily implemented by digital electronics. In contrast, a direct conversion operation from one ECC into another ECC is likely to be far more complex and problematic. In other words, it is normally much faster and simpler to perform a decode followed by a separate encode, rather than to attempt a conversion from one ECC format directly into another ECC format.
Unfortunately however, there is a risk associated with the use of separate decode and encode facilities, in that it is possible for data to become corrupted in transit between the two. It will be appreciated that at this stage there is no ECC associated with the data, and so such corruption cannot be detected (and certainly not corrected). Consequently, the corrupted data will then be encoded by unit 230 as if it were the correct data, and subsequently passed onto other system components for further processing, thereby spreading the error.
In fact, this vulnerability extends generally to the output side of decode unit 210 (once ECC1 has been removed from the incoming data), as well as to the input side of encode unit 230 (prior to addition of ECC2 to the outgoing data). Thus any error introduced into the data between these two points, whether through device malfunction, some extraneous event, such as a cosmic ray, or any other source, may subsequently propagate throughout the system as apparently legitimate, but actually erroneous, data. It will be appreciated that this is a most undesirable possibility, and can potentially undermine all the care taken with ECCs in the remaining portion of the system.