The present invention relates to an error detection and correction system and, more particularly, to a system for detecting and correcting calculation errors that occur in a computer processor or an arithmetic logic unit (ALU).
In the field of computer technology, much effort has been expended in attempting to improve and ensure integrity of data processing. Specifically, whenever data is transferred from one component of a computer system to another and whenever data is mathematically manipulated, there is a risk that resulting data will be inaccurate. In certain high performance computing systems, the risk is increased by the fact that a great number of data transfers or mathematical operations occur in a short period of time.
Almost since the inception of computer processors, error detection and correction mechanisms have been devised to help reduce the risk of inaccurate data transfer and manipulation. Heretofore, one of the conventional approaches to ensure data integrity has been to add a code to a data stream prior to transferring or arithmetically manipulating it. This approach has proven relatively successful, but only for certain types of operations.
One of the earliest methods for detecting errors during data transfers, for example, was the parity check code. A binary code word has odd parity if an odd number of its digits are 1's. For example, the number 1011 has three 1 digits and therefore has odd parity. Similarly, the binary code word 1100 has an even number of 1 digits and therefore has even parity.
A single parity check code is characterized by an additional check bit added to each data word to generate either odd or even parity. An error in a single digit or bit in a data word would be discernible since the parity check bit associated with that data word would then be reversed from what is expected. Typically, a parity generator adds the parity check bit to each word before transmission. This technique is called padding the data word. At the receiver, the digits in the word are tested and if the parity is incorrect, one of the bits in the data word is considered to be in error. When an error is detected at a receiver, a request for a repeat transmission can be given so that the error can be corrected. Only errors in an odd number of digits can be detected with a single parity check, since an even number of errors results in the parity expected for a correct transmission. Moreover, the specific bit in error cannot be identified by the parity check procedure as hereinabove described.
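The single parity check described above can be illustrated with a minimal sketch. The function names (parity_bit, add_parity, parity_ok) and the choice to place the check bit in the least significant position are assumptions made for illustration only and are not taken from any particular prior art system.

```python
def parity_bit(word: int, odd: bool = False) -> int:
    """Return the check bit that gives the word plus check bit even (or odd) parity."""
    ones = bin(word).count("1")
    even_bit = ones % 2          # makes the total number of 1's even
    return even_bit ^ 1 if odd else even_bit


def add_parity(word: int, odd: bool = False) -> int:
    """Append the check bit as the least significant bit (an assumed layout)."""
    return (word << 1) | parity_bit(word, odd)


def parity_ok(codeword: int, odd: bool = False) -> bool:
    """Test the received word: True if its parity is the expected one."""
    ones = bin(codeword).count("1")
    return ones % 2 == (1 if odd else 0)


sent = add_parity(0b1011)             # 1011 has three 1's, so the check bit is 1
one_error = sent ^ 0b00100            # a single flipped bit is detected
two_errors = sent ^ 0b00110           # two flipped bits restore the expected parity
print(parity_ok(sent), parity_ok(one_error), parity_ok(two_errors))
```

In this sketch the receiver learns only that some bit is wrong, not which one, which is consistent with the limitation noted above.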
A more sophisticated error detection system was later devised. Data words of a fixed length of bits were grouped into blocks of a fixed number of data words each. Parity checks were then performed between different data words as well as for each individual data word. The block parity code detected many patterns of errors and could be used not only for error detection, but also for error correction when an isolated error occurred in a given row and column of the matrix. While these geometric codes were an improvement over parity check bits per se, they still could not be used to detect errors that were even in number and symmetrical in two dimensions, such as four errors located at the corners of a rectangle within the block.
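A block (row and column) parity check of the kind described above might be sketched as follows. The block is modeled as a list of rows of bits, and the helper names row_col_parity and locate_single_error are hypothetical. For simplicity the sketch assumes that the stored parity bits themselves arrive intact.

```python
def row_col_parity(block):
    """Return (row parities, column parities) using even parity."""
    rows = [sum(row) % 2 for row in block]
    cols = [sum(col) % 2 for col in zip(*block)]
    return rows, cols


def locate_single_error(block, rows, cols):
    """Compare recomputed parities with the transmitted ones.

    Returns None if every check passes, (row, col) if exactly one row check
    and one column check fail, and raises ValueError for any other pattern.
    """
    new_rows, new_cols = row_col_parity(block)
    bad_rows = [i for i, (a, b) in enumerate(zip(rows, new_rows)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(cols, new_cols)) if a != b]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern cannot be corrected")


block = [[1, 0, 1, 1],
         [1, 1, 0, 0],
         [0, 1, 1, 0]]
rows, cols = row_col_parity(block)

received = [row[:] for row in block]
received[1][2] ^= 1                                   # isolated single error
print(locate_single_error(received, rows, cols))      # row and column of the error

symmetric = [row[:] for row in block]
for r, c in [(0, 0), (0, 3), (2, 0), (2, 3)]:         # four errors on a rectangle
    symmetric[r][c] ^= 1
print(locate_single_error(symmetric, rows, cols))     # all checks pass
```

The second case shows the limitation noted above: each affected row and column contains two errors, so every parity check still passes.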
After parity check codes and geometric codes were devised, a code was invented by Hamming, after whom it is named. The Hamming code is a system of multiple parity checks that encodes data words in a logical manner so that single errors can be not only detected but also identified for correction. A transmitted data word used in the Hamming code consists of the original data word and parity check digits appended thereto. Each of the required parity checks is performed upon specific bit positions of the transmitted word. The system enables the isolation of an erroneous digit, whether it is in one of the original data word bits or in one of the added parity check bits.
If all the parity check operations are performed successfully, the data word is assumed to be error free. If one or more of the check operations is unsuccessful, however, the single bit in error is uniquely determined by decoding so-called syndrome bits, which are derived from the parity check bits. Once again, only single bit errors can be both detected and corrected by use of the conventional Hamming code. Double bit errors, although detectable by the Hamming code, are not correctable.
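The multiple parity checks and syndrome decoding described above can be illustrated with the familiar (7,4) Hamming code, in which four data bits are protected by three check bits. The bit layout (check bits at positions 1, 2 and 4) is the textbook arrangement and is assumed here for illustration; the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode four data bits as [p1, p2, d1, p4, d2, d3, d4] (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4        # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4        # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_syndrome(code):
    """Return 0 if every check passes, else the 1-based position the checks point to."""
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]
    s4 = code[3] ^ code[4] ^ code[5] ^ code[6]
    return s1 + 2 * s2 + 4 * s4


def hamming74_correct(code):
    """Flip the bit named by the syndrome, assuming at most one bit is in error."""
    fixed = list(code)
    position = hamming74_syndrome(fixed)
    if position:
        fixed[position - 1] ^= 1
    return fixed, position


def hamming74_data(code):
    """Extract the four data bits from a seven-bit code word."""
    return [code[2], code[4], code[5], code[6]]


word = hamming74_encode([1, 0, 1, 1])
damaged = list(word)
damaged[4] ^= 1                                   # single error at position 5
repaired, position = hamming74_correct(damaged)
print(position, hamming74_data(repaired))
```

If two bits are flipped, the syndrome in this sketch is still nonzero, so an error is signaled, but the syndrome points to a third position and the attempted correction yields wrong data.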
The Hamming code is only one of a number of codes, generically called error correcting codes (ECC's). Codes are usually described in mathematics as closed sets of values that comprise all the allowed number sequences in the code. In data communications, transmitted numbers are essentially random data patterns which are not related to any predetermined code set. The sequence of data, then, is forced into compliance with the code set by adding check bits to it at the transmitter, as hereinabove mentioned. A scheme has heretofore been developed to determine what precise extra string to append to the original data stream to make the concatenation of transmitted data a valid code. There is a consistent way of extracting the original data from the code value at the receiver and of delivering the actual data to the location where it is ultimately used. For the code scheme to be effective, it must contain allowed values sufficiently different from one another so that expected errors do not alter an allowed value such that it becomes a different allowed value of the code.
A cyclic redundancy code (CRC) consists of strings of binary data evenly divisible by a generator polynomial, which is a selected number that results in a code set of values different enough from one another to achieve a low probability of an undetected error. To determine what to append to the string of original data, the original string is divided as it is being transmitted. When the last data bit is passed, the remainder from the division is the required string that is added since the string including the remainder is evenly divisible by the generator polynomial. Because the generator polynomial is of a known length, the remainder added to the original string is also of fixed length.
At the receiver, the incoming string is divided by the generator polynomial. If the incoming string does not divide evenly, an error is assumed to have occurred. If the incoming string is divided by the generator polynomial evenly, the data delivered to the ultimate destination is the incoming data with the fixed length remainder field removed.
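The division by a generator polynomial at the transmitter and at the receiver can be sketched with integers standing in for bit strings. The generator 1011 (that is, x^3 + x + 1) and the function names are assumptions chosen for illustration; practical systems use longer standardized generators and typically perform the division bit-serially in hardware as the data is transmitted.

```python
def mod2_remainder(value: int, generator: int) -> int:
    """Remainder of modulo-2 (carry-less) division of value by generator."""
    glen = generator.bit_length()
    while value.bit_length() >= glen:
        # Align the generator under the leading 1 of value and subtract (XOR).
        value ^= generator << (value.bit_length() - glen)
    return value


def crc_encode(data: int, generator: int) -> int:
    """Append the fixed-length remainder so the result divides evenly."""
    check_len = generator.bit_length() - 1
    shifted = data << check_len
    return shifted | mod2_remainder(shifted, generator)


def crc_check(codeword: int, generator: int) -> bool:
    """Receiver test: an even division is taken to mean no error occurred."""
    return mod2_remainder(codeword, generator) == 0


def crc_extract(codeword: int, generator: int) -> int:
    """Remove the fixed-length remainder field to recover the data."""
    return codeword >> (generator.bit_length() - 1)


GENERATOR = 0b1011                       # assumed example generator
codeword = crc_encode(0b1101, GENERATOR)
print(bin(codeword), crc_check(codeword, GENERATOR))
print(crc_check(codeword ^ 0b0010000, GENERATOR))   # a flipped bit is detected
```

Because the remainder is always shorter than the generator, the appended field in this sketch has a fixed length of three bits, consistent with the fixed-length remainder described above.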
It has been found, however, that appending or concatenating a code to data to be transferred or arithmetically manipulated is burdensome, requiring additional and often expensive logic to accomplish. Moreover, the time required to generate the code on the transferring end and to decode and verify the code on the receiving end is, in certain cases, unacceptable. In the case of data manipulation and verification of proper ALU operation especially, additional codes result in inefficient performance.
Moreover, the aforementioned error detection and/or correction systems have been used most generally in transmitting and receiving data, rather than in acting on data mathematically. Thus, the communications channels were tested, but the computing engines were not. Techniques for correcting errors in arithmetic operations have conventionally been relegated merely to reperforming the same operations on the same processor or on other processors.
Finally, due to an inevitable comparison step in the detection cycle of the processes of the prior art, correction of errors could occur only some appreciable time thereafter, an untenable situation for high speed processing units.
It would be advantageous to provide a system for detecting errors in arithmetic operations without the need of composing and decoding a code appended to a data stream.
It would also be advantageous to provide a system for detecting and correcting errors in arithmetic operations by using a minimum amount of logic.
It would also be advantageous to provide a system for detecting and correcting errors in arithmetic operations in a short period of time (e.g., one or two clock cycles).
It would further be advantageous to provide a system for detecting and correcting errors in arithmetic operations that would provide a signal indicating that an error occurred therein, while the results of such arithmetic operations could nevertheless be corrected automatically.
It would also be advantageous to provide a system for preempting incorrect results of an arithmetic operation with correct results therefor.
It would also be advantageous to provide a system for detecting and correcting errors in arithmetic operations in which two arithmetic logic units could calculate the arithmetic operation independently and by different techniques, thus arriving at verifiable accurate results.
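The general idea of computing one arithmetic operation independently by two different techniques and comparing the results can be illustrated in software. The following is only a conceptual sketch under assumed names (add_native, add_ripple, checked_add) and an assumed 8-bit width; it is not a description of the system of the invention or of any hardware arrangement.

```python
def add_native(a: int, b: int, width: int = 8) -> int:
    """First technique: ordinary integer addition, truncated to the word width."""
    return (a + b) & ((1 << width) - 1)


def add_ripple(a: int, b: int, width: int = 8) -> int:
    """Second technique: repeated XOR and carry propagation, without using '+'."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    while b:
        carry = (a & b) << 1     # bit positions that generate a carry
        a = (a ^ b) & mask       # sum without the carries
        b = carry & mask         # carries still to be added
    return a


def checked_add(a, b, width=8, primary=add_native, secondary=add_ripple):
    """Return (primary result, error flag); the flag is set when the results differ."""
    first = primary(a, b, width)
    second = secondary(a, b, width)
    return first, first != second


print(checked_add(200, 100))

# A deliberately faulty primary unit (hypothetical) that corrupts the low bit.
faulty = lambda a, b, width: add_native(a, b, width) ^ 1
print(checked_add(200, 100, primary=faulty))
```

A comparison of only two results signals that an error occurred but does not by itself indicate which result is correct; this sketch makes no attempt to correct the result.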