The present invention relates to a method and system for detecting errors in a memory subsystem and, more particularly, to a system for determining whether such detected errors are hard or soft errors.
In any digital system where data is stored, transmitted or received, one or more of the data bytes may be received and stored in error. In fact, a data value may even be corrupted spontaneously due to impingement of alpha particles on the memory device or due to other random, unpredictable events. This has been a problem from the time data processing systems were first invented.
As more sophisticated data processing operations are performed, involving more complex equipment, there is a greater need for systems to detect and correct stored errors. For example, operations such as merging files, sorting of data within files, numerical/statistical analyses, complex data handling procedures and word processing operations require increased reliability in data transfer and storage. If data errors occur and are undetected, valuable information and system operation itself may be affected. Thus, error detecting and correcting features are not only advantageous, they are required to improve system integrity.
In response to the problem of error generation, systems have been developed to detect such errors. One of the earliest methods for detecting errors was the parity check code. A binary code word has odd parity if an odd number of its digits are 1's. For example, the number 1011 has three 1 digits and therefore has odd parity. Similarly, the binary code word 1100 has an even number of 1 digits and therefore has even parity.
A single parity check code is characterized by an additional check bit added to each data word to generate either odd or even parity. An error in a single digit or bit in a data word would be discernible since the parity check bit associated with that data word would then be reversed from what is expected. Typically, a parity generator adds the parity check bit to each word before transmission. This technique is called padding the data word. At the data requesting device or receiver, the digits in the word are tested and if the parity is incorrect, one of the bits in the data word is considered to be in error. When an error is detected at a receiver, a request for a repeat transmission can be given to memory so that the error can be corrected. Only errors in an odd number of digits can be detected with a single parity check, since an even number of errors results in the parity expected for a correct transmission. Moreover, the specific bit in error cannot be identified by the parity check procedure as hereinabove described.
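The single parity check described above can be illustrated with a minimal sketch. The function names parity_bit and check are hypothetical and are chosen only for this illustration; data words are assumed to be held as non-negative integers.

```python
def parity_bit(word: int, odd: bool = False) -> int:
    """Return the check bit that gives the word plus check bit even (or odd) parity."""
    ones = bin(word).count("1")
    bit = ones % 2              # this value makes the total number of 1's even
    return bit ^ 1 if odd else bit


def check(word: int, check_bit: int, odd: bool = False) -> bool:
    """Return True when the received word and its check bit have the expected parity."""
    return parity_bit(word, odd) == check_bit


word = 0b1011                   # three 1 digits, as in the example in the text
p = parity_bit(word)            # even-parity check bit for this word

single_error = word ^ 0b0100    # one bit reversed in transit
double_error = word ^ 0b0110    # two bits reversed in transit

print(check(word, p))           # error-free word passes the check
print(check(single_error, p))   # an odd number of errors fails the check
print(check(double_error, p))   # an even number of errors passes unnoticed
```

The last call shows the limitation noted above: two reversed bits leave the parity as expected, so the error is not detected, and in no case does the check identify which bit is in error.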
A more sophisticated error detection system was later devised. Data words of a fixed length of bits were grouped into blocks of a fixed number of data words each. Parity checks were then performed between different data words as well as for each individual data word. The block parity code detected many patterns of errors and could be used not only for error detection, but also for error correction when an isolated error occurred in a given row and column of the matrix. While these geometric codes were an improvement over parity check bits per se, they still could not be used to detect errors that were even in number and symmetrical in two dimensions.
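A block parity code of the kind described above may be sketched as follows, assuming even parity, one parity bit per data word (row) and one parity word across the block (columns). The names encode_block and locate_error are hypothetical and serve only this illustration.

```python
from functools import reduce
from operator import xor


def parity(word):
    """Even-parity bit of a single data word."""
    return bin(word).count("1") % 2


def encode_block(words):
    """Return (row parities, column parity word) for a block of data words."""
    rows = [parity(w) for w in words]
    cols = reduce(xor, words, 0)    # bit i holds the parity of column i
    return rows, cols


def locate_error(words, rows, cols):
    """Return None if all checks pass, (row, bit mask) for an isolated
    single-bit error, or raise ValueError for any other failing pattern."""
    new_rows, new_cols = encode_block(words)
    bad_rows = [i for i, (a, b) in enumerate(zip(rows, new_rows)) if a != b]
    col_diff = cols ^ new_cols
    if not bad_rows and col_diff == 0:
        return None
    if len(bad_rows) == 1 and bin(col_diff).count("1") == 1:
        return bad_rows[0], col_diff
    raise ValueError("error detected but not correctable")


block = [0b1011, 0b1100, 0b0110, 0b0001]
rows, cols = encode_block(block)

# An isolated error is located by its failing row and column, then corrected.
received = list(block)
received[2] ^= 0b0100
row, mask = locate_error(received, rows, cols)
received[row] ^= mask
print(received == block)

# Four errors at the corners of a rectangle leave every parity unchanged.
symmetric = list(block)
symmetric[0] ^= 0b0110
symmetric[3] ^= 0b0110
print(locate_error(symmetric, rows, cols))
```

The second case corresponds to the limitation stated above: errors that are even in number and symmetrical in two dimensions pass every row and column check.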
After parity check codes and geometric codes were devised, a code was invented by R. W. Hamming, after whom it is named. The Hamming code is a system of multiple parity checks that encodes data words in a logical manner so that single errors can be not only detected but also identified for correction. A transmitted data word used in the Hamming code consists of the original data word and parity check digits appended thereto. Each of the required parity checks is performed upon specific bit positions of the transmitted word. The system enables the isolation of an erroneous digit, whether it is in one of the original data word bits or in one of the added parity check bits.
If all the parity check operations are performed successfully, the data word is assumed to be error free. If one or more of the check operations is unsuccessful, however, the single bit in error is uniquely determined by decoding so-called syndrome bits, which are derived from the parity check bits. It should be noted once again that only single bit errors are both detected and corrected by use of the conventional Hamming code. Double bit errors, although they also cause one or more of the check operations to fail, are not correctable.
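Syndrome decoding can be illustrated with a minimal sketch of one common Hamming arrangement, assumed here for illustration only: a seven-bit word with even-parity check bits at positions 1, 2 and 4 and data bits at positions 3, 5, 6 and 7. The function names are hypothetical. In this arrangement the syndrome, read as a binary number, equals the position of the single bit in error.

```python
def hamming74_encode(data_bits):
    """Encode four data bits into a seven-bit word (positions 1 through 7)."""
    code = [0] * 8                          # index 0 unused so index equals position
    for pos, bit in zip((3, 5, 6, 7), data_bits):
        code[pos] = bit
    for p in (1, 2, 4):
        # check bit p covers every other position whose number has bit p set
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    return code[1:]


def syndrome(codeword):
    """XOR of the positions holding a 1; zero when every parity check passes."""
    s = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            s ^= pos
    return s


def correct(codeword):
    """Return (word with the bit named by the syndrome reversed, syndrome)."""
    s = syndrome(codeword)
    fixed = list(codeword)
    if s:
        fixed[s - 1] ^= 1
    return fixed, s


sent = hamming74_encode([1, 0, 1, 1])

one_error = list(sent)
one_error[4] ^= 1                           # reverse the bit at position 5
fixed_one, s_one = correct(one_error)
print(s_one, fixed_one == sent)             # syndrome names position 5

two_errors = list(sent)
two_errors[0] ^= 1                          # reverse the bits at positions 1 and 2
two_errors[1] ^= 1
fixed_two, s_two = correct(two_errors)
print(s_two != 0, fixed_two == sent)        # checks fail, but the repair is wrong
```

In the double-error case the syndrome is nonzero, so the failure is noticed, but the syndrome points at a third position and the attempted correction does not restore the original word.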
The Hamming code is only one of a number of codes, generically called error correcting codes (ECC's). Codes are usually described in mathematics as closed sets of values that comprise all the allowed number sequences in the code. In data communications and processing, transmitted or transferred numbers are essentially random data patterns which are not related to any predetermined code set. The sequence of data, then, is forced into compliance with the code set by appending additional bits to it at the transmitter, as hereinabove mentioned. A scheme has heretofore been developed to determine what precise extra string to append to the original data stream to make the concatenation of transmitted data a valid code. There is a consistent way of extracting the original data from the code value at the receiver and of delivering the actual data to the location where it is ultimately used. For the code scheme to be effective, it must contain allowed values sufficiently different from one another so that expected errors do not alter an allowed value such that it becomes a different allowed value of the code.
A cyclic redundancy code (CRC) consists of a string of binary data evenly divisible by a generator polynomial, which is a selected number that results in a code set of values different enough from one another to achieve a low probability of an undetected error. To determine what to append to the string of original data, the original string is divided by the generator polynomial as it is being transmitted. When the last data bit is passed, the remainder from the division is the required string that is added since the string including the remainder is evenly divisible by the generator polynomial. Because the generator polynomial is of a known length, the remainder added to the original string is also of fixed length.
At the receiver, the incoming string is divided by the generator polynomial. If the incoming string does not divide without remainder, an error is assumed to have occurred. If the incoming string is divided by the generator polynomial without remainder, the data delivered to the ultimate destination is the incoming data with the fixed length remainder field removed.
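The division performed at the transmitter and at the receiver can be sketched as follows. The sketch assumes modulo-2 (exclusive-OR) long division on strings of '0' and '1' characters, assumes that the data is extended by as many zero bits as the remainder is long before it is divided, and uses an arbitrary four-bit generator chosen only for illustration. The function names are hypothetical.

```python
def crc_remainder(bits, generator):
    """Modulo-2 long division; return the remainder, len(generator) - 1 bits long."""
    n = len(generator) - 1
    work = [int(b) for b in bits]
    gen = [int(b) for b in generator]
    for i in range(len(work) - n):
        if work[i]:                         # subtract (XOR) the generator here
            for j, g in enumerate(gen):
                work[i + j] ^= g
    return "".join(str(b) for b in work[-n:])


def crc_append(data, generator):
    """Transmitter side: append the fixed-length remainder to the data."""
    n = len(generator) - 1
    return data + crc_remainder(data + "0" * n, generator)


def crc_check(received, generator):
    """Receiver side: True when the string divides without remainder."""
    return "1" not in crc_remainder(received, generator)


GENERATOR = "1011"                          # illustrative generator only
data = "11010011101100"
sent = crc_append(data, GENERATOR)

corrupted = list(sent)
corrupted[5] = "1" if corrupted[5] == "0" else "0"   # reverse one bit in transit
corrupted = "".join(corrupted)

print(crc_check(sent, GENERATOR))           # divides evenly, so accepted
print(crc_check(corrupted, GENERATOR))      # nonzero remainder, so rejected
print(sent[:-(len(GENERATOR) - 1)] == data) # data recovered by removing the field
```

The last line mirrors the receiver's final step: when the division leaves no remainder, the delivered data is the incoming string with the fixed length remainder field removed.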
A so-called "soft" error (i.e., one that is correctable) could be the result of an alpha particle destroying the charge on a memory cell. This is known as a soft error because the data can be rewritten and the cell can perform properly again. A soft error is correctable via an Error Detection and Correction (EDAC) chip.
If the error is a result of a memory cell stuck at logic one or zero, the memory cell can no longer function properly and should be removed. This is known as a "hard" or uncorrectable error.
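The distinction between the two kinds of error can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of any particular apparatus: the class SimulatedMemory, its stuck-bit table and the function classify are all hypothetical, and the test simply rewrites the corrected word and reads it back.

```python
class SimulatedMemory:
    """Toy memory; `stuck` maps an address to (mask, value) of its stuck bits."""

    def __init__(self, size, stuck=None):
        self.cells = [0] * size
        self.stuck = stuck or {}

    def write(self, addr, word):
        self.cells[addr] = word

    def read(self, addr):
        word = self.cells[addr]
        mask, value = self.stuck.get(addr, (0, 0))
        return (word & ~mask) | (value & mask)   # stuck bits override stored bits

    def upset(self, addr, mask):
        """Reverse stored bits, as a random event such as a particle strike might."""
        self.cells[addr] ^= mask


def classify(mem, addr, corrected_word):
    """Rewrite the corrected word; the error is soft if it now reads back correctly."""
    mem.write(addr, corrected_word)
    return "soft" if mem.read(addr) == corrected_word else "hard"


# Bit 0 of address 2 is modelled as stuck at logic one.
mem = SimulatedMemory(4, stuck={2: (0b0001, 0b0001)})

mem.write(1, 0b1010)
mem.upset(1, 0b0100)                 # transient corruption of a healthy cell
print(classify(mem, 1, 0b1010))      # rewriting restores the data

mem.write(2, 0b1010)                 # the stuck cell cannot hold this value
print(classify(mem, 2, 0b1010))      # rewriting does not help
```

In this model the corrected word is assumed to be available already, for example from an error correcting code of the kind discussed above.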
Both types of errors are symptomatic of potential memory cell degradation in DRAMs. Distinguishing between soft and hard errors is helpful in determining their effects on system operation.
The probability of at least one error occurring during a predetermined time interval increases with the number of such memory devices or the size of overall memory, and does so approximately in direct proportion when the probability of error in any one device is small. Thus, larger, more complex computer systems are especially susceptible to soft errors due to random anomalies and are therefore also susceptible to defectively stored data.
U.S. Pat. Nos. 4,371,930 and 4,375,664, both issued to Kim, teach the use of error detection, correction and selective logging apparatus for single bit memory errors. Only single bit errors which the system determines most likely to be solid (hardware-related) errors are logged. The solid error situation occurs when two single bit errors have consecutive memory addresses.
U.S. Pat. Nos. 4,523,313 and 4,527,251, both issued to Chester M. Nibby, Jr., et al., teach the use of partially good memory devices in a semiconductor memory system. The defective portions of memory devices in the aforementioned references, however, are hard, uncorrectable portions, while the good portions are consistently reliable and identified as such before they are incorporated into the computer system. Identifying errors that may be correctable (soft errors) is beyond the scope of these references.
U.S. Pat. No. 4,535,455 issued to Peterson teaches a system for correcting and logging transient errors that occurred in a block of memory locations in a memory device. A microprocessor under the control of a program accesses and reaccesses each memory location, rewrites each word of readout information, again reads each word, and finally logs only transient errors in an error rate table.
It would be advantageous to detect errors.
It would also be advantageous to log errors for future analysis.
It would also be advantageous to provide a method for determining whether a detected error is a hard or a soft error.
It would also be advantageous to provide an indicator for detected errors.