1. Field of the Invention
This invention relates to correcting and detecting errors that may occur within a computer system, particularly within a memory device, and more particularly to systems where single bit correction supplemented with familial 1 through 4 bit correction and double bit word-wide detection is preferred, and even more particularly to 128 bit data words stored in 4 bit RAM devices.
2. Background Information
Dedicating memory to error correction code (ECC) space is expensive; therefore, compromises in the desire for perfect error correction and detection are needed. For sustainable commercial viability, one must still provide the largest computer systems in particular, and other RAM-based data storage systems generally, with appropriate compromises in error detection and correction. The path we chose uses ECC to make memory subsystems more reliable in three ways: by allowing a single multi-bit RAM device to fail and dynamically correcting that failure; by providing the same capability for any 1, 2, 3, or 4 bits within a 4 bit RAM family; and by detecting any 2 bits of non-familial error anywhere in the word. This capability corrects all single-bit and in-family 2, 3, or 4 bit errors on the fly to produce a corrected data word, and identifies as unfixed (unfixable) and corrupted those data words with any other errors or error types. It is our belief that these are the most likely errors and that our selected compromise is therefore valuable.
As RAM device densities and memory subsystem bandwidth requirements increased over time, there was more pressure on memory subsystem designers to use multi-data-bit RAM devices to meet their requirements. But doing so jeopardizes the reliability of a memory subsystem that relies on the standard Single Bit Correction/Double Bit Detection (SBC/DBD) of the past. As RAM device geometries become smaller and device failure rates increase, data words become more susceptible to failures that affect more than one bit in the device. Also, even though single bit errors are still the predominant failure mode of RAM devices, soft single-bit failure rates are increasing due to the shrinking geometries and the reliability characteristics of these devices. So it becomes more important to at least detect double bit errors from multiple devices, so that data corruption can be detected and safely handled. This invention provides for that protection. Providing enhanced error detection and enhanced error correction without the substantial cost increases that come from a higher ratio of redundant Error Correction Code (ECC) bits to information data bits is an additional goal of this invention.
There were two main methods of handling error correction and detection in the past. The predominant one was to create multiple SBC/DBD fields across the data word, and to route each bit of a RAM device to a separate SBC/DBD field. The issue with this method is the additional cost of the RAMs needed to support the extra check bits. For example, if a 128-bit data word needing protection were implemented using ×4 RAM devices, it would take 4 groups of 8 check bits to provide the same fault coverage as the proposed invention. These 32 check bits would be implemented in eight ×4 RAM devices. Our invention needs only 16 check bits, or four ×4 RAM devices, rather than 32 check bits in eight ×4 devices. For very large memories, the cost of that extra RAM is significant if not commercially prohibitive.
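The device counts in this comparison follow from simple arithmetic. The following is a minimal sketch, assuming only the figures quoted above (a 128-bit data word, ×4 devices, four interleaved fields of 8 check bits each, and 16 check bits for the combined code). The function and variable names are illustrative and do not come from this specification.

```python
DEVICE_WIDTH = 4   # data bits per x4 RAM device (figure from the text)
DATA_BITS = 128    # protected data word (figure from the text)

def devices_needed(bits, width=DEVICE_WIDTH):
    """Number of RAM devices needed to hold `bits` bits, rounded up."""
    return -(-bits // width)

# Interleaved SBC/DBD: 4 fields x 8 check bits, as stated in the text.
interleaved_check_bits = 4 * 8
# Combined code described in this document: 16 check bits.
combined_check_bits = 16

data_devices = devices_needed(DATA_BITS)
interleaved_devices = devices_needed(interleaved_check_bits)
combined_devices = devices_needed(combined_check_bits)

print("data devices:", data_devices)
print("check devices, interleaved SBC/DBD:", interleaved_devices)
print("check devices, combined code:", combined_devices)
print("check devices saved per word:", interleaved_devices - combined_devices)
```

Under these figures the interleaved approach needs twice as many check-bit devices per 128-bit word as the 16-check-bit code.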
Another method is to use 2 ECC fields, with each ECC field providing 2-bit “adjacency” correction. (The word “adjacency” in this art means within the family of bits of a given RAM device, not necessarily bits which are “next-to” each other.) This method would also need 4 RAM devices to implement the 2 groups of 8 check bits, and therefore would have the same cost. However, within each of the ECC fields, not all two-bit errors across multiple devices are detected. Therefore the cost is the same, but the method does not have the same reliability characteristics.
In our invention, multi-bit adjacent error correction, or Chip Kill, is merged with double bit nonadjacent error detection. This entails the ability to detect and correct failures within a single RAM device, and further to detect failures that have resulted from soft or hard errors of any single bit in any two RAM devices within the 128-bit word. No other solution has ever achieved this. A unique ECC table is used in our invention in conjunction with a specific RAM error definition table (for syndrome decode), neither of which is in the prior art.
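The coverage goals just stated can be expressed as a simple classification of error patterns. The following is a minimal sketch that models only those goals, not the ECC table or the syndrome decode. It assumes a 144-bit stored word (128 data bits plus 16 check bits) and assumes that bit i resides in device i // 4; that bit-to-device mapping and all names are hypothetical.

```python
DEVICE_WIDTH = 4        # bits per x4 RAM device
WORD_BITS = 128 + 16    # data bits plus check bits (figures from the text)

def classify(error_bits):
    """Classify a set of flipped bit positions against the stated coverage.

    Assumption: bit i lives in device i // DEVICE_WIDTH. The real
    bit-to-device assignment is not given in this section.
    """
    bits = set(error_bits)
    if any(b < 0 or b >= WORD_BITS for b in bits):
        raise ValueError("bit position outside the stored word")
    if not bits:
        return "no error"
    devices = {b // DEVICE_WIDTH for b in bits}
    if len(devices) == 1:
        # 1 to 4 bits, all within one RAM device (a "family")
        return "familial: correctable"
    if len(bits) == 2:
        # one bit in each of two different devices
        return "double bit non-familial: detectable"
    return "outside the coverage described here"

print(classify([5]))           # single bit
print(classify([4, 5, 6, 7]))  # whole device failed
print(classify([3, 4]))        # neighbouring bits, different devices
print(classify([0, 4, 8]))     # three devices affected
```

Note that bits 3 and 4 are numerically next to each other but, under the assumed mapping, belong to different devices, so they are not a familial error.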
Prior inventions did not allow for the level of reliability that is present with an error correction code feature which combines single bit error correction and multi-bit adjacent correction with double bit non-adjacent error detection, at least not with a small number of additional ECC-type bits. (ECC means Error Correcting Code and is a common abbreviation in this art.)
Thus, there is a need for error correction and detection at low memory cost and high reliability. Providing familial error correction captures the most likely multi-bit errors within a word: those that occur within a single DRAM or RAM device. Accordingly, by thinking of the problem in this way instead of trying to correct every possible error, we have designed an inventive and low cost error detection and correction system as set forth below.
There have been similar systems in the art, but these do not have all the advantages or requirements of our invention. Perhaps the closest reference is U.S. Pat. No. 6,018,817 issued to Chen et al., and incorporated herein by this reference in its entirety. Using same-sized (×4) RAM devices, the Chen '817 reference requires 12 ECC bits for each 72 data bits, while our invention meets sufficient reliability needs with only 16 bits of ECC for 128 data bits. (RAM is the generic term, which includes DRAM, and while our preferred implementation was developed on DRAM chips, other RAM devices can be used.) Further, Chen '817 requires 16 ECC bits per 72 data bits if ×8 RAM devices are used. Compared to either embodiment of Chen '817, our invention seems to produce more error checking and also possibly more error correction while requiring fewer ECC bits per data bit.
The specific code supporting the 12 bit ECC appears to be described in U.S. Pat. No. 5,757,823 (Chen '823), also incorporated herein by this reference. The cost benefit of an additional third in savings over Chen '823 will be appreciated by those of experience in these arts. As Chen mentioned at Col. 1, lines 40-52, even a 5% savings in memory commitment for main memory is very important to computer systems.
An additional patent of interest is Blake et al., U.S. Pat. No. 5,682,394, which shows a disablement feature and is also incorporated herein by this reference.
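The relative overheads can be checked directly from the bit counts quoted for each scheme. The following is a minimal sketch using exact fractions; the dictionary keys and function name are illustrative only, and the closing ratio is one possible reading of the "third" figure rather than a statement of how it was derived.

```python
from fractions import Fraction

# (ECC bits, data bits) as quoted in the text for each scheme
schemes = {
    "Chen x4": (12, 72),
    "Chen x8": (16, 72),
    "this invention x4": (16, 128),
}

def overhead(ecc_bits, data_bits):
    """ECC bits required per data bit, as an exact fraction."""
    return Fraction(ecc_bits, data_bits)

ratios = {name: overhead(*counts) for name, counts in schemes.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio} ({float(ratio):.1%} of data bits)")

# How much more ECC per data bit the 12-bit code needs than the 16/128 code
extra = ratios["Chen x4"] / ratios["this invention x4"] - 1
print("extra ECC per data bit for the 12-bit code:", extra)
```

Read this way, the 12-per-72 code carries one third more ECC bits per data bit than the 16-per-128 code, which is equivalently a one quarter reduction when moving from the former to the latter.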
Finally, Adboo et al., U.S. Pat. No. 5,490,155, also incorporated herein by this reference, describes a system for correcting ×4 DRAM errors. Adboo, as in our invention, uses 16 check bits for a 128-bit data word. However, Adboo requires that the check bits be produced by two identical parity trees for each 64 bits, wherein each parity tree has the same number of inputs, and the outputs are paired to correct up to four bit errors within a single DRAM or RAM device. Perhaps more importantly, Adboo can only detect and correct one single bit error in a word, one two-adjacent-bit error in a word, or four adjacent bit errors in a word. Adboo cannot detect two unrelated single bit errors, or a single bit error outside of a familial group having up to 4 bit errors, which our invention can do. As can be clearly seen with reference to Adboo's FIG. 9A, an error in two check bits that are unrelated or non-adjacent (or, actually, in many other pairs of unrelated bits) yields an uncorrectable and undetectable error. For an example of this failing of Adboo, note that the code for bit C4 is 0001 and the code for C7 is 1000. XORing these two values leads to the result 1001, which indicates that bit 0 is in error! Thus if both C4 and C7 are in error, the syndrome will indicate that bit 0 is in error. This is an unacceptable situation, even if such an occurrence may be a rare event, because two single bit errors are missed.
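The aliasing in this example can be reproduced in a few lines. The following is a minimal sketch that uses only the three code values quoted above (C4, C7, and bit 0); the rest of Adboo's FIG. 9A is not reproduced here, and the table, function names, and decode logic are hypothetical illustrations of a generic syndrome lookup rather than Adboo's circuit.

```python
# Codes quoted in the text; all other entries of the table are omitted.
CODES = {
    "C4": 0b0001,
    "C7": 0b1000,
    "bit 0": 0b1001,
}

def syndrome(failed_bits):
    """XOR together the codes of every bit that is in error."""
    s = 0
    for name in failed_bits:
        s ^= CODES[name]
    return s

def decode(s):
    """Look up which single bit a syndrome points at, if any."""
    if s == 0:
        return None  # no error indicated
    for name, code in CODES.items():
        if code == s:
            return name
    return "unrecognized"

s = syndrome(["C4", "C7"])
print(format(s, "04b"), "->", decode(s))
```

With both C4 and C7 in error, the lookup points at bit 0, so a decoder of this kind would "correct" a bit that was never wrong while leaving the two real errors in place.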
Accordingly, there is a need for stronger detection and correction of errors to improve the reliability of computer system memories, and to do so with a minimal amount of data. An error correction system and chip-kill type system, together with double bit non-familial error detection, will provide a most commercially useful solution to this technical problem.
We describe our invention with reference to the drawings in the summary and detailed description sections below, but limit its scope only by the appended claims.