Current software tends to be intolerant of any hardware error. Hardware should be designed with a mean-time-between-failure (MTBF) that has a very large value, even in the presence of physical behaviors that cause errors at higher rates. For memories, this is typically achieved by adding redundant information to storage in the form of an error correcting code that can correct the expected patterns of errors.
In addition, software and users want a much larger mean time between undetected errors. It is better to have the system fail in a predictable way than to use erroneous data silently (for instance, in managing a bank account). Recovery from errors that are detected but not corrected by hardware can occur at some higher level of software, or through a full system restart.
Ideally, every error would be detected, but that requires too much additional redundant storage. Note that the undetected error rate can never go to zero. The design of codes for DRAM error protection is a balance between predicted error modes, error correction capability to match a system MTBF goal, and error detection capability to match an undetected error goal.
Digital memories are susceptible to errors caused by a variety of sources. Cosmic radiation can flip the state of individual memory cells. Pattern-sensitive capacitive coupling, noise, and hardware failures such as shorts can occur, causing multiple bits to be read incorrectly. Sometimes entire memory chips can fail. When a memory contains several memory chips, such as on a memory module, a one-chip failure may produce a multi-bit error, such as a 4-bit error in a 72-bit memory word.
Additional bits are often included in the memory for storing an error-correction code (ECC). These additional ECC bits can be used to detect an error in the data bits being read, and can sometimes be used to correct those errors. Typically, a code is selected such that the data is unmodified. Error detection and correction is performed by comparing the check bits read against the correct check bits for that data. Such a code is considered in “systematic form”.
Various codes can be used for the ECC bits, such as the well-known Hamming codes. A class of codes known as Single-byte Error-Correcting/Double-byte Error-Detecting (SbEC/DbED) codes can correct any number of errors within a single “byte” and detect pairs of such errors. The “byte” may be a length other than 8 bits. For example, an S4EC/D4ED code can correct 4-bit (nibble) errors, and detect but not correct 8-bit (2 nibble) errors. These codes are especially useful since they can detect double-chip errors where all 4 bits output by two different memory chips are faulty. Single-chip errors can be corrected.
A SbEC/DbED code with 3*b check bits can be used with up to b*(2**b+2) total bits (data+check). These are known as Reed-Solomon SbEC/DbED codes. When b=4, the total is 72 bits, so only a relatively small number of data bits can be used (60). To increase the allowed number of data bits, 4*b check bits are typically used, such as 128 data bits with 16 check bits. The increased number of check bits allows a larger number of data bits to be used.
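The length bound quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch; the function names max_total_bits and max_data_bits are hypothetical and are not part of any described apparatus.

```python
def max_total_bits(b: int) -> int:
    """Maximum code length (data + check bits) for a Reed-Solomon SbEC/DbED
    code with 3*b check bits, using the bound b*(2**b + 2) stated in the text."""
    return b * (2**b + 2)


def max_data_bits(b: int) -> int:
    """Data bits remaining after the 3*b check bits are subtracted."""
    return max_total_bits(b) - 3 * b


if __name__ == "__main__":
    b = 4  # nibble-wide "byte"
    print(max_total_bits(b))  # 72
    print(max_data_bits(b))   # 60
```

For b=4 this reproduces the 60 data bits mentioned above, which is why a 128-bit data word needs the larger 4*b check-bit arrangement.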
While such S4EC/D4ED codes are useful for protecting against failures in whole memory chips, and in the wires to and from the memory chips, failures can also occur in the address lines to one or more of the memory chips. For example, a solder connection to an address pin of one of the memory chips might start failing after some time. Many memory chips use multiplexed addresses, where the address is applied over the same address lines in two parts, a row address part and a column address part. A single solder connection can thus cause two bits of the address to be faulty. It is desirable to protect against such 2-bit address errors. Some of the memory errors may be caused by cosmic radiation. This may cause a wrong address to be read from within the memory chip. This address may be wrong in an unknown number of bits.
As memory sizes increase, more and more address bits are used. Protecting these larger addresses against errors becomes more important.
FIG. 1 shows a prior-art memory with data ECC and address parity. Write data is stored in data RAM 10, while ECC generator 16 calculates the ECC bits that correspond to the value of the data bits being written into data RAM 10. These data ECC bits are written into data ECC RAM 12 at the same write-address W_ADR as the data.
During reading, the read address R_ADR is applied to read out data from data RAM 10 and data ECC bits from data ECC RAM 12. Read ECC generator 20 regenerates an ECC value from the data being read from data RAM 10. The new ECC value from read ECC generator 20 is compared to the stored ECC bits from data ECC RAM 12 by ECC checker 24 to determine if any errors occurred in the read data. A data error can be signaled when the stored ECC does not match the re-generated ECC. Some of these data errors may be corrected by an ECC corrector (not shown).
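The regenerate-and-compare check on the read path can be sketched as follows. This is a minimal illustration only: toy_check_bits is a hypothetical stand-in (a simple XOR fold of 16-bit slices) and is NOT an S4EC/D4ED code; it only shows how a stored check value is compared with one regenerated from the read data.

```python
def toy_check_bits(data: int, width: int = 128) -> int:
    """Hypothetical stand-in for an ECC generator: XOR of the 16-bit slices
    of the data word. A real S4EC/D4ED generator would be used instead."""
    check = 0
    for shift in range(0, width, 16):
        check ^= (data >> shift) & 0xFFFF
    return check


def read_syndrome(data_read: int, stored_check: int) -> int:
    """Regenerate check bits from the data read and XOR them with the stored
    check bits. A zero result means the two values match."""
    return toy_check_bits(data_read) ^ stored_check


if __name__ == "__main__":
    data = 0x0123456789ABCDEF0011223344556677
    stored = toy_check_bits(data)           # written alongside the data
    print(read_syndrome(data, stored))      # 0: no error detected
    corrupted = data ^ (1 << 77)            # flip one data bit
    print(read_syndrome(corrupted, stored) != 0)  # True: error signaled
```

With a real correcting code, the nonzero comparison result would also identify which bits to repair; the stand-in above can only signal a mismatch.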
To protect against errors in the address, the write address W_ADR is applied to parity generator 18, which generates the parity of the write address. The generated address parity is then stored in address parity RAM 14 at the write address.
During reading, the stored address parity is read from address parity RAM 14, while the parity of the read address R_ADR is generated by read parity generator 22. The generated read-address parity is compared to the stored parity from address parity RAM 14 by parity comparator 26. When the parity values mismatch, an address error is signaled. The memory read can be re-tried several times before a failure is signaled.
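The address-parity check of FIG. 1 can be modeled in a few lines. This is a sketch under stated assumptions: a single even-parity bit per address is assumed, and decoded_adr is a hypothetical parameter modeling the address the RAM actually used after a fault, so that the stored parity is that of the location really accessed.

```python
def parity(value: int) -> int:
    """Even-parity bit of a non-negative integer (XOR of all its bits)."""
    return bin(value).count("1") & 1


def address_error(r_adr: int, decoded_adr: int) -> bool:
    """Return True when an address error would be signaled.

    r_adr       -- the intended read address R_ADR
    decoded_adr -- assumption: the address the RAM actually decoded
    """
    stored = parity(decoded_adr)   # read from address parity RAM 14
    regenerated = parity(r_adr)    # read parity generator 22
    return stored != regenerated   # parity comparator 26


if __name__ == "__main__":
    adr = 0x1234
    print(address_error(adr, adr))           # False: no fault
    print(address_error(adr, adr ^ 0x0010))  # True: 1-bit address fault
    print(address_error(adr, adr ^ 0x0101))  # False: 2-bit fault goes unseen
```

The last case shows the limitation of a single parity bit: any fault that flips an even number of address bits leaves the parity unchanged.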
FIG. 2 shows address parity concatenated with data ECC bits. The address parity and data ECC bits can be stored in separate RAMs, or can be concatenated and stored in the same RAM. A data word of 128 bits may need 16 data ECC bits to correct errors up to 4 bits in a nibble and to detect pairs of such errors in separate nibbles. A 32-bit address protected with a standard Hamming code would need 6 bits, allowing detection of all 1 and 2 bit errors in the address. Thus a total of 22 check bits are needed to protect against both address and data errors.
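The check-bit count for the address follows from the standard Hamming condition that r check bits cover k information bits when 2**r >= k + r + 1. The sketch below only illustrates that arithmetic; the function name hamming_check_bits is hypothetical.

```python
def hamming_check_bits(k: int) -> int:
    """Smallest r such that 2**r >= k + r + 1 (single-error-correcting
    Hamming code over k information bits)."""
    r = 1
    while 2**r < k + r + 1:
        r += 1
    return r


if __name__ == "__main__":
    address_bits = 32
    data_ecc_bits = 16  # S4EC/D4ED over 128 data bits, as stated in the text
    adr_check = hamming_check_bits(address_bits)
    print(adr_check)                  # 6
    print(data_ecc_bits + adr_check)  # 22
```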
Some memories may lack a sufficient width to store all of the check bits. For example, there may only be space for 16 check bits. It may be undesirable to reduce the number of data ECC bits to fit in some address parity bits. There are trade-offs among the number of check bits and expense of the memory system, the largest multi-bit data error that can be corrected and detected, and the degree of detection of address errors. Adding additional check bits for the address parity is often undesirable. Reducing the number of address check bits can reduce detection for multi-bit address errors. The use of multiplexed address bits causes 2-bit address errors to be as likely as 1-bit address errors in a real system.
The address parity bits could be exclusive-OR'ed (XOR'ed) into the data ECC bits. This has the advantage of not requiring additional check bits. However, if the address has a parity error, the extracted data ECC bits may not be able to correct an otherwise correctable data error. Thus some data correction ability may be lost. This happens if the address error causes an error syndrome to be created that matches the error syndrome for an otherwise correctable data error.
The parent application solved this problem by generating a more complex cyclical-redundancy-check (CRC) code. CRC codes are characterized by a generator polynomial. CRC codes have well-known benefits for increased error coverage, for a given number of check bits. The benefits include better coverage for random numbers of errors, and better coverage for errors that occur in consecutive bits (bursts).
The address CRC bits were merged into two nibbles of the data ECC bits. Since the address check bits were merged with the data ECC bits, additional bits were not needed for storing the address check bits.
FIG. 3 shows generation of a combined data and address check word according to the parent application. Data to be written to memory is input to data ECC generator 32. In this example 16 bytes (128 bits) of write data W_DATA are input, but other widths are contemplated. Data ECC generator 32 generates an S4EC/D4ED ECC code that can correct errors of 1-4 bits within one nibble, and detect but not correct errors from two groups of 1-4 bits in the 128-bit data. Various strategies are used to generate this type of ECC code. Data ECC generator 32 outputs 16-bit data ECC codeword 36, which has four nibbles DE3, DE2, DE1, DE0.
The address to write the data to, W_ADR, is a 32-bit address. The write address is applied to CRC-code generator 34, which uses a generator polynomial to operate on the address, which is also represented as a polynomial, to generate a 4-bit output, labeled AE, address error check bits 38. The CRC generation is performed in modulo-2 arithmetic, which causes the logic function to be a series of XOR's.
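The modulo-2 division performed by a CRC generator can be sketched as follows. The text does not name the generator polynomial, so x^4 + x + 1 is used here purely as an assumption for illustration; the function name crc4 is likewise hypothetical.

```python
def crc4(address: int, width: int = 32, poly: int = 0b10011) -> int:
    """Remainder of address(x) * x**4 divided by poly(x) over GF(2).

    poly defaults to x^4 + x + 1, an assumed example polynomial; the
    polynomial actually used is not given in the text.
    """
    reg = address << 4  # append four zero bits for the remainder
    # Cancel each set bit from the top down by XOR-ing in the aligned polynomial.
    for bit in range(width + 3, 3, -1):
        if reg & (1 << bit):
            reg ^= poly << (bit - 4)
    return reg & 0xF  # 4-bit address error check bits AE


if __name__ == "__main__":
    print(crc4(0))  # 0
    print(crc4(1))  # 3
    print(format(crc4(0x1234ABCD), "04b"))
```

Because every step is an XOR, the result is linear: the check bits of two addresses XOR-ed together equal the XOR of their check bits, which is why the hardware reduces to a tree of XOR gates.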
Address error check bits 38 (AE) are merged with two of the four nibbles of data ECC codeword 36. XOR gates 44 merge the 4 bits of address error check bits 38 with the lowest-order nibble DE0 of data ECC codeword 36 to generate merged ECC nibble XE0 of merged ECC codeword 30. XOR gates 42 redundantly merge the 4 bits of address error check bits 38 with the next-lowest-order nibble DE1 of data ECC codeword 36 to generate merged ECC nibble XE1 of merged ECC codeword 30.
The upper two nibbles of data ECC codeword 36 are copied to the upper two nibbles of merged ECC codeword 30. Thus merged ECC codeword 30 contains two unaltered data ECC nibbles that contain only data ECC information and two merged nibbles that contain both data ECC and address check information.
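The nibble merge of FIG. 3 can be written compactly with integer XOR operations. This is a minimal sketch: the 16-bit codeword is assumed to be packed with DE3 in the most significant nibble and DE0 in the least significant nibble, and the function name merge_ecc is hypothetical.

```python
def merge_ecc(data_ecc: int, ae: int) -> int:
    """XOR the 4-bit address check bits AE into nibbles DE0 and DE1 of a
    16-bit data ECC codeword, leaving DE3 and DE2 unchanged.

    Assumed packing: bits 15..12 = DE3, 11..8 = DE2, 7..4 = DE1, 3..0 = DE0.
    """
    ae &= 0xF
    return (data_ecc & 0xFFFF) ^ ae ^ (ae << 4)


if __name__ == "__main__":
    merged = merge_ecc(0xABCD, 0x5)
    print(hex(merged))                   # 0xab98
    # XOR is its own inverse, so merging the same AE again restores the ECC.
    print(hex(merge_ecc(merged, 0x5)))   # 0xabcd
```

Since XOR is self-inverse, applying the same operation with check bits regenerated from the read address recovers the original data ECC nibbles when the address is correct.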
While merging the address CRC with data ECC bits according to the parent application is useful, the particular code shown in the example of FIG. 3 has 16 check bits for 128 data bits. Some memories are accessible as 128-bit data words, but others are accessible as smaller 64-bit data words. These 64-data-bit memories may be constructed from smaller-width memory modules. A code optimized for these smaller memory modules is desirable. Separation of address and data check bits is also desirable to better trace back address errors.