Data processing devices, such as computer servers, are sometimes used in environments where outages can cause major disruptions to operations. Such outages can be caused by memory failures. Accordingly, it is typically desirable to design the data processing device with sufficient redundancy so the device can continue operations even when a particular memory module fails. Some data processing devices employ error correcting codes (ECC) to improve memory reliability.
ECCs typically use Reed-Solomon codes, which over-sample a polynomial constructed from the data. The result of the polynomial evaluation is called the check field and is stored with the data in memory. The check field provides for reconstruction of the original data if part of the data, or the check field itself, is lost or garbled. Data is organized in groups of bits called symbols, and the loss of any or all bits in a symbol can be recovered. Typically, all data bits from each memory chip are fully contained in a symbol, so loss of any or all bits of a memory chip is fully recoverable. Memory chip width thus determines symbol size.
In particular, when a unit of data (referred to as a data word) is stored in memory, a memory controller calculates a set of checkbits (the check field) based on the value of the data being stored and stores the set of checkbits in memory along with the data. When the data word is requested from memory, the memory controller retrieves the data stored at the data word address and calculates a new set of checkbits from the retrieved data. The memory controller compares the new set of checkbits to the stored set of checkbits, whereby a difference between the sets indicates an error in the stored word. In the event of an error, the comparison of checkbits identifies the symbol in the data word where the error is located and which bits in that symbol are to be corrected.
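The store, recompute, and compare sequence described above can be illustrated with a minimal sketch. The sketch is not the code of any particular memory controller. It assumes 4-bit symbols over GF(16) with the primitive polynomial x^4 + x + 1, two check symbols, and at most 15 data symbols; the names checkbits and read_and_correct are hypothetical.

```python
# Sketch only: a single-symbol-correcting code over GF(16) with two check
# symbols. Assumptions (not from the source): primitive polynomial
# x^4 + x + 1, at most 15 four-bit data symbols, hypothetical function names.

# Build exponent/logarithm tables for GF(16).
EXP = [0] * 30
LOG = [0] * 16
_x = 1
for _i in range(15):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x10:
        _x ^= 0x13  # reduce modulo x^4 + x + 1
for _i in range(15, 30):
    EXP[_i] = EXP[_i - 15]


def gf_mul(a, b):
    """Multiply two GF(16) elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def checkbits(symbols):
    """Compute the two check symbols stored alongside the data word."""
    if len(symbols) > 15:
        raise ValueError("this sketch supports at most 15 symbols")
    c0 = 0
    c1 = 0
    for i, s in enumerate(symbols):
        c0 ^= s                   # plain sum of the symbols
        c1 ^= gf_mul(EXP[i], s)   # sum weighted by symbol position
    return c0, c1


def read_and_correct(stored_symbols, stored_check):
    """Recompute checkbits, compare with the stored set, fix one bad symbol."""
    new_check = checkbits(stored_symbols)
    s0 = new_check[0] ^ stored_check[0]
    s1 = new_check[1] ^ stored_check[1]
    data = list(stored_symbols)
    if s0 == 0 and s1 == 0:
        return data, None         # checkbit sets match: no error detected
    if s0 == 0 or s1 == 0:
        return data, "check"      # treated here as an error in the check field
    pos = (LOG[s1] - LOG[s0]) % 15  # which symbol is in error
    if pos >= len(data):
        raise ValueError("uncorrectable error")
    data[pos] ^= s0               # s0 gives the bits to flip in that symbol
    return data, pos
```

For example, corrupting symbol 5 of an eight-symbol word and calling read_and_correct with the originally stored checkbits returns the original word together with the position 5. An error that spans two symbols may be reported by this sketch as a single-symbol error at a wrong position, which is the kind of miscorrection discussed further below.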
The number of errors in a word that can be detected and corrected depends on the number of checkbits associated with the data word. The number of checkbits is determined by memory system geometry and is associated with intrinsic system characteristics such as cache line size. Cache line size cannot be changed without potentially affecting correct operation of existing programs. For example, in x86 servers with a 64-byte cache line size, two 9-byte (72-bit) memory channels are typically coupled to provide an 18-byte (144-bit) memory width. Memory chips typically provide data across a 4-beat burst, resulting in each access providing 72 bytes. The 72 bytes are organized as 64 bytes of data and 8 bytes (64 bits) of checkbits.
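The arithmetic of this example geometry can be checked with a short sketch. The constant names are illustrative only; the values are those given in the example above.

```python
# Sketch: arithmetic for the example x86 memory geometry described above.
# Constant names are illustrative; values come from the example in the text.
CHANNELS = 2                 # coupled memory channels
CHANNEL_WIDTH_BYTES = 9      # 72 bits per channel
BURST_BEATS = 4              # beats per access
CACHE_LINE_BYTES = 64        # data bytes per access

width_bytes = CHANNELS * CHANNEL_WIDTH_BYTES           # 18 bytes (144 bits)
access_bytes = width_bytes * BURST_BEATS               # 72 bytes per access
check_bytes = access_bytes - CACHE_LINE_BYTES          # 8 bytes of checkbits
check_bits_per_beat = check_bytes * 8 // BURST_BEATS   # 16 checkbits per beat
data_bits_per_beat = CACHE_LINE_BYTES * 8 // BURST_BEATS  # 128 data bits

print(width_bytes, access_bytes, check_bytes,
      check_bits_per_beat, data_bits_per_beat)
```

Each 144-bit beat therefore carries 128 data bits and 16 checkbits, so adding checkbits would require either a wider memory or fewer data bits per access.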
x86 servers employing 4-bit memory chips typically organize ECC with 16 checkbits for each 128-bit data word, with each data word including 36 data symbols with 4 bits per symbol. Codes are often designed with an additional symbol for RAS (Reliability, Availability, and Serviceability). Typical codes provide correction of all single-symbol errors and guarantee detection of all double-symbol errors, providing correction of all single memory chip failures and detection of additional single-bit errors. Increasing the symbol size to accommodate 8-bit memory chips results in 18 data symbols with 8 bits per symbol. Such an ECC is capable of correcting all single-symbol errors but cannot reliably detect all double-symbol errors. Theory shows that 6.67% of all double-symbol errors will be detected as a single-symbol error, resulting in an error misdetection and miscorrection. That value is too high to be acceptable in enterprise-class servers. Although error misdetection can be eliminated by increasing the number of checkbits associated with a data word, doing so undesirably increases memory size and is incompatible with cache line size.
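The effect of chip width on symbol count can be sketched as follows. The sketch assumes that the symbol counts above refer to the full 144-bit word (128 data bits plus 16 checkbits) and that each symbol is exactly one chip wide; the function name symbol_layout is hypothetical.

```python
# Sketch: symbol counts for a 144-bit word (128 data bits + 16 checkbits).
# Assumption: one symbol per memory chip, so symbol size equals chip width.
def symbol_layout(data_bits, check_bits, chip_width_bits):
    total = (data_bits + check_bits) // chip_width_bits
    check = check_bits // chip_width_bits
    return {"total": total, "check": check, "data": total - check}


x4 = symbol_layout(128, 16, 4)   # 4-bit memory chips
x8 = symbol_layout(128, 16, 8)   # 8-bit memory chips
print(x4)
print(x8)
```

Under these assumptions, 4-bit chips give 36 symbols of which 4 hold checkbits, while 8-bit chips give 18 symbols of which only 2 hold checkbits. The same 16 checkbits thus provide half as many check symbols when the symbol size doubles.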
The probability of error misdetection can also be reduced by interleaving the bits of multiple data words prior to transmitting the bits to the memory controller for error detection. The data words are reassembled at the memory controller for error detection and correction. Interleaving of the data words reduces the likelihood that a transmission error will cause multiple errors in a single data symbol. However, interleaving undesirably increases memory access latency. Accordingly, an improved method and device for correcting errors in stored data would be useful.
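The interleaving approach can be illustrated with a minimal sketch. The round-robin bit ordering and the function names interleave and deinterleave are assumptions chosen for illustration and do not represent any particular device.

```python
# Sketch: round-robin bit interleaving of several equal-width data words.
# Assumption: bit k of every word is sent before bit k+1 of any word.
def interleave(words, width):
    """Return a bit stream that cycles through the words bit by bit."""
    return [(w >> bit) & 1 for bit in range(width) for w in words]


def deinterleave(stream, count, width):
    """Reassemble the original words from an interleaved bit stream."""
    if len(stream) != count * width:
        raise ValueError("stream length does not match count * width")
    words = [0] * count
    for idx, b in enumerate(stream):
        words[idx % count] |= b << (idx // count)
    return words


words = [0xA5, 0x3C, 0xFF, 0x00]
stream = interleave(words, 8)

# Corrupt four consecutive transmitted bits (a burst error).
for idx in range(8, 12):
    stream[idx] ^= 1

received = deinterleave(stream, len(words), 8)
flipped = [bin(a ^ b).count("1") for a, b in zip(words, received)]
print(flipped)
```

In this sketch the four-bit burst is spread so that each reassembled word contains one flipped bit rather than one word containing all four. The reassembly step must wait for the bits of every interleaved word to arrive, which illustrates the added latency noted above.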