The present invention generally relates to memory devices for use with computers and other processing apparatuses. More particularly, this invention relates to error checking and correction (ECC) in a memory subsystem of a processing apparatus.
Historically, most memory devices employed in computer systems have been “imperfect,” that is, individual memory cells of a memory device may exhibit flaws that result in data corruption. In most cases, the flaws are hardware-related, resulting in reproducible “hard errors” that occur every time data are stored in the respective memory cell. These cells can typically be remapped, that is, excluded from further use once recognized through testing. As a nonlimiting example, bad blocks containing flawed memory cells are typically present in NAND flash memory chips. Bad NAND flash memory blocks are typically recognized and flagged to be excluded from use for data storage. With better implementation of process technology, at least in the area of dynamic random access memory (DRAM) or static random access memory (SRAM), the occurrence of bad cells is minimal and can be mitigated by overprovisioning, that is, by providing spare rows or columns as substitutes for those having stuck bits.
A different type of error, generally referred to as a “soft error,” is not reproducible and occurs spontaneously. In this type of error, primarily attributed to a bitline being hit by alpha particles or cosmic rays, a given bit value in a memory cell can suddenly flip and cause data corruption even though the memory cell in question has functioned correctly in the past and subsequent tests do not show any errors at the same address. The frequency of soft errors is referred to as the soft error rate (SER) and depends on environmental factors such as elevation above sea level and/or the occurrence of alpha particles in the near environment, the process geometry and, last but not least, the shielding of the memory device. The mechanisms of soft-error generation are somewhat complex, but the general consensus is that primarily open pages are affected. That is, soft errors occur in pages wherein the passgate transistor to the DRAM cell is open or wherein the sense amplifiers are in the process of amplifying the charge of the cell in order to determine the bit value. Memory cells in closed pages, in contrast, are rather immune to soft errors since they are “protected” by the passgate transistor.
According to the above, soft errors occur primarily in areas of a memory array that are being accessed. Soft errors can also occur in parts of a memory array that are not programmed, in which case their occurrence would not only be inconsequential but would also go completely unnoticed. However, if a soft error occurs in an area of memory holding valid data, either instructions or data can be affected. In the first case, the error will most likely cause a program to crash. In the second case, the error will cause some data corruption. Depending on the application, a soft error (in most cases, a single altered bit) may result in the change of a pixel value in an image file, a change in geometry in a CAD file, an incorrect tone in an audio file, or an incorrect character in a document. Most of these errors will go unnoticed and have no further consequences because of their transient nature. However, in the case of documents or financial records, single-bit errors can have serious repercussions in that they may corrupt an entire database by propagating through iterative computations.
Particularly vulnerable applications include simulation and financial databases, where a single-bit error can cause a floating point shift or have other catastrophic consequences. In this type of environment, an error check, or rather a data integrity check, combined with an error correction mechanism is of utmost importance. The typical term for describing this mechanism is error checking and correction (ECC). In the case of system memories of personal computer systems, including PC architecture-based servers, ECC is implemented in a rather simple manner. In addition to the typically 64-bit-wide data bus, an extra 8-bit bus is added, in which each bit carries the checksum of one byte (8 bits) of data. The total bus width in this case is 72 bits. FIG. 1 schematically represents an example of a conventional ECC-enabled memory module 10, in which eight data memory integrated circuit (IC) devices (chips) 12 are arranged in an x8 configuration and combined with a “parity” chip 14 (x8) for the checksum bits. The module uses a total of seventy-two input/output (I/O) pins 16 to accommodate the sixty-four data bits and eight parity bits. When data are written to the memory module 10, a memory controller (not shown) performs an XOR operation to establish the checksum of each segment according to Hamming or other established codes, and the combined checksum of the entire transaction is stored in the parity chip 14. When the data are read back, the memory controller also requests the checksum from the parity chip 14 and, after reading the data, performs the same parity calculation as on the write and then compares the resulting checksum or parity value with the parity value read in the same transaction. In the simplest case, this will show whether there is any single-bit error. Depending on the algorithm used to generate the checksum, double-bit errors may also be detected and single-bit errors may be corrected “on the fly.”
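As a nonlimiting illustration of the write-side and read-side XOR calculations described above, the following Python sketch implements one well-known code of this kind, an extended Hamming code that protects 64 data bits with 8 check bits (72 bits in total), corrects any single-bit error, and detects any double-bit error. The function names, the bit layout, and the status strings are assumptions made for this example only; an actual memory controller performs the calculation in hardware and may use a different code and a different assignment of bits to chips.

```python
# Illustrative (72,64) extended Hamming code. Bit positions are hypothetical:
#   position 0          overall parity bit
#   positions 1,2,4,... Hamming check bits (powers of two, seven in total)
#   remaining positions the 64 data bits (3, 5, 6, 7, 9, ... 71)
DATA_POSITIONS = [p for p in range(1, 72) if p & (p - 1)]  # skip powers of two


def _syndrome(codeword):
    """XOR together the positions (1..71) of all set bits."""
    s = 0
    for p in range(1, 72):
        if (codeword >> p) & 1:
            s ^= p
    return s


def encode(data):
    """Return the 72-bit codeword for a 64-bit data word (write path)."""
    cw = 0
    for i, p in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            cw |= 1 << p
    # Set the check bits so that the syndrome of the codeword becomes zero.
    s = _syndrome(cw)
    for k in range(7):
        if (s >> k) & 1:
            cw |= 1 << (1 << k)
    # The overall parity bit makes the total number of set bits even.
    if bin(cw).count("1") & 1:
        cw |= 1
    return cw


def decode(codeword):
    """Return (data, status) for a 72-bit codeword (read path)."""
    s = _syndrome(codeword)
    odd = bin(codeword).count("1") & 1
    if s == 0 and not odd:
        status = "ok"
    elif odd:
        if s > 71:
            return None, "uncorrectable"
        # An odd number of flips is treated as a single-bit error at
        # position s; s == 0 means the overall parity bit itself flipped.
        codeword ^= 1 << s
        status = "corrected"
    else:
        # Even overall parity with a nonzero syndrome: two bits flipped.
        return None, "double-bit error detected"
    data = 0
    for i, p in enumerate(DATA_POSITIONS):
        if (codeword >> p) & 1:
            data |= 1 << i
    return data, status
```

In this sketch, `encode` stands in for the calculation performed when data are written, and `decode` stands in for the recalculation and comparison performed when the data are read back. Flipping any one bit of the value returned by `encode` and passing the result to `decode` yields the original word with the status `"corrected"`, whereas flipping any two bits yields the status `"double-bit error detected"`.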
The advantage of this type of ECC is its speed and the low overhead it places on the memory controller. However, a disadvantage is that the entire ECC functionality is very rigid and requires hardware resources, including extra data lines (traces) and additional memory chips, which go unused if ECC is turned off. Moreover, with increasing error rates, this type of ECC cannot be upgraded to better suit the system needs. Accordingly, a more flexible architecture and ECC implementation would have far-reaching benefits.