In Dynamic Random Access Memories (DRAMs), small weaknesses in some memory cells, or external disturbances such as electromagnetic or particle radiation, can cause unavoidable random bit-flips. The error rate typically increases with the age and use of the memory. Bit errors can result in system crashes, but even when they do not, they may cause severe problems: the error can linger in the system, cause incorrect calculations, and spread into further data. This is especially problematic in certain applications, e.g., financial, medical, and automotive. The corrupted data can also propagate to storage media and grow to an extent that is difficult to diagnose and recover from. Most DRAM errors are transient and disappear after the system is rebooted, but the resulting damage lingers.
Accordingly, servers and other high-reliability environments currently integrate Error Correcting Code (ECC) into their memory subsystems to protect against the damage caused by such errors. ECC is typically used to enhance data integrity in error-prone or high-reliability systems. Workstations and computer server platforms have bolstered their data integrity for decades by adding ECC channels to their data buses. Mainstream computing devices such as home computers, tablets, and smart phones rely on the low baseline bit error rate of commodity DRAM and implement no robust error correction, if any. When a DRAM data failure occurs in one of those devices, it causes silent corruption or, potentially, a device crash that forces a reboot.
Typically, ECC adds a checksum that is stored with the data and enables detection and/or correction of bit failures. This error correction can be implemented, for example, by widening the data bus of the processor from 64 bits to 72 bits to accommodate an 8-bit checksum with every 64-bit word. The memory controller is typically equipped with logic to generate ECC checksums on writes and to verify and correct data read from the memory by using these checksums.
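As an illustration of the 64-to-72-bit scheme described above, the following is a minimal sketch of one possible checksum: an extended Hamming code that corrects any single-bit error and detects any double-bit error. The text does not specify which code is used, so the choice of code, the bit layout, and the names `encode` and `decode` are all assumptions made for this example.

```python
# Sketch of a (72,64) single-error-correct, double-error-detect code.
# Assumed layout (not taken from the text): bit 0 holds an overall parity
# bit, bits 1..71 form a Hamming code whose 7 parity bits sit at the
# power-of-two positions. 7 + 1 = 8 check bits per 64-bit word.

CODEWORD_BITS = 72
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
DATA_POSITIONS = [p for p in range(1, CODEWORD_BITS) if p not in PARITY_POSITIONS]


def _syndrome(word: int) -> int:
    """XOR of the positions (1..71) of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, CODEWORD_BITS):
        if (word >> pos) & 1:
            s ^= pos
    return s


def _odd_parity(word: int) -> bool:
    """True when the word contains an odd number of set bits."""
    return bin(word).count("1") % 2 == 1


def encode(data: int) -> int:
    """Expand a 64-bit value into a 72-bit codeword."""
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            word |= 1 << pos
    # Each parity bit at position p cancels bit p of the syndrome.
    s = _syndrome(word)
    for p in PARITY_POSITIONS:
        if s & p:
            word |= 1 << p
    # The overall parity bit makes the total number of set bits even.
    if _odd_parity(word):
        word |= 1
    return word


def decode(word: int):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    s = _syndrome(word)
    if _odd_parity(word):
        # Odd number of flipped bits: assume one, located at position s
        # (s == 0 means the overall parity bit itself was hit).
        if s >= CODEWORD_BITS:
            return None, "uncorrectable"
        word ^= 1 << s
        status = "corrected"
    elif s != 0:
        # Even parity but non-zero syndrome: two bits flipped.
        return None, "uncorrectable"
    else:
        status = "ok"
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status
```

In this sketch, a controller would call `encode` on every write and `decode` on every read. Flipping any one of the 72 stored bits still yields the original value with status `corrected`, while flipping two bits is reported as `uncorrectable` rather than silently returning wrong data.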
Until now, DRAMs have not performed any error correction internal to the DRAM device. DRAM ECC has always been performed externally, by adding more DRAM devices to create a wider channel that carries the check bits. As process nodes shrink, especially in the case of mobile applications, the stored charge per bit is becoming increasingly small and, therefore, more susceptible to both internal and external noise.
Non-volatile memories have an even higher likelihood of errors than DRAM. Flash devices, for example, include large numbers of additional bits per block to allow for the repair of errors. The repair itself, however, occurs in the flash memory controller, not in the flash memory itself.
Hence, the DRAM industry is preparing to add ECC internal to the DRAM as it approaches the 20 nm node and smaller process nodes. However, as DRAM vendors move towards integrating ECC, they are doing so only with the aim of correcting bits during a READ operation, not of repairing the internal arrays. Accordingly, for systems whose devices remain powered for long periods of time, there is still a risk of data corruption building up in the arrays and leading to uncorrectable errors.
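The difference between correcting bits only on a READ and repairing the array can be sketched with a toy model. It assumes an ECC that can fix at most one wrong bit per word and tracks only which bit positions are currently wrong; the names `StoredWord`, `read`, and `survives_two_flips` are hypothetical and do not describe any vendor's implementation.

```python
from dataclasses import dataclass, field


class UncorrectableError(Exception):
    """Raised when a word holds more bit errors than the modeled ECC can fix."""


@dataclass
class StoredWord:
    data: int                                  # value originally written
    flipped: set = field(default_factory=set)  # bit positions currently wrong

    def disturb(self, bit: int) -> None:
        """Model a random bit-flip; hitting the same bit twice restores it."""
        self.flipped ^= {bit}


def read(word: StoredWord, write_back: bool) -> int:
    """Return corrected data; optionally repair the stored copy as well."""
    if len(word.flipped) > 1:
        raise UncorrectableError("more than one bit error in the word")
    if word.flipped and write_back:
        # Repairing the array: the corrected value replaces the stored one.
        word.flipped.clear()
    # Without write_back the caller gets good data, but the array stays bad.
    return word.data


def survives_two_flips(write_back: bool) -> bool:
    """Two separate bit-flips in one word, with a read between them."""
    word = StoredWord(data=0xCAFE)
    word.disturb(3)
    read(word, write_back)   # first error is corrected for the reader
    word.disturb(17)         # a second, later error in the same word
    try:
        read(word, write_back)
    except UncorrectableError:
        return False
    return True
```

With `write_back=False`, the first error stays in the array, so the second flip leaves two wrong bits and the word can no longer be corrected. With `write_back=True`, each read starts from a repaired word and both errors are handled one at a time.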