The present disclosure generally relates to error correction in memories, and particularly to memory devices having adaptive multi-bit error correction capability and methods of operating the same.
A new class of emerging memory devices has been investigated to overcome the limitations on scaling in conventional dynamic random access memory (DRAM) devices. These memory devices are resistive memory devices, in which the state of a cell is determined by the resistance level of a variable resistance element. Examples of such resistive memory devices include phase change memory (PCM) devices, memristor devices, magnetic random access memory (MRAM) devices, and spin torque transfer random access memory (STT-RAM) devices. Resistive memory devices can provide increased scalability and higher density than traditional DRAM devices.
However, most resistive memory devices suffer from limited write endurance. Endurance is the maximum number of writes that a memory cell can tolerate before failure. For example, PCM devices typically provide only up to about 10^8 write operations. During the operation of a typical PCM device, each write causes the chalcogenide alloy to expand and contract as the cell changes state, which increases the probability that the material physically detaches from the heating element and leaves the cell permanently stuck at one value. For this reason, stuck-at fault errors, which are hard errors in which the state of a memory cell is stuck at a single state irrespective of any write operations performed on the cell, are generally more prevalent than transient faults in resistive memory devices.
Endurance variation in resistive memory devices tends to have no spatial correlation among neighboring cells. Endurance variation increases with technology scaling, i.e., with the decrease in the dimensions of the memory device. Without error correction mechanisms, the weakest cell dictates the lifetime of a memory device. Error correction mechanisms are necessary to extend the lifetime of a memory device beyond the first cell failure. Because wear-out related faults gradually accumulate with time, the single-bit error recovery schemes that are currently in place are not sufficient. Therefore, multi-bit error recovery is needed to extend the lifetime of a memory device further.
Hamming coding is one of the error correction methods known in the art. The original (72,64) Hamming code was devised for recovering from transient faults. For Single Error Correction Double Error Detection (SECDED), Hamming coding requires 12.5% of the size of each memory block as overhead for error correction code (ECC) bits, yet it can correct only one error.
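The 12.5% figure follows from the standard Hamming bound plus one overall parity bit. The following is a minimal sketch of that arithmetic only; the function name `secded_check_bits` is a hypothetical name chosen for illustration and does not come from any cited scheme.

```python
def secded_check_bits(data_bits: int) -> int:
    """Number of check bits a Hamming SECDED code needs for data_bits data bits."""
    r = 0
    # Single error correction requires 2**r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One extra overall parity bit adds double error detection.
    return r + 1


data_bits = 64
check_bits = secded_check_bits(data_bits)
overhead = check_bits / data_bits
print(check_bits, f"{overhead:.1%}")  # prints: 8 12.5%
```

For 64 data bits this yields 8 check bits, i.e., the (72,64) code referred to above.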
Error-Correcting Pointer (ECP) was published by Schechter et al., “Use ECP, not ECC, for hard failures in resistive memories,” in Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010. The ECP method uses multiple fail pointers for each data block. Each fail pointer holds the address of a failed bit in the given data block, together with an additional bit that stores the correct value. ECP schemes that recover from 6 fails with a 61-bit overhead (11.9% of a 512-bit data block) are known in the art.
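The fail-pointer idea can be illustrated with a short sketch. This is a simplified model under stated assumptions, not the published implementation: the class `EcpBlock` and its method names are hypothetical, a 512-bit data block is assumed (which is what the 11.9% figure implies), and the one extra status bit in the overhead count is taken from the usual accounting of 61 bits = 6 x (9 + 1) + 1.

```python
from math import ceil, log2


class EcpBlock:
    """Toy model of one data block protected by fail pointers (hypothetical names)."""

    def __init__(self, size=512, max_pointers=6):
        self.size = size
        self.max_pointers = max_pointers
        self.pointers = {}  # failed bit address -> correct value for that bit

    def record_fail(self, address, correct_value):
        """Allocate (or update) a fail pointer for a stuck cell."""
        if address not in self.pointers and len(self.pointers) >= self.max_pointers:
            raise RuntimeError("no free fail pointer: block is unrecoverable")
        self.pointers[address] = correct_value

    def read(self, raw_bits):
        """Return the block contents with every pointed-to bit replaced."""
        corrected = list(raw_bits)
        for address, value in self.pointers.items():
            corrected[address] = value
        return corrected

    def overhead_bits(self):
        """Pointer address bits plus one replacement bit each, plus one status bit."""
        address_bits = ceil(log2(self.size))
        return self.max_pointers * (address_bits + 1) + 1


block = EcpBlock()
raw = [0] * 512
raw[7] = 1                 # cell 7 is stuck at 1 although 0 was written
block.record_fail(7, 0)    # remember the intended value
recovered = block.read(raw)
print(recovered[7], block.overhead_bits())
```

With six pointers over a 512-bit block, `overhead_bits()` evaluates to 61, matching the overhead quoted above.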
Stuck-At Fault Error Recovery (SAFER) was published by Nak Hee Seong et al., “SAFER: Stuck-At-Fault Error Recovery for Memories,” in Proceedings of the International Symposium on Microarchitecture, 2010, which is incorporated herein by reference. SAFER handles the growing number of stuck-at faults through dynamic partitioning and data inversion, exploiting the readability and permanency of stuck-at faults. SAFER can provide recovery from a minimum of 6 fails to a maximum of 32 fails for a 512-bit memory block with a 55-bit overhead, which translates to 10.7% of the size of the memory block.
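The data inversion step can be sketched as follows. This is an illustrative simplification and not the SAFER algorithm itself: it assumes the block has already been partitioned so that each group contains at most one stuck cell, it omits the dynamic partitioning entirely, and the function names `write_group` and `read_group` are hypothetical.

```python
def write_group(data, stuck):
    """Store one group of bits in cells where at most one cell is stuck.

    data  -- list of 0/1 values to store
    stuck -- dict {cell index: stuck value}, at most one entry (assumption)
    Returns (stored_bits, inverted_flag).
    """
    if len(stuck) > 1:
        raise ValueError("group must be repartitioned: more than one stuck cell")
    # A stuck cell is still readable, so a mismatch can be detected after writing.
    inverted = any(data[i] != value for i, value in stuck.items())
    # Inverting the whole group makes the stuck value the desired stored value.
    stored = [bit ^ 1 for bit in data] if inverted else list(data)
    for i, value in stuck.items():
        stored[i] = value  # the stuck cell holds its stuck value regardless
    return stored, inverted


def read_group(stored, inverted):
    """Undo the inversion recorded by the per-group flag."""
    return [bit ^ 1 for bit in stored] if inverted else list(stored)


data = [1, 0, 1, 1]
stored, flag = write_group(data, {2: 0})   # cell 2 is stuck at 0, but data needs 1
print(stored, flag, read_group(stored, flag))
```

In this sketch one inversion flag per group is the only metadata; a complete scheme would also have to store the partition configuration, which is not modeled here.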