Error correction codes (ECC) such as SEC-DED (single error correction, double error detection) have been used successfully to protect main memory. However, traditional Hamming-code-based ECC is designed for a general fault model, and its overhead is unnecessarily large for the stuck-at fault model. This is especially true when the probability of multiple-bit errors is high, as is the case with resistive memories (e.g., phase-change memory (PCM), spin-transfer torque random-access memory (STT-RAM), and memristors). For example, many cells in a memory block might reach their write endurance limit simultaneously. Coping with that many faults would require a correspondingly stronger ECC, which would incur excessively large space and computation overheads. Indeed, NAND flash memory, which also suffers from limited write endurance, requires ECC that corrects 40 or more bits per 512-byte block. Consequently, recently proposed error-masking techniques for resistive memories combine microarchitectural and coding ideas to cut down these overheads.
ECC has been studied for many years. Among the many ECC schemes, SEC-DED is widely used to protect dynamic RAM (DRAM) in main memory. Since DRAM errors are typically transient and occur infrequently, SEC-DED is adequate in most situations. Resistive memories, on the other hand, have different failure mechanisms and are subject to multiple-bit faults that accumulate gradually over the lifetime of a chip. Consequently, a multi-bit error correction scheme is necessary. The BCH (Bose, Ray-Chaudhuri, and Hocquenghem) code, which generalizes the Hamming code, is one such scheme. Yet BCH-based codes are complex and expensive to implement; their complexity increases linearly with the number of faults to be tolerated.
Three recent proposals specifically target error masking in resistive memories and achieve higher auxiliary storage efficiency than traditional ECC techniques. First, ECP (Error Correcting Pointer) provides a limited number of programmable “correction entries.” A correction entry holds a pointer (address) to a faulty cell within the protected block and a “patch” cell that replaces the faulty one. When a faulty cell is detected, a new correction entry is allocated to cover it. A memory block is decommissioned when the number of faulty cells exceeds the number of correction entries. In essence, ECP provides cell-level spares to each block.
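The correction-entry mechanism described above can be sketched as follows. This is a minimal illustration, not the published ECP design: the class name `EcpBlock`, the `stuck` dictionary used to simulate stuck-at cells, and the assumption that patch cells themselves never fail are all introduced here for illustration only.

```python
class EcpBlock:
    """Toy model of one block protected by pointer-plus-patch entries."""

    def __init__(self, n_bits, n_entries, stuck=None):
        self.stuck = dict(stuck or {})      # position -> stuck-at value (simulated)
        self.cells = [self.stuck.get(i, 0) for i in range(n_bits)]
        self.n_entries = n_entries          # number of correction entries available
        self.entries = {}                   # pointer (cell position) -> patch bit
        self.retired = False

    def write(self, data):
        """Write data, verify it, and allocate entries for newly found faults."""
        for i, bit in enumerate(data):
            if i not in self.stuck:         # stuck cells ignore the write
                self.cells[i] = bit
        for i, bit in enumerate(data):
            if i in self.entries:
                self.entries[i] = bit       # known fault: value lives in the patch cell
            elif self.cells[i] != bit:      # read-after-write mismatch: new fault
                if len(self.entries) >= self.n_entries:
                    self.retired = True     # more faults than entries
                    return False
                self.entries[i] = bit
        return True

    def read(self):
        """Return the stored data with patch cells overriding faulty cells."""
        out = list(self.cells)
        for pos, bit in self.entries.items():
            out[pos] = bit
        return out
```

In this sketch, a fault is discovered only when the stuck value disagrees with the data being written, which is why entries are allocated lazily inside `write`.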
Second, SAFER (Stuck-at-Fault Error Recovery) dynamically partitions a protected data block into a number of groups so that each group contains at most one faulty cell. When the value at which a faulty cell is stuck differs from the value to be written, all cells in its group are written inverted and read back inverted. If the data block is to be partitioned into n groups, then SAFER allows log2(n) “repartitions.” Repartitioning is done whenever a new fault is detected. Therefore, SAFER guarantees recovery from log2(n) + 1 faults. Any additional fault is tolerated only if it occurs in a fault-free group; otherwise, the block has to be retired. SAFER was shown to provide stronger error correction than ECC or ECP at the same overhead level.
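The group-inversion idea can be illustrated with a small sketch. It is a simplification under stated assumptions: groups are formed by selecting bits of the cell address, and a suitable selection is found here by brute-force search rather than by the incremental repartitioning that SAFER performs. The function names (`group_of`, `find_partition`, `safer_write`, `safer_read`) and the `stuck` dictionary are hypothetical.

```python
from itertools import combinations


def group_of(pos, addr_bits):
    """Group index formed from the selected bits of a cell's address."""
    return sum(((pos >> b) & 1) << i for i, b in enumerate(addr_bits))


def find_partition(faulty, n_addr_bits, k):
    """Pick k address bits so that no two faulty cells share a group.

    Brute-force stand-in for repartitioning; returns None if no choice works.
    """
    for addr_bits in combinations(range(n_addr_bits), k):
        groups = [group_of(p, addr_bits) for p in faulty]
        if len(set(groups)) == len(groups):
            return addr_bits
    return None


def safer_write(data, stuck, addr_bits):
    """Return (stored cells, per-group inversion flags).

    Assumes each group holds at most one stuck cell (position -> stuck value).
    """
    flags = [0] * (1 << len(addr_bits))
    for pos, val in stuck.items():
        if data[pos] != val:                 # stuck value disagrees with data
            flags[group_of(pos, addr_bits)] = 1
    cells = []
    for pos, bit in enumerate(data):
        want = bit ^ flags[group_of(pos, addr_bits)]
        cells.append(stuck.get(pos, want))   # a stuck cell keeps its own value
    return cells, flags


def safer_read(cells, flags, addr_bits):
    """Undo the per-group inversion to recover the original data."""
    return [c ^ flags[group_of(p, addr_bits)] for p, c in enumerate(cells)]
```

With a 16-cell block (four address bits) split into n = 4 groups, three stuck cells can be separated, which matches the log2(n) + 1 = 3 guarantee stated above; five faults cannot fit into four groups, so `find_partition` returns `None` in that case.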
Free-p (Fine-grained Remapping with ECC and Embedded-Pointers) combines error correction and redundancy, and as such, has two protection layers. First, it uses an ECC to mask faults within a data block. Second, when a block becomes defective, Free-p embeds a pointer within the defective block so that a redundant, non-faulty block can be quickly identified without having to access a separate remapping table. Free-p employs ECC to correct up to four hard errors in a data block of cache line size and relies on the operating system (OS) to perform block remapping.
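The two protection layers can be sketched roughly as below. The sketch models only the control flow: the ECC layer is reduced to a per-block count of hard faults, and the embedded pointer is a plain field. The class names `Block` and `Memory`, the spare-block list, and the method names are assumptions made for illustration; in the scheme described above, remapping is carried out by the OS.

```python
MAX_CORRECTABLE = 4  # hard errors the first-layer ECC can mask, per the text


class Block:
    def __init__(self):
        self.data = None
        self.hard_faults = 0
        self.pointer = None  # set only once the block is defective


class Memory:
    """Toy memory with a few spare blocks appended after the regular ones."""

    def __init__(self, n_blocks, n_spares):
        self.blocks = [Block() for _ in range(n_blocks + n_spares)]
        self.free_spares = list(range(n_blocks, n_blocks + n_spares))

    def resolve(self, addr):
        """Follow embedded pointers; no separate remapping table is consulted."""
        while self.blocks[addr].pointer is not None:
            addr = self.blocks[addr].pointer
        return addr

    def add_fault(self, addr):
        """Record one more hard fault; remap once ECC can no longer mask them."""
        blk = self.blocks[self.resolve(addr)]
        blk.hard_faults += 1
        if blk.hard_faults > MAX_CORRECTABLE:
            if not self.free_spares:
                raise MemoryError("no spare blocks left")
            spare = self.free_spares.pop(0)
            self.blocks[spare].data = blk.data  # migrate the contents
            blk.pointer = spare                 # embed pointer in the dead block

    def write(self, addr, data):
        self.blocks[self.resolve(addr)].data = data

    def read(self, addr):
        return self.blocks[self.resolve(addr)].data
```

The point of the sketch is that `resolve` needs nothing beyond the defective block itself to locate the replacement.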
PAYG (Pay-As-You-Go) is a resilient architecture proposed to reduce the storage overhead of the auxiliary bits required by error correction schemes (e.g., ECP and SAFER) that target recovery from stuck-at faults. Essentially, PAYG moves from a uniform allocation of auxiliary bits across the protected memory blocks to a dynamic, on-demand allocation. PAYG exploits the variability in lifetime that memory blocks exhibit and assigns additional auxiliary bits to weaker blocks.
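The difference between uniform and on-demand allocation can be sketched as follows. This is an illustrative model only: the class `PaygPool`, the split into a small per-block allotment plus a shared pool, and the entry counts are assumptions, not details taken from the PAYG proposal.

```python
class PaygPool:
    """Toy model: a small local allotment per block plus a shared pool."""

    def __init__(self, n_blocks, local_entries, pool_entries):
        self.local = local_entries          # entries every block always has
        self.pool_free = pool_entries       # shared entries not yet handed out
        self.extra = [0] * n_blocks         # pool entries granted to each block
        self.faults = [0] * n_blocks        # faults seen so far in each block

    def report_fault(self, block):
        """Return True if the new fault can be covered, False otherwise."""
        self.faults[block] += 1
        if self.faults[block] <= self.local + self.extra[block]:
            return True
        if self.pool_free > 0:              # weak block: draw from the pool
            self.pool_free -= 1
            self.extra[block] += 1
            return True
        return False                        # pool exhausted


def uniform_entries(n_blocks, worst_case):
    """Entries needed if every block is provisioned for the worst case."""
    return n_blocks * worst_case


def payg_entries(n_blocks, local_entries, pool_entries):
    """Entries needed with a local allotment plus a shared pool."""
    return n_blocks * local_entries + pool_entries
```

For example, with four blocks, one local entry each, and a pool of three, a single weak block can absorb four faults using 7 entries in total, whereas provisioning every block for four faults would take 16.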
Although conventional techniques based on SAFER are superior to ECC and ECP, they remain limited in terms of the combination of overhead required and average number of faults tolerated before failure.