Magnetoresistive random-access memory (“MRAM”) is a non-volatile memory technology that stores data through magnetic storage elements. These elements are two ferromagnetic plates or electrodes that can hold a magnetic field and are separated by a non-magnetic material, such as a non-magnetic metal or insulator. This structure is known as a magnetic tunnel junction (“MTJ”). FIG. 1 illustrates an exemplary MRAM cell 110 comprising an MTJ 120. In general, one of the plates has its magnetization pinned (i.e., a “reference layer” or “fixed layer” 130), meaning that this layer has a higher coercivity than the other layer and requires a larger magnetic field or spin-polarized current to change the orientation of its magnetization. The second plate is typically referred to as the free layer 140, and its magnetization direction can be changed by a smaller magnetic field or spin-polarized current relative to the reference layer.
MRAM devices can store information by changing the orientation of the magnetization of the free layer. In particular, based on whether the free layer is in a parallel or anti-parallel alignment relative to the reference layer, either a “1” or a “0” can be stored in each MRAM cell as shown in FIG. 1. Due to the spin-polarized electron tunneling effect, the electrical resistance of the cell changes with the relative orientation of the magnetizations of the two layers. This change in resistance is typically referred to as tunnel magnetoresistance (TMR), which is a magnetoresistive effect that occurs in an MTJ. The cell's resistance will be different for the parallel and anti-parallel states and thus the cell's resistance can be used to distinguish between a “1” and a “0”. One important feature of MRAM devices is that they are non-volatile memory devices, since they maintain the information even when the power is off. The two plates can be sub-micron in lateral size and the magnetization direction can still be stable with respect to thermal fluctuations.
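The resistance-based read described above can be sketched in a few lines. This is only an illustration under stated assumptions: the resistance values, the midpoint threshold, and the mapping of the parallel state to “0” are hypothetical choices made for the example, and the function names are not taken from any real device or library.

```python
def tmr_ratio(r_parallel, r_antiparallel):
    """Relative resistance change between the two states (1.0 means 100%)."""
    return (r_antiparallel - r_parallel) / r_parallel


def read_bit(resistance, r_parallel, r_antiparallel, parallel_is_zero=True):
    """Classify a measured cell resistance as a stored 0 or 1.

    Assumption: the sense threshold sits midway between the two nominal
    resistances, and the anti-parallel state is the high-resistance state.
    """
    threshold = (r_parallel + r_antiparallel) / 2.0
    is_parallel = resistance < threshold
    if parallel_is_zero:
        return 0 if is_parallel else 1
    return 1 if is_parallel else 0


# Hypothetical nominal resistances in ohms, chosen only for illustration.
R_P, R_AP = 2000.0, 4000.0
ratio = tmr_ratio(R_P, R_AP)
low_state = read_bit(2100.0, R_P, R_AP)   # close to the parallel resistance
high_state = read_bit(3900.0, R_P, R_AP)  # close to the anti-parallel resistance
```

A larger TMR ratio widens the gap between the two resistance levels, which in this simplified model leaves more margin around the threshold for telling the two states apart.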
MRAM devices are considered to be the next-generation structures for a wide range of memory applications. MRAM products based on spin transfer torque switching are already making their way into large data storage devices. Spin transfer torque magnetic random access memory (“STT-MRAM”), such as the one illustrated in FIG. 1, or spin transfer switching, uses spin-aligned (“polarized”) electrons to change the magnetization orientation of the free layer in the magnetic tunnel junction. In general, electrons possess a spin, a quantized amount of angular momentum intrinsic to the electron. An electrical current is generally unpolarized, i.e., it consists of 50% spin up and 50% spin down electrons. Passing a current through a magnetic layer polarizes electrons with the spin orientation corresponding to the magnetization direction of the magnetic layer (e.g., polarizer), thus producing a spin-polarized current. If a spin-polarized current is passed to the magnetic region of a free layer in the magnetic tunnel junction device, the electrons will transfer a portion of their spin-angular momentum to the free layer, producing a torque on its magnetization. Thus, this spin transfer torque can switch the magnetization of the free layer, which, in effect, writes either a “1” or a “0” based on whether the free layer is in the parallel or anti-parallel state relative to the reference layer.
Spin transfer torque magnetic random access memory (“STT-MRAM”) has an inherently stochastic write mechanism, wherein bits have a certain probability of write failure on any given write cycle. The write failures are generally random and have a characteristic failure rate. A high write error rate (WER) may make the memory unreliable. The error rate can typically increase with age and increased use of the memory. Bit-errors can result in system crashes, but even if a bit-error does not result in a system crash, it may cause severe problems because the error can linger in the system, causing incorrect calculations and propagating into further data. This is problematic especially in certain applications, e.g., financial, medical, automotive, etc., and is generally commercially unacceptable. The corrupted data can also propagate to storage media and grow to an extent that is difficult to diagnose and recover from.
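To make the notion of a characteristic write error rate concrete, the following sketch models each bit write as an independent random trial. The independence assumption, the example WER values, and the function names are all hypothetical and serve only to illustrate how a per-bit WER translates into a per-word failure probability.

```python
import random


def word_failure_probability(wer, bits_per_word):
    """Probability that at least one bit in a word fails to write,
    assuming independent bit failures with the same per-bit WER."""
    return 1.0 - (1.0 - wer) ** bits_per_word


def simulate_wer(true_wer, num_writes, seed=0):
    """Estimate the WER by simulating num_writes independent bit writes."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    failures = sum(1 for _ in range(num_writes) if rng.random() < true_wer)
    return failures / num_writes


# Hypothetical per-bit WER; real values depend on the device and write pulse.
estimated = simulate_wer(true_wer=0.01, num_writes=100_000, seed=1)
per_word = word_failure_probability(wer=1e-6, bits_per_word=64)
```

Under this simple model, even a small per-bit WER is multiplied roughly by the word width when viewed per word, which is one way to see why a high WER can make the memory unreliable.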
Accordingly, servers and other high-reliability environments have conventionally integrated Error Correcting Code (ECC) into their memory subsystems to protect against the damage caused by such errors. ECC is typically used to enhance data integrity in error-prone or high-reliability systems. Workstations and computer server platforms have bolstered their data integrity for decades by adding additional ECC channels to their data buses.
Typically, ECC adds a checksum stored with the data that enables detection and/or correction of bit failures. This error correction can be implemented, for example, by widening the data-bus of the processor from 64 bits to 72 bits to accommodate an 8-bit checksum with every 64-bit word. The memory controller will typically be equipped with logic to generate ECC checksums and to verify and correct data read from the memory by using these checksums. In conventional memories using STT-MRAM, an error correcting code (ECC), e.g., BCH (Bose-Chaudhuri-Hocquenghem), is used to correct errors.
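As an illustration of how 8 check bits can protect a 64-bit word, the sketch below uses an extended Hamming code (single-error correction, double-error detection). This particular code is an assumption chosen for brevity; the text above does not specify which code a given controller uses, and the BCH codes it mentions are more involved. The function names and status strings are hypothetical.

```python
def encode(data_bits):
    """Extended Hamming encode: returns a list with the overall parity
    bit at index 0 and the Hamming codeword at indices 1..n."""
    k = len(data_bits)
    r = 0
    while (1 << r) < k + r + 1:  # smallest r with 2**r >= k + r + 1
        r += 1
    n = k + r
    code = [0] * (n + 1)
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two, so a data position
            code[pos] = next(it)
    for i in range(r):
        p = 1 << i  # check bit p covers every position with bit p set
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    code[0] = sum(code[1:]) % 2  # overall parity over the whole codeword
    return code


def decode(code):
    """Return (data_bits, status); status is 'ok', 'corrected',
    or 'uncorrectable'."""
    code = list(code)
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    overall = sum(code) % 2  # 0 when the overall parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome <= n:
        code[syndrome] ^= 1  # single error; the syndrome is its position
        status = "corrected"
    else:
        status = "uncorrectable"  # e.g. two flipped bits
    data = [code[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return data, status


word = [(i * 5 + 1) % 3 % 2 for i in range(64)]  # arbitrary 64-bit pattern
stored = encode(word)                            # 64 data + 8 check bits
stored[17] ^= 1                                  # inject a single bit error
recovered, status = decode(stored)
```

For a 64-bit word this construction needs 7 Hamming check bits plus 1 overall parity bit, which matches the 64-to-72-bit widening described above.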
While conventional error correction schemes, e.g., ECC, are effective, they have certain drawbacks. For example, error correction using ECC is not performed in real-time. In other words, the ECC correction may be performed during a read operation, but the error is not corrected as the data is written into the STT-MRAM memory cell.
Further, other conventional error correction schemes may require considerable overhead because the addresses/locations of all the bad bits in the memory chip need to be stored prior to performing the correction. The Content Addressable Memories (CAMs) required to store such addresses and locations occupy significant surface area and are expensive because of the high overhead involved in saving the bit addresses/locations for all the failing bits. Storing the address of each defective bit in a CAM also limits the number of addresses that can potentially be stored. Further, storing addresses of bad bits and then replacing them with good bits is also not an optimal scheme for STT-MRAM memories because the defect rate is typically high and too much memory would be required to store the addresses of all the bad bits. Also, this error mitigation scheme does not work for defects that are discovered on-the-fly (e.g., replacing the bad bits with good bits may have only happened at the tester phase in manufacturing).
Further, error correction schemes like ECC can typically detect and correct errors during a read operation, but they do not write the corrected data back into the memory array. This behavior causes the error to stay resident inside the memory array across multiple accesses and may contribute to a memory failure at a later time when additional errors occur. For example, if the memory is used for longer periods of time, there is an increased probability of a second failure occurring in the same ‘word’ as a first failure. The first failure may lie silently for years as the internal ECC logic repairs the error every time the word is read. When a second (or third or fourth . . . ) error hits the same word, the internal ECC circuitry is unable to repair the word and corrupted read data is provided to the system.
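The accumulation problem described above can be illustrated with a toy model of a single word protected by a code that corrects one bit. The class, its method names, and the `write_back` option are hypothetical and exist only to contrast the two behaviors; the sketch does not model any particular ECC circuit.

```python
class EccWord:
    """Toy model of one memory word protected by ECC."""

    def __init__(self, correctable_bits=1):
        self.correctable_bits = correctable_bits
        self.resident_errors = 0  # bit errors currently stored in the array

    def inject_error(self):
        """A new bit failure lands in this word."""
        self.resident_errors += 1

    def read(self, write_back=False):
        """Return True if the read data is correct after ECC.

        Without write_back, the corrected value is only delivered to the
        reader and the stored error remains in the array.
        """
        if self.resident_errors > self.correctable_bits:
            return False  # ECC capacity exceeded: corrupted data is returned
        if write_back:
            self.resident_errors = 0  # corrected data is stored again
        return True


# Read-only correction: the first error stays resident, the second is fatal.
plain = EccWord()
plain.inject_error()
first_read_ok = plain.read()
plain.inject_error()
second_read_ok = plain.read()

# Hypothetical contrast: the same sequence with corrected data written back.
scrubbed = EccWord()
scrubbed.inject_error()
scrubbed.read(write_back=True)
scrubbed.inject_error()
scrubbed_read_ok = scrubbed.read(write_back=True)
```

In the first sequence the word survives one read and then fails once a second error arrives, whereas in the second sequence each error is cleared before the next one occurs.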
Additionally, ECC is not efficient for correcting high fixed defect rates. This is particularly problematic for memories comprising STT-MRAM that typically have higher failure rates as compared to other memories. FIG. 2 illustrates the number of codewords with less than 1 bit of ECC left in reserve as a function of the defect rate. As seen in FIG. 2, for a 1% defect rate, using a BCH-3 ECC scheme, over 100 words need repair. Conventionally, ECC is appropriate for applications where the defect rates are approximately 50 parts per million (ppm) or less. For memories with higher defect rates, ECC and other error correction schemes become problematic. Accordingly, in memory applications comprising STT-MRAM where defect rates are higher, using only conventional error mitigation schemes like ECC results in inefficiencies.
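The dependence on defect rate can be sketched with a simple binomial model. The sketch assumes independent bit defects and uses a hypothetical codeword length and codeword count; it is not intended to reproduce the data of FIG. 2, whose parameters are not given here. A codeword is counted as having no ECC reserve left when its number of defective bits is at least the number of bits the code can correct.

```python
from math import comb


def prob_at_least(n_bits, defect_rate, k):
    """P(at least k defective bits among n_bits), assuming each bit is
    independently defective with probability defect_rate."""
    below = sum(
        comb(n_bits, i) * defect_rate**i * (1.0 - defect_rate) ** (n_bits - i)
        for i in range(k)
    )
    return 1.0 - below


def expected_exhausted_codewords(num_codewords, n_bits, defect_rate, t):
    """Expected number of codewords whose defects use up all t correctable
    bits, leaving less than 1 bit of ECC in reserve."""
    return num_codewords * prob_at_least(n_bits, defect_rate, t)


# Hypothetical parameters: 3-bit-correcting code, 280-bit codewords,
# 1,000,000 codewords. None of these values come from the source.
T, N_BITS, NUM_WORDS = 3, 280, 1_000_000
at_50_ppm = expected_exhausted_codewords(NUM_WORDS, N_BITS, 50e-6, T)
at_1_percent = expected_exhausted_codewords(NUM_WORDS, N_BITS, 0.01, T)
```

Comparing the two results for any chosen parameters shows how quickly the count of exhausted codewords grows as the defect rate rises from the ppm range toward the percent range.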