Electric or magnetic interference, e.g., background radiation, can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state, which is a “soft” error condition. Because scaled-down manufacturing technologies increase DRAM density and shrink on-chip components, and because processing burdens keep growing, e.g., in computation-intensive applications at data centers, memory in current computer systems can suffer significantly from interference, resulting in serious error conditions.
Current computers use error correction coding (ECC) based on a Hamming code that corrects single-bit errors and detects double-bit errors within a codeword. That methodology falls short of the current reliability requirements of computer memory. When error counts increase, the central processing unit (CPU) slows down the memory's performance. Consequently, the computer system's performance can drop below 50 percent of the nominal standard.
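The single-error-correcting, double-error-detecting behavior described above can be illustrated with a minimal sketch. It assumes a toy 8-bit codeword protecting 4 data bits (an extended Hamming(7,4) code); real memory ECC typically protects much wider words, and the function names `encode` and `decode` are hypothetical, chosen only for this illustration.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (toy SECDED sketch)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d     # data bits sit at non-power-of-two positions
    code[1] = code[3] ^ code[5] ^ code[7]      # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]      # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]      # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2                # makes the parity of the whole word even
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the word is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                    # XOR of the positions of all set bits
    overall = sum(code) % 2                    # 0 when the whole-word parity is still even

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: exactly one bit flipped, at index `syndrome`
        # (a syndrome of 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return None, "double-bit error detected"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                               # simulate a single bit flip
    print(decode(word))
    word[2] ^= 1                               # a second flip in the same codeword
    print(decode(word))
```

In this sketch a third flipped bit would generally be miscorrected or missed, which illustrates why such a code becomes insufficient as error counts within a codeword grow.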
Furthermore, manufacturers apply conventional redundancy repair, which involves testing all memory cell arrays and replacing, for example, an erroneous bitline or wordline with a redundant line. But this type of redundancy repair is applied only once, at manufacturing time, before the memory is installed in a computer system. Thus, this procedure cannot repair hard errors that occur during online memory use.
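The idea of replacing a faulty line with a redundant one can be modeled as a simple remapping table. The following is a software sketch only, assuming row-level (wordline) repair and a fixed pool of spare rows; the class `RepairedArray` and its methods are hypothetical names for illustration and do not describe any particular manufacturer's repair flow.

```python
class RepairedArray:
    """Toy model of redundancy repair: faulty rows are remapped to spare rows."""

    def __init__(self, rows, spare_rows):
        self.rows = rows
        # Spare rows are numbered after the regular rows in this model.
        self.free_spares = list(range(rows, rows + spare_rows))
        self.remap = {}                        # faulty row -> spare row

    def repair(self, faulty_rows):
        """Assign a spare to each faulty row; return False once spares run out."""
        for row in faulty_rows:
            if not self.free_spares:
                return False                   # not enough redundancy left
            self.remap[row] = self.free_spares.pop(0)
        return True

    def physical_row(self, logical_row):
        """Resolve an access to the row that actually serves it."""
        return self.remap.get(logical_row, logical_row)


if __name__ == "__main__":
    array = RepairedArray(rows=8, spare_rows=2)
    array.repair([3])                          # row 3 failed the factory test
    print(array.physical_row(3), array.physical_row(4))
```

In this model, `repair` stands in for the one-time step performed during testing; a row that fails later, after the table is fixed or the spares are exhausted, has nowhere to go, which mirrors the limitation described above.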
In summary, current fault tolerance methods such as ECC and redundancy repair are expected to fix memory errors, but they have limited capability. Existing fault tolerance methods cannot meet the overwhelming demand at current levels of memory usage, particularly the levels driven by computation-intensive applications in data centers.