Referring to Prior Art FIG. 1, a functional block diagram of a prior art memory that uses scrubbing to correct detected errors is shown. Scrubbing is a method in which a background task periodically inspects memory for errors, using extra bits of information, that is, redundant information, added to the data itself to identify whether the data has any errors, and then corrects the error using a copy of the data. Scrubbing reduces the likelihood that single correctable errors will accumulate, thus reducing the risk of uncorrectable errors. Examples of redundant information include parity bits and error correction code (ECC) bits associated with that data. Hamming codes are popular ECC codes that can be used to detect and correct a single error (single error correction, SEC) in a word and, when extended with an additional overall parity bit, to perform double error detection (DED) as well. Such a code cannot perform double error correction, as there is insufficient information in the ECC to locate exactly which bits have the error. For example, a Hamming (7, 4) code encodes 4 data bits into 7 total bits, i.e., with 3 bits of parity, for SEC; adding one overall parity bit yields an (8, 4) code for a SECDED ECC. Scrubbing utilizes the ECC for SEC. A memory can be checked for errors by reading the data with the parity bits and operating the ECC algorithm to detect and correct a single-bit error. The corrected data, along with the parity bits, can then be written back into memory as corrected data, thus scrubbing out the original data error.
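The read, check, correct, and write-back sequence described above can be illustrated with a minimal sketch, assuming a Hamming (7, 4) code with the conventional bit layout p1 p2 d1 p3 d2 d3 d4 and a memory modeled as a list of 7-bit words. The names encode, syndrome, and scrub are hypothetical and are not taken from FIG. 1; the sketch shows SEC only and is not intended to represent any particular prior art implementation.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Assumed layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def syndrome(word):
    """Return 0 if the word checks clean, else the 1-based position
    of the bit in error (valid only for a single-bit error)."""
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s3 = word[3] ^ word[4] ^ word[5] ^ word[6]
    return s1 + 2 * s2 + 4 * s3


def scrub(memory):
    """One background scrub pass: read each word, correct a single-bit
    error in place (the write-back), and count the corrections made."""
    corrected = 0
    for word in memory:
        position = syndrome(word)
        if position:
            word[position - 1] ^= 1  # flip the faulty bit back
            corrected += 1
    return corrected


if __name__ == "__main__":
    memory = [encode([1, 0, 1, 1]), encode([0, 1, 1, 0])]
    memory[0][4] ^= 1  # simulate a soft error flipping one bit
    print("corrections on first pass:", scrub(memory))
    print("corrections on second pass:", scrub(memory))
```

In this sketch a second scrub pass finds nothing to correct, because the first pass has already written the corrected word back. With two flipped bits in the same word, the syndrome would point at the wrong position, which is why SEC alone cannot be relied upon for multi-bit errors.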
Scrubbing is useful for checking memory for single-bit errors, but it is not effective at correcting more than a single bit of the data associated with a given ECC. Single-bit errors might arise because of a weak memory cell, e.g., leaky gates, or due to a single event upset, e.g., a random alpha particle hit (APH) causing a soft error by flipping a bit. Scrubbing is helpful for resetting bits flipped by such random soft errors. A weak memory cell, however, while possibly intermittent, will return faulty data repeatedly. Even though ECC could correct for a single weak memory cell, there is a risk that a random soft error could appear in a word that also has a weak memory cell before the scrubbing corrects either one of the errors. This could result in two or more bit errors occurring in a data string associated with the ECC, resulting in an unrecoverable error. At that point, the errors for that given portion of data will not be correctable, and a frame or packet may be dropped, an interrupt or resend request may be needed, or, in the worst case, the system may crash. Examples of double-bit errors include one weak cell in the same portion of memory as another memory cell suffering an APH, or in the same portion of memory as a newly arising second weak cell.
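The double-bit failure mode described above can be illustrated with a minimal sketch, assuming an extended Hamming (8, 4) SECDED code, i.e., a Hamming (7, 4) codeword plus one overall parity bit. The function names encode_secded and check, the bit layout, and the choice of which bits play the weak cell and the soft error are all hypothetical and serve only to show how a single error is corrected while a second error in the same word is merely detected.

```python
def encode_secded(data_bits):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Assumed layout: p1 p2 d1 p3 d2 d3 d4 p0, where p0 is the overall
    parity over the first seven bits.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p3, d2, d3, d4]
    p0 = 0
    for bit in word:
        p0 ^= bit
    return word + [p0]


def check(word):
    """Classify a stored word and return (status, resulting_word).

    status is "ok", "corrected" (single-bit error fixed), or
    "uncorrectable" (double-bit error detected, word left as read).
    """
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    position = s1 + 2 * s2 + 4 * s3
    overall = 0
    for bit in w:
        overall ^= bit
    if position == 0 and overall == 0:
        return "ok", w
    if overall == 1:
        # Odd number of flips: treat as a single error and correct it.
        # A zero syndrome means the overall parity bit itself flipped.
        index = position - 1 if position else 7
        w[index] ^= 1
        return "corrected", w
    # Nonzero syndrome with even overall parity: two bits are wrong.
    return "uncorrectable", w


if __name__ == "__main__":
    good = encode_secded([1, 0, 1, 1])
    stored = list(good)
    stored[2] ^= 1  # weak cell returns a faulty bit
    print(check(stored)[0])
    stored[5] ^= 1  # soft error lands in the same word before scrubbing
    print(check(stored)[0])
```

The sketch reflects the sequence in the text: the weak cell alone is within the reach of SEC, but once a soft error lands in the same word before a scrub pass, the ECC can only flag the word as bad.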
In such a circumstance, memory can be tested by taking the chip offline and performing a test, causing a system interrupt and downtime. A chip may pass the test but be deemed of insufficient reliability to continue service. It may otherwise be judged unserviceable due to the unpredictability of its performance, the perceived threat of future degradation to the needed system reliability and uptime, or simply a lack of redundant memory resources (RMR), whether through prior consumption of the RMR or insufficient capability of the RMR.