Modern computer memories use error correcting codes (ECC) to enable correct data to be recovered in spite of the presence of occasional errors. Errors are classified as either hard or soft, depending on whether the error is permanent or transient. A stuck bit that always reads as “0” no matter what is written into it would be an example of a hard error. A bit that was written as “1” but happens accidentally to get read back as “0” would be an example of a soft error.
The error rate is typically presented as a mean time between failures (MTBF) of whatever component is under consideration. Manufacturers publish values for the hard and soft error rate MTBFs of their memory products. For example, for a representative 1 gigabit memory module, a publication may list a soft error rate MTBF of 8 to 10 years and a hard error rate MTBF of about ten times that. This means that, on average, during 8 to 10 years of operation of this memory module, one should expect to encounter one bit that is read out as the wrong value.
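As a rough illustration of what such figures imply for a larger memory, the following sketch scales a per-module MTBF to a system of many modules. It assumes that modules fail independently and at a constant rate, so that their error rates add. The function names and the 512-module system are hypothetical and are not taken from any manufacturer's data.

```python
def system_mtbf_years(module_mtbf_years: float, n_modules: int) -> float:
    """MTBF of n identical modules, assuming independent, constant-rate errors."""
    return module_mtbf_years / n_modules


def expected_errors(mtbf_years: float, years: float) -> float:
    """Mean number of errors over an operating period, under the same assumption."""
    return years / mtbf_years


# Low end of the soft error MTBF range quoted above for one 1 gigabit module.
module_mtbf = 8.0
# Hypothetical system built from 512 such modules (about 64 gigabytes).
n_modules = 512

days = system_mtbf_years(module_mtbf, n_modules) * 365.25
print(f"mean time between soft errors: {days:.1f} days")  # roughly 5.7 days
```

Under these assumptions, a soft error that is rare for a single module becomes a routine event for a large memory.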
Modern memories are based on dynamic random access memory chips (DRAM). DRAMs periodically refresh their memory cells. In a large memory, refreshes comprise the overwhelming proportion of operations performed over time in each DRAM chip. If a soft error occurs during a refresh operation, or during a write operation, the corrupt (i.e., erroneous) bit value will be stored back into a memory cell and thus the corruption resulting from the error will persist. Subsequent, non-faulty operations will correctly read the corrupt value. To guard against such errors, known memory systems employ an ECC so that when a corrupt value is read the correct data is recoverable. But since corruption persists in memory, subsequent soft errors may eventually further corrupt an already corrupt value. Since there is a limit to the amount of corruption that an ECC can correct, it is desirable to periodically check all data in memory, recover the correct data corresponding to any corrupt value, and repair the corruption by storing the correct data back in memory. As used herein, the term “scrubbing” refers to a process of checking all data in memory and repairing corruption.
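The benefit of scrubbing more often can be sketched numerically. The example below assumes that soft errors strike a given word as a Poisson process with a constant rate, and that an ECC able to correct one corrupt bit per word is in use, so that a word becomes uncorrectable once two or more errors accumulate between scrubs. The function name and the rate used are hypothetical and chosen only for illustration.

```python
import math


def p_two_or_more(rate_per_hour: float, interval_hours: float) -> float:
    """Probability that a word collects at least two errors in one scrub interval.

    Assumes Poisson arrivals: P(k >= 2) = 1 - exp(-x) * (1 + x), with x = rate * interval.
    """
    x = rate_per_hour * interval_hours
    return 1.0 - math.exp(-x) * (1.0 + x)


# Hypothetical per-word error rate, exaggerated so the numbers are easy to read.
rate = 0.01
for interval in (1.0, 2.0, 4.0):
    print(f"scrub every {interval:>3} h: P(uncorrectable) = {p_two_or_more(rate, interval):.3e}")
```

For small values of the product of rate and interval, the probability is approximately half the square of that product, so halving the scrub interval reduces the chance of an uncorrectable word by roughly a factor of four.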
A memory is typically organized as an array of words. Each word may be considered an error correction unit that includes some number of data bits and some number of error correction bits. Depending on the particular ECC used, some set of patterns of corrupt bits can be corrected and some set of errors can be detected. Often, the set of errors that can be detected by a particular ECC is larger than the set that can be corrected. For example, a typical ECC detects single and double corrupt bits (i.e., up to two corrupt bits per word) but is capable of correcting only a single corrupt bit per word.
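The distinction between detection and correction can be seen in a small example. The sketch below uses an extended Hamming code with 4 data bits and 4 check bits, which corrects any single corrupt bit and detects any double corruption. It is a toy chosen for brevity: real memories protect wider words, and the function names and the 8-bit layout here are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit codeword (index 0 holds overall parity)."""
    c = [0] * 8
    # Data bits occupy the positions that are not powers of two.
    c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in range(4)]
    # Each check bit covers the positions whose index has the matching bit set.
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    # The overall parity bit makes the total number of ones even.
    c[0] = sum(c[1:]) % 2
    return c


def decode(codeword: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); data is None when the word cannot be corrected."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos          # XOR of the positions holding a one
    parity_odd = sum(c) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        c[syndrome] ^= 1             # single error: the syndrome is its position
        status = "corrected"
    else:
        return None, "uncorrectable"  # even parity, nonzero syndrome: two errors
    return c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3), status


word = encode(0b1010)
word[5] ^= 1                 # one corrupt bit: recoverable
print(decode(word))          # (10, 'corrected')
word[2] ^= 1                 # a second corrupt bit in the same word
print(decode(word))          # (None, 'uncorrectable')
```

The second call shows why accumulated corruption matters: the code still notices that the word is wrong, but can no longer recover the data.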
There are several known methods for repairing corrupt data discovered during scrubbing. In one such method, the CPU, or other processor, writes all words back to memory. While this method is simple, errors are presumably infrequent, and therefore most of the writing back is unnecessary. In another known method, the memory controller remembers the address of a word whenever it corrects a corrupt memory word. When the CPU learns of the address of a corrupt word, typically via an interrupt, it repairs the corrupt word by reading the word from memory and writing the word back to memory. See U.S. Pat. No. 5,978,952 to Hayek et al. Care must also be taken under this method to guarantee that all corrupt words uncovered during the scan are in fact repaired, that is, that the scan is complete. In a third approach, the memory controller itself writes back the corrected data when corruption is encountered. See U.S. Pat. No. 6,101,614 to Gonzales et al.
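The difference in write traffic between the first and third approaches can be sketched as follows. The example is not an implementation of any of the cited patents. It models a memory in which each word is stored as three copies and a bitwise majority vote stands in for the ECC, which is an assumption made only to keep the example short. The class and function names are hypothetical.

```python
class ToyMemory:
    """Each word is kept as three copies; a majority vote stands in for ECC."""

    def __init__(self, n_words: int) -> None:
        self.cells = [[0, 0, 0] for _ in range(n_words)]
        self.writes = 0                      # counts write operations

    def write(self, addr: int, value: int) -> None:
        self.cells[addr] = [value, value, value]
        self.writes += 1

    def read(self, addr: int):
        """Return (corrected value, whether corruption was seen)."""
        a, b, c = self.cells[addr]
        value = (a & b) | (a & c) | (b & c)  # bitwise majority of the copies
        return value, not (a == b == c)


def scrub_write_all(mem: ToyMemory) -> None:
    """First approach: write every word back, corrupt or not."""
    for addr in range(len(mem.cells)):
        value, _ = mem.read(addr)
        mem.write(addr, value)


def scrub_write_on_error(mem: ToyMemory) -> int:
    """Third approach: write back only when corruption is encountered."""
    repaired = 0
    for addr in range(len(mem.cells)):
        value, corrupt = mem.read(addr)
        if corrupt:
            mem.write(addr, value)
            repaired += 1
    return repaired


mem = ToyMemory(1024)
for addr in range(1024):
    mem.write(addr, addr)
mem.cells[7][1] ^= 0b100                     # simulate one soft error
mem.writes = 0
print(scrub_write_on_error(mem), mem.writes)  # 1 1
scrub_write_all(mem)
print(mem.writes)                             # 1025
```

With a single corrupt word in 1024, the selective scrub performs one write where the write-everything scrub performs 1024.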
Generally, scrubbing methods include some arrangement to make the read and write back of corrected data an atomic operation, so that no other update to the corrupt word being repaired can insert itself between the read and the write back. Such an arrangement under the approach of Gonzales et al. is the subject of U.S. Pat. No. 6,076,183 to Espie et al.
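The hazard that atomicity avoids can be sketched in software. The example below is not the arrangement of any cited patent. It uses a single lock as a stand-in for whatever mechanism a memory controller would use, and the `correct` function passed to `repair` is a hypothetical placeholder for ECC correction that simply clears bit 0.

```python
import threading


def unsafe_interleaving() -> int:
    """Replay, step by step, a repair that is interrupted by another write."""
    data = [0b1011]           # bit 0 was flipped by a soft error
    stale = data[0] & ~1      # scrubber reads and corrects the old contents
    data[0] = 0b0110          # another agent stores new data in between
    data[0] = stale           # scrubber writes back and destroys the new data
    return data[0]


class GuardedMemory:
    """Memory whose repair step cannot be split by a concurrent write."""

    def __init__(self, n_words: int) -> None:
        self._data = [0] * n_words
        self._lock = threading.Lock()   # one lock for all words, for simplicity

    def read(self, addr: int) -> int:
        with self._lock:
            return self._data[addr]

    def write(self, addr: int, value: int) -> None:
        with self._lock:
            self._data[addr] = value

    def repair(self, addr: int, correct) -> None:
        """Read, correct, and write back while holding the lock throughout."""
        with self._lock:
            self._data[addr] = correct(self._data[addr])


print(bin(unsafe_interleaving()))       # 0b1010: the new data was lost

mem = GuardedMemory(1)
mem.write(0, 0b1011)
scrubber = threading.Thread(target=mem.repair, args=(0, lambda v: v & ~1))
writer = threading.Thread(target=mem.write, args=(0, 0b0110))
scrubber.start(); writer.start()
scrubber.join(); writer.join()
print(bin(mem.read(0)))                 # 0b110 in either order of execution
```

Because the repair holds the lock from its read through its write back, the other write lands either wholly before or wholly after it, and the new data survives in both cases.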