Memory devices such as DRAMs occasionally encounter errors, which pose problems for computing platforms that write to or read from faulty memory cells. Reading bad data from a faulty memory cell can cause problems such as a system crash, resulting in system downtime. DRAM errors may include single- or multi-bit hard errors or transient soft errors. As the silicon feature sizes used to develop memory devices decrease, and as the operating frequency and steady-state operating temperature of memory devices increase, these errors become more frequent and faulty memory cells accumulate. Memory errors can occur at any time: after the memory device is installed in the computing platform, at boot time, or while the memory device is operating in an on-line system.
To reduce system downtime due to memory errors, several methods have been used to protect against faulty memory cells. These methods include error correction code (ECC), memory scrubbing, and redundant memory provided through memory mirroring, memory RAID, or memory sparing.
The method of using redundant memory requires healthy memory in both the active and the redundant regions. The health of memory is tested at each startup (boot). Such a rigorous memory test is destructive because it overwrites the contents of the memory cells. Because volatile memory such as DRAM does not contain useful data at startup, destructive memory testing at that time is inconsequential. In normal practice, however, such destructive memory testing is not performed on on-line volatile memory, that is, while the computing platform is in operation.
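To illustrate why a rigorous pattern test is destructive, the following is a minimal sketch in Python against a simulated memory region. The `SimulatedMemory` class, the `destructive_test` function, and the 0x55/0xAA patterns are assumptions made for illustration only; they do not describe any particular platform's boot-time test.

```python
class SimulatedMemory:
    """Toy model of a memory region (hypothetical, for illustration).

    stuck_at maps an address to a byte value that the cell always returns,
    modelling a hard fault.
    """

    def __init__(self, size, stuck_at=None):
        self.cells = bytearray(size)
        self.stuck_at = dict(stuck_at or {})

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        # A stuck cell returns its forced value regardless of what was written.
        return self.stuck_at.get(addr, self.cells[addr])


def destructive_test(mem, patterns=(0x55, 0xAA)):
    """Write each pattern to every cell, read it back, return faulty addresses."""
    faulty = set()
    size = len(mem.cells)
    for pattern in patterns:
        for addr in range(size):
            mem.write(addr, pattern)  # overwrites whatever the cell held
        for addr in range(size):
            if mem.read(addr) != pattern:
                faulty.add(addr)
    return sorted(faulty)


# Usage: one cell is stuck at 0x00; address 5 holds "useful" data before the test.
mem = SimulatedMemory(16, stuck_at={3: 0x00})
mem.write(5, 0x42)
faulty = destructive_test(mem)
data_after_test = mem.read(5)
```

In this sketch the test finds the stuck cell, but the value previously stored at address 5 is replaced by the last test pattern. That loss is harmless at boot and unacceptable for memory that is in use.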
In the absence of these tests, errors that occur in the memory subsystem during normal operation remain hidden until a read access to the failed memory location. At that time, if memory scrubbing is enabled, the memory error is correctable, and the error is transient, the system can restore its health by replacing the corrupted data with corrected data. In present practice, however, memory scrubbing is enabled only for the active memory regions and does not protect the spare memory.
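The scrubbing behavior described above can be sketched as a single patrol pass over a set of memory regions. This is a simplified model, not an actual memory controller implementation: the `Region` class, the `scrub_pass` function, and the `scrub_spare` and `correctable_bits` parameters are hypothetical, and ECC is abstracted as a count of flipped bits per word rather than a real code. The model also assumes that every correctable error is transient, so that rewriting the word clears it.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical memory region; words[i] is the number of flipped bits in word i."""
    name: str
    is_spare: bool
    words: list


def scrub_pass(regions, scrub_spare=False, correctable_bits=1):
    """Patrol each region once.

    Words with at most correctable_bits flipped bits are rewritten with
    corrected data; words with more are reported as uncorrectable.
    """
    corrected, uncorrectable = [], []
    for region in regions:
        if region.is_spare and not scrub_spare:
            continue  # spare memory is skipped, so its errors stay hidden
        for index, flips in enumerate(region.words):
            if flips == 0:
                continue
            if flips <= correctable_bits:
                region.words[index] = 0  # write corrected data back
                corrected.append((region.name, index))
            else:
                uncorrectable.append((region.name, index))
    return corrected, uncorrectable


# Usage: the active region has one correctable and one uncorrectable error;
# the spare region has a correctable error that the pass never visits.
active = Region("active", is_spare=False, words=[0, 1, 0, 2])
spare = Region("spare", is_spare=True, words=[0, 0, 1, 0])
corrected, uncorrectable = scrub_pass([active, spare])
```

With the default `scrub_spare=False`, which mirrors the present practice described above, the error in the spare region is neither corrected nor reported.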
The spare memory is used to recover from a serious uncorrectable memory failure in the active memory regions. Therefore, after boot time, in the absence of any spare memory testing, the system can detect errors in the spare memory region only in response to excessive failures detected in the active memory regions. Such late detection of failures can prevent the system from taking advantage of the redundant spare memory and can lead to a premature system crash.