Data storage, referred to generically herein as “memory,” is commonly implemented in computer systems. Computer systems may employ a multi-level hierarchy of memory, with relatively fast, expensive, but limited-capacity memory at the lowest level of the hierarchy, proceeding to relatively slower, lower-cost, but higher-capacity memory at the highest level of the hierarchy. The hierarchy may include a fast memory called a cache, either physically integrated within a processor or mounted physically close to the processor for speed. In addition, the computer system may use multiple levels of caches.
From time to time, a defect may occur within a portion of memory. Such a defect may occur and be detected during manufacturing (or “fabricating”) of the memory, or it may be a latent defect that is not observed until after the memory chip has been supplied by the manufacturer. Latent defects may be caused, for example, by aging, stresses, and/or actual use of the memory, any of which may result in errors from the point of view of the memory. Thus, latent defects refer to defects that were not present (or did not manifest themselves) during the testing and production of the memory. Some latent defects manifest themselves as hard errors, which fail consistently when tested for. Other latent defects manifest themselves as erratic errors, which fail inconsistently.
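By way of illustration only, the distinction between hard errors and erratic errors may be sketched as a repeated test of a suspect memory location. The sketch below is an assumption made for explanatory purposes and is not a method described herein: the function name `classify_defect`, the callable `test_once` (which returns True when a single test of the location passes), and the default of 16 trials are all hypothetical.

```python
from typing import Callable


def classify_defect(test_once: Callable[[], bool], trials: int = 16) -> str:
    """Classify a memory location by repeating a pass/fail test.

    `test_once` is a hypothetical callable that returns True when one
    test of the location passes and False when it fails.
    """
    if trials < 1:
        raise ValueError("trials must be at least 1")

    # Count how many of the repeated tests fail.
    failures = sum(1 for _ in range(trials) if not test_once())

    if failures == 0:
        return "no error observed"  # the defect did not manifest in these trials
    if failures == trials:
        return "hard"               # fails consistently when tested for
    return "erratic"                # fails inconsistently
```

A finite number of trials cannot rule out an erratic error, so a result of "no error observed" means only that the defect did not manifest itself during these particular tests.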
Latent defects in memory, if not detected, corrected, or avoided, may cause a running program that accesses the affected portion of memory to crash. This is not acceptable, especially in systems expected to have high uptimes (high-availability systems). Nor is it acceptable to severely limit the performance of such systems. Therefore, there is a need for methods to detect, correct, or avoid latent defects in memory (whether they manifest themselves as hard or erratic errors) without limiting the performance of the system.