1. Field of the Invention
The present invention relates generally to memory systems and, more particularly, to memory system testing.
2. Description of Related Art
As consumer electronics continue to advance in capability, decrease in size, and permeate virtually every aspect of modern life, an increasingly high percentage of integrated circuit chip area is being dedicated to memory cells. Likewise, system level memory is also increasing in capability and capacity. This increase in memory size and capacity is being made possible, in part, by advances in semiconductor technology that allow more transistors to be integrated into ever smaller integrated circuit chips, while keeping power dissipation and heat production under control through lower voltage levels. As a consequence, modern memory systems are becoming increasingly susceptible to failures due to functional imperfections in the chips incurred during chip fabrication, exposure to cosmic radiation during fabrication and operation, and other environmental conditions and stresses incurred during fabrication and/or operation.
The need for memory testing was well established even before recent increases in memory size and capacity. However, the recent increases have made memory testing even more critical. As a result, significant energy and resources have been devoted to memory testing in the recent past, and numerous publications on the subject, along with several prior art methods, have appeared.
In the prior art, tester-level memory testing was done by the vendor of the memory chip, typically a DRAM chip, while system-level offline-mode memory testing, such as Power-On Self-Test (POST) and Built-In Self-Test (BIST), was commonly done by the computer system manufacturer during system manufacturing, new installation, or reconfiguration. Generally, in the prior art, POST/BIST ran before the Operating System (OS) booted up on the computer system.
While prior art POST/BIST tests alone were effective at catching some faults, it is well known that different stress conditions, such as temperature, voltage, timing noise, and the like, also play an important role in creating, causing, and amplifying certain failure mechanisms long after computer system boot up. Consequently, it is important to do some offline testing under simulated-real-life computer system environmental conditions, with the Operating System booted up and running, especially during the manufacturing of mission critical systems. However, even offline testing under simulated-real-life environmental conditions is not enough by itself because there is a constant Mean Time Between Fails (MTBF) associated with all components in a computer system, and particularly in a memory system. Therefore, faults can develop at any time during the operational life of the computer and/or memory system.
In addition, with decreasing physical memory cell area, and the decreasing amount of charge stored by the memory cell, i.e., decreasing threshold voltages, cosmic radiation is becoming more of a concern, and radiation induced single and multiple bit errors are now common, even in operation at sea level. Permanent damage due to single ion impacts is also known to occur in memory devices. It is therefore necessary to have some mechanism for online testing, i.e., testing under real-life system loads and conditions, and for error handling, in order to maintain the high computer and/or memory system availability now required in the industry.
In the prior art, system memory has typically also been protected by parity and/or Error Correction Code (ECC). While parity provides only single bit error detection, ECC is typically implemented as a Hamming code, providing Single bit Error Correction and Double bit Error Detection (SECDED) capabilities at the expense of additional redundant bits, as compared to the single bit required for parity. However, this prior art approach, when used alone, provides only passive protection, because ECC error correction takes place only when data is read. Consequently, in the prior art, Uncorrectable Errors (UEs) often occurred due to accumulation of multiple single bit errors in the same data word between successive reads. To reduce the probability of such accumulation, many prior art systems implemented memory scrubbers.
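The difference between parity and SECDED ECC described above can be illustrated with a minimal sketch. The example below is not taken from any particular prior art system; it assumes a small (8,4) extended Hamming code chosen for readability, whereas practical memory systems generally protect wider data words. The function names parity_bit, secded_encode, and secded_decode are hypothetical.

```python
def parity_bit(bits):
    """Even parity: the one extra bit that makes the total number of 1s even."""
    return sum(bits) & 1


def secded_encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming positions, with check bits at 1, 2, 4 and data at 3, 5, 6, 7.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions with bit 2 set
    code[0] = sum(code[1:]) & 1             # overall parity over the other 7 bits
    return code


def secded_decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    overall_odd = sum(code) & 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd number of flipped bits: treated as a single bit error.
        # The syndrome is the index of the flipped bit (0 = overall parity bit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flipped bits with a nonzero syndrome: double bit error.
        return "uncorrectable", None
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble
```

In this sketch, parity costs one extra bit per word and can only report that an odd number of bits changed, while the SECDED code spends four check bits on four data bits in order to locate and repair a single flipped bit and to flag a double bit error as uncorrectable.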
A memory scrubber periodically reads through all of system memory and, if it finds any single bit errors (for SECDED ECC), writes the corrected data back, effectively cleaning up the error. While in the prior art SECDED ECC, coupled with memory scrubber techniques, improved protection against single bit upsets, there was still vulnerability to faults that could cause multi-bit errors. Unfortunately, ECC codes like DECTED (Double Error Correct, Triple Error Detect) that, in theory, provided correction of more than one bit were found to be inefficient, with the additional check bits themselves being susceptible to the same faults.
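The scrubbing behavior described above, and the way it limits error accumulation, can be sketched as follows. This is an illustrative model only, not the scrubber of any specific prior art system: memory is modeled as a Python list of 8-bit codewords, the ECC is assumed to be a small (8,4) extended Hamming code, and the names encode, check, and scrub are hypothetical. For brevity, check uses a nearest-codeword lookup rather than the syndrome logic a hardware implementation would typically use.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (an int)."""
    d0, d1, d2, d3 = [(nibble >> i) & 1 for i in range(4)]
    p1, p2, p4 = d0 ^ d1 ^ d3, d0 ^ d2 ^ d3, d1 ^ d2 ^ d3
    bits = [0, p1, p2, d0, p4, d1, d2, d3]
    bits[0] = sum(bits) & 1                  # overall parity bit
    return sum(b << i for i, b in enumerate(bits))


# All 16 valid codewords; any two of them differ in at least 4 bit positions.
CODEWORDS = [encode(n) for n in range(16)]


def check(word):
    """Return (status, word_to_store) for one stored word."""
    for cw in CODEWORDS:
        distance = bin(word ^ cw).count("1")
        if distance == 0:
            return "ok", word
        if distance == 1:
            return "corrected", cw           # single bit error: repairable
    return "uncorrectable", word             # two or more bits flipped


def scrub(memory):
    """One scrub pass: read every word and write back any corrected data."""
    stats = {"ok": 0, "corrected": 0, "uncorrectable": 0}
    for addr, word in enumerate(memory):
        status, fixed = check(word)
        if status == "corrected":
            memory[addr] = fixed             # the write-back cleans up the error
        stats[status] += 1
    return stats
```

If one bit of a word is upset and a scrub pass runs before a second bit of the same word is upset, each pass sees only a single bit error and repairs it. If both upsets land between passes, the same word is reported as uncorrectable, which is the accumulation case that scrubbing makes less likely but cannot eliminate.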
What is needed is a method for in-system memory testing that is capable of testing system memory in both offline and online environments, without imposing any additional hardware requirements or significantly affecting normal computer or memory system operation and performance.