1. Field of the Invention
The present invention relates generally to memory protection, and more specifically to a technique for detecting errors in a memory device.
2. Description of the Related Art
This section is intended to introduce the reader to various aspects of art which may be related to various aspects of the present invention which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present invention. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
Semiconductor memory devices used in computer systems, such as dynamic random access memory (DRAM) devices, generally comprise a large number of capacitors which store binary data in each memory device in the form of a charge. These capacitors are inherently susceptible to errors. As memory devices get smaller and smaller, the capacitors used to store the charges also become smaller thereby providing a greater potential for errors.
Memory errors are generally classified as xe2x80x9chard errorsxe2x80x9d or xe2x80x9csoft errors.xe2x80x9d Hard errors are generally caused by issues such as poor solder joints, connector errors, and faulty capacitors in the memory device. Hard errors are reoccurring errors which generally require some type of hardware correction such as replacement of a connector or memory device. Soft errors, which cause the vast majority of errors in semiconductor memory, are transient events wherein extraneous charged particles cause a change in the charge stored in one of the capacitors in the memory device. When a charged particle, such as those present in cosmic rays, comes in contact with the memory circuit, the particle may change the charge of one or more memory cells, without actually damaging the device. Because these soft errors are transient events, generally caused by alpha particles or cosmic rays for example, the errors are not generally repeatable and are generally related to erroneous charge storage rather than hardware errors. For this reason, soft errors, if detected, may be corrected by rewriting the erroneous memory cell with correct data. Uncorrected soft errors will generally result in unnecessary system failures. Further, soft errors may be mistaken for more serious system errors and may lead to the unnecessary replacement of a memory device. By identifying soft errors in a memory device, the number of memory devices which are actually physically error free and are replaced due to mistaken error detection can be mitigated, and the errors may be easily corrected before any system failures occur.
Memory errors can be categorized as either single-bit or multi-bit errors. A single bit error refers to an error in a single memory cell. Single-bit errors can be detected and corrected by standard Error Code Correction (ECC) methods. However, in the case of multi-bit errors, which affect more than one bit, standard ECC methods may not be sufficient. In some instances, ECC methods may be able to detect multi-bit errors, but not correct them. In other instances, ECC methods may not even be sufficient to detect the error. Thus, multi-bit errors must be detected and corrected by a more complex means since a system failure will typically result if the multi-bit errors are not detected and corrected.
Regardless of the classification of memory error (hard/soft, single-bit/multi-bit), the current techniques for detecting the memory errors have several drawbacks. Typical error detection techniques rely on READ commands being issued by requesting devices, such as a peripheral disk drive. Once a READ command is issued to a memory sector, a copy of the data is read from the memory sector and tested for errors en route to delivery to the requesting device. Because the testing of the data in a memory sector only occurs if a READ command is issued to that sector, seldom accessed sectors may remain untested indefinitely. Harmless single-bit errors may align over time resulting in uncorrectable multi-bit errors. Once a READ request is finally issued to a seldom accessed sector, previously correctable errors may have evolved into uncorrectable errors thereby causing unnecessary data corruption or system failures. Early error detection may significantly reduce the occurrences of uncorrectable errors and prevent future system failures.
Further, in redundant memory systems, undetected memory errors may pose an additional threat. Certain operations, such as hot-plug events, may require that the system transition from a redundant to a non-redundant state. In a non-redundant state, memory errors which were of little concern during a redundant mode of operation, may become more significant since errors that were correctable during a redundant mode of operation may no longer be correctable while the system operates in a non-redundant state.
The present invention may address one or more of the concerns set forth above.