Computer systems are subject to a variety of problems that may cause errors in memory, ranging from flaws in memory circuitry to background radiation. Because such errors are expected, computer memories are often designed so that a small number of isolated errors will not interfere with normal operation. These isolated errors, known as correctable errors, are first detected and then corrected by the computer system to prevent corruption of user data. Computer systems detect errors using common techniques such as parity bits or repetition schemes. Once an error is detected, most computer systems can correct it using error-correcting codes or similar techniques. Those skilled in the art will appreciate the varied techniques that may be employed by computer systems to detect and correct errors in memory.
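By way of illustration only, the two detection techniques named above can be sketched in a few lines of Python. The function names, the even-parity convention, and the three-copy repetition factor are assumptions chosen for the example and are not drawn from any particular memory system.

```python
def parity_bit(bits):
    """Return the even-parity bit for a sequence of 0/1 values."""
    return sum(bits) % 2


def has_parity_error(bits, stored_parity):
    """Detect (but not locate) an odd number of flipped bits."""
    return parity_bit(bits) != stored_parity


def encode_repetition(bits, copies=3):
    """Store each bit `copies` times in a row."""
    return [b for b in bits for _ in range(copies)]


def decode_repetition(coded, copies=3):
    """Recover each bit by majority vote over its copies."""
    decoded = []
    for i in range(0, len(coded), copies):
        group = coded[i:i + copies]
        decoded.append(1 if sum(group) > copies // 2 else 0)
    return decoded


data = [1, 0, 1, 1]
stored_parity = parity_bit(data)

# Parity: a single flipped bit is detected, but its position is unknown.
corrupted = list(data)
corrupted[2] ^= 1
parity_detects_flip = has_parity_error(corrupted, stored_parity)

# Repetition: a single flipped copy is outvoted, so the data is recovered.
coded = encode_repetition(data)
coded[4] ^= 1
recovered = decode_repetition(coded)
```

In this sketch the parity check can only report that something changed, while the repetition scheme can also restore the original value, at the cost of storing three times as many bits.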
Generally, computer systems can account for a few isolated errors. However, if the number of correctable errors increases or the errors are not isolated, two or more errors that would each be correctable alone may at some point combine into an uncorrectable error. Uncorrectable errors occurring in computer memories often create significant problems. For example, an uncorrectable error may require the processing system to be stopped and restarted in order to avoid corruption of the user data being processed. Further, a memory system that produces uncorrectable errors can no longer be relied upon to provide accurate data and, therefore, must be replaced or functionally isolated to prevent future occurrences of uncorrectable errors.
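The way in which non-isolated errors defeat correction can be illustrated with a small single-error-correcting code. The sketch below assumes a Hamming(7,4) code purely as an example; the text above does not name a specific code, and the helper names are hypothetical. Practical memory error-correcting codes commonly add a further parity bit so that a double error is at least detected rather than silently miscorrected.

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit; return (data bits, syndrome)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of a single error
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


def flip(word, positions):
    """Return a copy of `word` with the given 1-based positions flipped."""
    out = list(word)
    for p in positions:
        out[p - 1] ^= 1
    return out


data = [1, 0, 1, 1]
word = hamming74_encode(data)

# One isolated error: the decoder locates and repairs it.
one_fixed, _ = hamming74_decode(flip(word, [5]))

# Two errors in the same word: the decoder "repairs" the wrong bit
# and returns corrupted data.
two_fixed, _ = hamming74_decode(flip(word, [3, 5]))

single_ok = one_fixed == data   # True
double_ok = two_fixed == data   # False
```

Under these assumptions, each of the two flipped bits would have been correctable on its own, yet together they exceed what the code can repair.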
Current approaches to predicting and preventing uncorrectable errors are costly and inefficient. One method currently employed to prevent uncorrectable errors is to provide redundant memory hardware that maintains a backup of all stored data. However, redundant hardware is costly, due both to the incremental cost of the hardware itself and to the additional cost of managing it. Another method currently used is simply to replace any memory system that produces correctable errors, under the belief that a correctable error is an unequivocal warning that the memory system will produce an uncorrectable error in the near future. However, this method is inefficient because not every correctable error is proof of a structural problem within the computer memory system. For example, the correctable error may have been caused by background radiation. Thus, a functioning memory system may be replaced or quarantined unnecessarily. The unnecessary replacement of computer memories has several drawbacks: the financial cost of both the memory and the labor for the replacement, system downtime while the memory is replaced, and the negative impact on customer relations that results from having to replace memory systems.
What is needed is a way to accurately predict when an uncorrectable error will occur so that proper steps can be taken to prevent the error without incurring the unnecessary financial costs of replacing a functioning computer memory.