1. Field of the Invention
The present invention relates to techniques for handling errors in computer systems. More specifically, the present invention relates to a method and an apparatus for classifying memory errors in a computer system.
2. Related Art
As computer memories grow larger, and individual memory cells become progressively smaller, it is becoming considerably more likely for errors to occur in computer memories due to natural phenomena such as cosmic rays. Furthermore, as computer systems continue to increase in speed, data must be transferred at faster rates between processor and memory. This creates yet another source of data errors because faster data rates increase the likelihood of errors while transferring data between processor and memory.
Computer systems typically use error-correcting codes to detect and correct memory errors. This usually involves storing error-correcting code (ECC) bits along with each data word in memory, and then transferring the ECC bits along with a data word when the data word is transferred between main memory and the processor (or associated cache memory). Commonly used error-correcting codes typically support double-error detection and single-error correction for each data word. Hence, computer systems are generally able to detect double-bit errors and correct single-bit errors in a data word retrieved from main memory.
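The single-error-correction, double-error-detection behavior described above can be illustrated with a minimal sketch. The sketch assumes an extended Hamming (8,4) code over a 4-bit data word; real memory systems use wider words and their own code layouts, and the function names `encode` and `decode` are hypothetical, chosen only for this illustration.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (a list of bits).

    Index 0 holds an overall parity bit; indices 1..7 hold a Hamming(7,4)
    codeword with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code


def decode(code):
    """Return (status, nibble) for a possibly corrupted codeword.

    status is "ok", "corrected" (single-bit error repaired) or
    "uncorrectable" (double-bit error detected; nibble is None).
    """
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of positions of set bits
    parity_bad = sum(code) % 2 == 1

    if syndrome == 0 and not parity_bad:
        status = "ok"
    elif parity_bad:
        # Odd number of flipped bits: assume one. The syndrome names the
        # flipped position; 0 means the overall parity bit itself.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

In this sketch, flipping any one bit of an encoded word yields a "corrected" status with the original data, while flipping any two bits yields "uncorrectable", which mirrors the correctable and uncorrectable error classes discussed above.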
Some computer systems go one step further and provide mechanisms to determine the cause of a memory error. For example, if a correctable error is encountered while reading a data word from main memory, the computer system can read the data word a second time to determine the cause of the memory error. If the error does not occur during the second read, the system can determine that the error is an “intermittent error,” which can be caused, for example, by transient noise on the data lines between the processor and main memory.
On the other hand, if the second read also encounters an error, the computer system can use its ECC circuitry to correct the data word and write it to main memory. Then, to determine the cause of the error, the computer system can read the data word for a third time. If the third read also encounters the error, the system can determine that the error is a “sticky error,” which, for example, is caused by a “stuck” bit in the data word in main memory. On the other hand, if the third read returns the corrected data word, the system can determine that the error was a “persistent error,” which could have been caused by a change in the state of the data word in main memory.
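The three-read procedure described above can be sketched as follows. This is a simplified software model under stated assumptions, not an implementation of any particular system: the `SimulatedWord` class, its fault parameters, and the `classify` function are hypothetical names introduced for illustration, and the comparison against a known-good value stands in for real ECC checking.

```python
from enum import Enum


class ErrorKind(Enum):
    INTERMITTENT = "intermittent"
    PERSISTENT = "persistent"
    STICKY = "sticky"


class SimulatedWord:
    """Hypothetical model of one data word in main memory."""

    def __init__(self, good, stored, stuck_mask=0, stuck_value=0, noisy_reads=0):
        self.good = good                # value ECC would reconstruct
        self.stored = stored            # value actually held in the cells
        self.stuck_mask = stuck_mask    # bits that ignore writes
        self.stuck_value = stuck_value  # value those stuck bits present
        self.noisy_reads = noisy_reads  # number of upcoming reads hit by noise

    def read_has_error(self):
        """Read the word and report whether an error was observed."""
        if self.noisy_reads > 0:
            self.noisy_reads -= 1
            return True
        seen = (self.stored & ~self.stuck_mask) | (self.stuck_value & self.stuck_mask)
        return seen != self.good

    def write_corrected(self):
        """Write the ECC-corrected value back to memory."""
        self.stored = self.good


def classify(word):
    """Classify a correctable error already seen on a first read."""
    if not word.read_has_error():       # second read
        return ErrorKind.INTERMITTENT
    word.write_corrected()              # repair the stored value
    if word.read_has_error():           # third read
        return ErrorKind.STICKY
    return ErrorKind.PERSISTENT


def diagnose(word):
    """Perform the first read; classify only if it reports an error."""
    if not word.read_has_error():
        return None
    return classify(word)
```

For example, under this model a word constructed with `noisy_reads=1` is classified as intermittent, a word whose stored value differs from the good value in one bit is classified as persistent, and a word with a bit held at the wrong value by `stuck_mask` is classified as sticky.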
Unfortunately, the above-described mechanisms to determine the cause of a memory error can become very complicated in the presence of cache memories. Note that a cache memory mediates access to the main memory. However, in doing so, a cache memory interferes with attempts to retry an errant memory access, and can thereby interfere with the process of determining the cause of a memory error.
For example, in order to cause the computer system to perform a second read operation to a memory location, it is first necessary to flush the cache line associated with the memory location from the cache, so that the read operation will actually force a cache line to be retrieved from main memory. However, if the cache line is dirty when this flush takes place, the flush will cause the cache line to be stored back to memory, which may correct the error. Hence, the subsequent second read from the memory may not encounter the error. This can result in the error being diagnosed as an intermittent error, even though the error was actually a persistent error. (This type of mis-diagnosed error is referred to as a “false intermittent” error.)
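The “false intermittent” scenario can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: a single-line write-back cache, a single memory word, and a comparison against a known-good value standing in for ECC checking. The names `Memory`, `Cache`, and `second_read_sees_error` are hypothetical.

```python
class Memory:
    """Hypothetical single-word main memory with a stand-in for ECC."""

    def __init__(self, good):
        self.good = good      # value ECC would reconstruct
        self.stored = good    # value actually held in the cells

    def read(self):
        """Return (corrected_value, error_seen)."""
        return self.good, self.stored != self.good

    def write(self, value):
        self.stored = value
        self.good = value


class Cache:
    """Hypothetical single-line write-back cache in front of Memory."""

    def __init__(self, memory):
        self.memory = memory
        self.line = None
        self.dirty = False

    def load(self):
        """Read through the cache; report an error only on a miss."""
        if self.line is None:
            self.line, error_seen = self.memory.read()
            return error_seen
        return False          # cache hit: main memory is not accessed

    def store(self, value):
        self.line = value
        self.dirty = True

    def flush(self):
        if self.dirty:
            self.memory.write(self.line)   # write-back overwrites the bad word
        self.line = None
        self.dirty = False


def second_read_sees_error(dirty):
    """Run first read, optional store, flush, then the second read."""
    memory = Memory(0xAB)
    memory.stored ^= 0x01                  # persistent single-bit corruption
    cache = Cache(memory)
    assert cache.load()                    # first read observes the error
    if dirty:
        cache.store(0xAB)                  # processor dirties the line
    cache.flush()
    return cache.load()                    # second read, forced to memory
```

In this model, `second_read_sees_error(dirty=False)` returns `True`, because the corrupted word is still in memory, whereas `second_read_sees_error(dirty=True)` returns `False`, because the write-back during the flush has repaired the word. The second case would be diagnosed as an intermittent error even though the underlying fault was persistent.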
Furthermore, existing techniques cannot differentiate between certain types of memory errors. For example, a “leaky cell” condition can arise in which a memory cell does not hold charge. In this case, the above-described mechanism will incorrectly determine that the error is a persistent error, instead of a leaky cell. Furthermore, errors can arise because a specific processor in a multiprocessor system is a “bad reader” or a “bad writer.” Neither of these types of errors can be diagnosed with existing techniques.
Obviously, effective remedial action can only be taken if the cause of the memory error can be determined accurately. For example, unless the memory error is accurately diagnosed, it is impossible to ascertain whether a part needs to be replaced, and if so, which part.
Therefore, what is needed is a method and an apparatus that accurately determines the cause of a memory error within a computer system without the above-described problems.