1. Field of the Invention
The present invention relates to the field of computer memory systems and the performance thereof.
2. Prior Art
Most computer systems include, among other things, substantial storage capacity in the form of random access memory, currently most commonly in the form of dynamic random access memory (DRAM). Such memories and systems incorporating such memories are known to be subject to certain types of errors. For instance, in the memory itself, the errors may be generally classified as either soft errors or hard errors. Soft errors are errors which occur occasionally but are not repeatable, at least not on a regular basis. Thus, soft errors alter data, but the stored data may be corrected by rewriting the correct data to the same memory location. A major cause of soft errors in DRAMs is alpha particles which, because of the very small size of DRAM storage cells, can dislodge enough of the electrons forming the charge that determines the state of the cell to result in the cell being read as being in the opposite state. This results in a relatively randomly occurring, single bit memory error which, because of its very low likelihood of recurrence in the near future, can be corrected by rewriting the correct data to that memory location. Soft errors can also be caused by noise in the memory system, or by unstable DRAMs or SIMMs (DRAMs in the form of single inline memory modules).
Hard errors in the memory are repeatable errors which alter data due to some fault in the memory, and cannot be recovered by rewriting the correct data to the same memory location. Hard errors can occur when one memory cell becomes stuck in either state, or when SIMMs are not properly seated.
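The distinction drawn above, that a soft error is cleared by rewriting the correct data while a hard error is not, can be illustrated with a minimal sketch. The class SimulatedLocation, its stuck_mask and stuck_value parameters, and the function classify_error are hypothetical names introduced only for this illustration; they model a single memory location in software and do not describe any actual memory controller.

```python
class SimulatedLocation:
    """Toy model of one memory location (hypothetical, for illustration only).

    Bits selected by stuck_mask always read back as stuck_value,
    which models a cell stuck in one state (a hard fault).
    """

    def __init__(self, stuck_mask=0, stuck_value=0):
        self.stuck_mask = stuck_mask
        self.stuck_value = stuck_value & stuck_mask
        self.value = self.stuck_value

    def write(self, data):
        # Stuck bits ignore whatever is written to them.
        self.value = (data & ~self.stuck_mask) | self.stuck_value

    def read(self):
        return self.value

    def upset(self, bit):
        # Transient disturbance (e.g. an alpha particle) flips one stored bit;
        # a stuck bit is unaffected because it is forced to its stuck state.
        self.value ^= 1 << bit
        self.value = (self.value & ~self.stuck_mask) | self.stuck_value


def classify_error(location, correct_data):
    """Rewrite the correct data, read it back, and classify the error."""
    location.write(correct_data)
    return "soft" if location.read() == correct_data else "hard"


# A healthy location hit by a transient upset: the rewrite repairs it.
healthy = SimulatedLocation()
healthy.write(0xA5)
healthy.upset(0)
print(classify_error(healthy, 0xA5))

# A location whose bit 0 is stuck at 0: the rewrite cannot repair it.
faulty = SimulatedLocation(stuck_mask=0x01, stuck_value=0x00)
print(classify_error(faulty, 0xA5))
```

The classification relies only on the rewrite-and-read-back behavior described in the text, so a stuck bit that happens to agree with the data being written would go unnoticed by this simple check.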
Silent failures are failures that cannot be detected by the system. For example, if a standby part fails inside a system having redundant parts, most systems will remain unaware of the failure. However, although the system is still functional, it has lost its redundancy as if the same had never been provided, and is now vulnerable to a single failure of the operating part. Soft errors and hard errors can be either single bit or multiple bit memory errors, and can also be silent failures under certain conditions.
Currently, server systems manufactured and sold by Sun Microsystems, Inc., assignee of the present invention, are implemented with an error correction code (ECC) to protect the system from single bit memory errors. In the event of a single bit memory error in the data or the correction code as read from memory, the system automatically corrects the error before the data retrieved from memory is used. This is implemented using an 8-bit KANEDA error correction code for the 64-bit dataword of the memories, making the entire codeword 72 bits wide. The actual error detection and correction operation is done, for instance, by dedicated ECC circuitry as part of the processor module so that on the occurrence of a single bit memory error in the 72-bit codeword received from memory, the same will automatically be corrected before being presented to the processor. Also, upon the occurrence of a single bit error and the correction thereof by the ECC circuitry, the processor is alerted to that fact so that the processor will include the additional step of writing the corrected codeword (data and ECC) back to memory on the unverified assumption that the single bit error was a soft error. In such systems, the I/O of the system consists of a 64-bit word, the applicable ECC being appended to any dataword before the resulting 72-bit codeword is written to memory.
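The single bit correction and double bit detection behavior described above can be sketched in software. The sketch below does not implement the KANEDA code named in the text; as a stated substitution it uses a textbook extended Hamming code, which likewise adds 8 check bits to a 64-bit dataword to form a 72-bit codeword. The function names encode and decode, and the bit layout chosen, are assumptions made only for this illustration.

```python
DATA_BITS = 64
CODE_BITS = 72  # position 0: overall parity; positions 1..71: Hamming code
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
DATA_POSITIONS = [p for p in range(1, CODE_BITS) if p not in PARITY_POSITIONS]


def _syndrome(bits):
    """XOR together the positions (1..71) of all bits that are set."""
    s = 0
    for pos in range(1, CODE_BITS):
        if bits[pos]:
            s ^= pos
    return s


def encode(data):
    """Append 8 check bits to a 64-bit dataword, giving a 72-bit codeword."""
    bits = [0] * CODE_BITS
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (data >> i) & 1
    s = _syndrome(bits)
    for p in PARITY_POSITIONS:
        if s & p:
            bits[p] = 1          # forces the syndrome of the codeword to zero
    bits[0] = sum(bits[1:]) % 2  # makes the parity of all 72 bits even
    return sum(b << i for i, b in enumerate(bits))


def decode(codeword):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = [(codeword >> i) & 1 for i in range(CODE_BITS)]
    s = _syndrome(bits)
    odd_parity = sum(bits) % 2 == 1
    if s == 0 and not odd_parity:
        status = "ok"
    elif odd_parity and s < CODE_BITS:
        bits[s] ^= 1             # single bit error at position s (0 = parity bit)
        status = "corrected"
    else:
        status = "uncorrectable"  # e.g. a double bit error
    data = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return status, data


word = 0x0123456789ABCDEF
stored = encode(word)
print(decode(stored)[0])                          # no error
print(decode(stored ^ (1 << 17))[0])              # one flipped bit
print(decode(stored ^ (1 << 17) ^ (1 << 60))[0])  # two flipped bits
```

In the systems described, a "corrected" result would additionally prompt the processor to write the corrected codeword back to memory, and an "uncorrectable" result corresponds to the double bit error condition discussed below.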
Also, in the current systems of the type described, an automatic reset is initiated upon the occurrence of a double bit memory error. This, of course, results in an interruption of service by the system, loss of any ongoing communication, and loss of data. Because a double bit error is a rare event under normal operating conditions, such system failures caused by double bit memory errors are also rare. However, normal operating conditions may be defined as operation without excessive memory errors occurring in the system, wherein the ECC implementation described provides adequate protection for the integrity of the system memory. But two events can change a normal operating condition into an abnormal operating condition, specifically that (1) the memory subsystem has excessive single bit soft errors, and (2) the memory subsystem has single bit hard errors. These occurrences obviously greatly increase the probability that a normally expected soft error will become a second bit error causing automatic interruption of the system.
In the current ECC implementation, no memory error log is visible to the system administrator. Thus, whenever there is a single bit memory error, the system simply corrects it and continues to run. Under normal operating conditions, protecting the system from single bit errors is the purpose of the ECC. Under abnormal operating conditions, the ECC actually masks the underlying problem. When the memory subsystem has either excessive single bit soft errors or single bit hard errors, they become silent failures in the current ECC implementation. The system then becomes prone to single bit errors so that an additional single bit memory error combined with the silent failure may result in a double bit error, bringing the system down.
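The masking effect described above can be illustrated with a toy model of a single 72-bit codeword under the prior art behavior. The class PriorArtCodeword and its members are hypothetical names introduced only for this sketch. The model assumes that every stuck bit disagrees with the stored data, and it returns the internal outcome of each read purely for illustration; in the systems described, a corrected read is not visible to the system administrator.

```python
class PriorArtCodeword:
    """Toy model of one codeword: tracks which bit positions are wrong."""

    def __init__(self, stuck_bits=()):
        self.stuck_bits = set(stuck_bits)  # hard errors: survive write-back
        self.flipped_bits = set()          # soft errors: cleared by write-back
        self.visible_log = []              # never written to in the prior art

    def soft_error(self, bit):
        self.flipped_bits.add(bit)

    def read(self):
        bad = self.stuck_bits | self.flipped_bits
        if len(bad) >= 2:
            return "reset"                 # double bit error: automatic reset
        if len(bad) == 1:
            self.flipped_bits.clear()      # write-back repairs soft errors only
            return "corrected"             # corrected silently, nothing logged
        return "ok"


# Normal operation: a soft error is corrected once and then is gone.
healthy = PriorArtCodeword()
healthy.soft_error(5)
print(healthy.read(), healthy.read())

# A single bit hard error is corrected on every read and never reported...
faulty = PriorArtCodeword(stuck_bits={3})
print(faulty.read(), faulty.read(), faulty.visible_log)

# ...so one further soft error in the same codeword brings the system down.
faulty.soft_error(40)
print(faulty.read())
```

The second case is the silent failure: the reads keep succeeding and the empty log gives no indication that the codeword has lost its margin against a further single bit error.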