As the amount of memory and the number of CPUs per computer system increase, the likelihood of memory errors also increases. As is known, memory hardware can be built with the ability to correct data when a single bit is corrupted, using features such as ECC (Error-Correcting Code), ChipKill from International Business Machines or ChipSpare from Hewlett-Packard. It is possible, however, that more than one bit is corrupted within some fixed-size range, for example, a “window” of 256 bits as defined by the hardware implementation. In that case, the hardware may not be able to recover and must signal to the operating system that it may no longer be safe to run due to data corruption.
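The difference between a correctable single-bit error and an uncorrectable multi-bit error can be illustrated with a toy single-error-correcting, double-error-detecting code. The sketch below uses an extended Hamming code over a 4-bit data word purely for illustration; it is an assumption chosen for brevity, not the code used by ECC, ChipKill or ChipSpare hardware, which operate on much wider words and may use different codes. The function names and the bit layout are hypothetical.

```python
from functools import reduce
from operator import xor

# Illustrative layout (an assumption): positions 1, 2 and 4 hold Hamming
# parity bits, positions 3, 5, 6 and 7 hold data, and position 0 holds an
# overall parity bit covering the whole codeword.
DATA_POSITIONS = (3, 5, 6, 7)


def encode(data_bits):
    """Encode 4 data bits into an 8-bit codeword."""
    code = [0] * 8
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        code[pos] = bit
    # Each parity bit p covers every position whose index has bit p set.
    for p in (1, 2, 4):
        code[p] = reduce(xor, (code[i] for i in range(1, 8) if i & p and i != p))
    # Overall parity makes the XOR of all 8 bits equal to zero.
    code[0] = reduce(xor, code[1:])
    return code


def decode(code):
    """Return (status, data_bits); data_bits is None when uncorrectable."""
    # The syndrome is the XOR of the indices of all set bits in 1..7.
    syndrome = reduce(xor, (i for i in range(1, 8) if code[i]), 0)
    overall = reduce(xor, code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: assume one, located at `syndrome`
        # (a syndrome of 0 means the overall parity bit itself flipped).
        code = list(code)
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        # This is the case where hardware would signal the operating system.
        return "uncorrectable", None
    return status, [code[p] for p in DATA_POSITIONS]
```

In this sketch, flipping any one bit of a codeword yields `"corrected"` with the original data, while flipping any two bits yields `"uncorrectable"`, which corresponds to the situation in which the hardware can only report the error.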
If the error did not corrupt internal CPU state, however, the operating system could choose to try to recover from the error. Within the class of recoverable errors, there are two further classifications: persistent and non-persistent (transient) errors. A transient error is one that happens just once and is often attributed to a cosmic ray collision, since high-energy particles striking a memory chip can disturb the state of the RAM and cause the corruption. A persistent error, on the other hand, is one where the memory hardware has failed and continues to corrupt the bit each time it is used.
In the event of an error, some known operating systems are able to kill or terminate the program or application that was using the memory, usually at a page level. One system that has done work in this area is Sun Microsystems' Solaris operating system. Its contribution is the ability to terminate processes affected by an uncorrected memory error. Additionally, the system marks the affected memory, i.e., a memory page or pages, as not to be used, and data is stored at other locations. Sun's ZFS file system also has the ability to repair silent data corruption. ZFS may have multiple copies of the same data, and if one copy goes bad, ZFS uses a checksum to determine which remaining copy of the data is correct. This method does not, however, reconstruct the data when only a single copy exists and that copy is itself suspect.
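The checksum-based selection among redundant copies described above can be sketched as follows. This is a minimal illustration and not the ZFS implementation: it assumes the expected checksum is stored separately from the data copies, uses SHA-256 only as a convenient stand-in for whatever checksum the file system uses, and the function names are hypothetical.

```python
import hashlib
from typing import Optional, Sequence


def checksum(data: bytes) -> bytes:
    """Checksum used for illustration only (SHA-256 is an assumption)."""
    return hashlib.sha256(data).digest()


def pick_good_copy(copies: Sequence[bytes], expected: bytes) -> Optional[bytes]:
    """Return the first copy whose checksum matches `expected`.

    Returns None when no copy matches, which includes the case of a
    single copy that is itself corrupted: nothing is left to repair from.
    """
    for copy in copies:
        if checksum(copy) == expected:
            return copy
    return None
```

With two copies, one of which is corrupted, the function returns the intact copy; with only the corrupted copy available it returns `None`, which mirrors the limitation noted above.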
Other operating systems panic the entire system immediately with just an error report. Ignoring the error report and continuing with no action being taken, however, risks corrupting user data or otherwise operating incorrectly.