As the use of servers for database and computer applications increases, so does the need for robust systems that can detect failures that have occurred and reduce or prevent errors and failures. One type of failure is a memory error occurring in physical memory, such as random access memory (RAM) or other types of memory. Bits of data stored in physical memory cells may be corrupted upon writing, upon reading, or during storage. For example, a stored bit may suddenly and randomly change state somewhere in the memory, resulting in errors in the data. Alternatively, a noise pulse (electronic interference), crosstalk, or a glitch in the circuits or busses of a device may be misinterpreted in memory as a data bit or address bit. Other errors can occur in memory chips as a result of electromagnetic radiation, or of radioactive decay of atoms in the epoxy of the plastic chip package, which causes a memory cell to change state. Sometimes a part of a memory chip physically fails, causing recurring errors that rebooting the system does not alleviate, so that the memory chip must be replaced. “Soft” errors are those that generally result from transient events such as noise, crosstalk, or radiation, and may not indicate any serious or recurring problem with the memory at particular storage locations, while “hard” errors are those that result from a hardware failure, which may permanently cause recurring errors. In recent years, as system memory has greatly increased in density (i.e., more memory is stored on fewer physical devices), memory errors have come to pose a far greater threat to system availability. Thus, protection against system memory failures becomes increasingly important.
To alleviate the effects of such errors, many computer systems such as servers employ schemes to detect and correct memory errors. Some of these schemes are called Error Correcting Code (ECC), sometimes expanded as Error Checking and Correcting. Commonly-used ECC schemes can typically detect and correct single-bit errors: extra check bits are generated from the data as it is written to memory, and as the data is read back the system uses those check bits to detect the presence of a single-bit error, locate the bit in error, and correct it. The occurrence of the error is also recorded. This technique can thus fix single-bit errors without halting or rebooting the system.
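By way of illustration only, the check-bit mechanism described above can be sketched with a Hamming(7,4) code, a textbook single-error-correcting code. The choice of this particular code, the 4-bit data width, and the function names `hamming74_encode` and `hamming74_decode` are assumptions made for the example; real memory controllers use wider codes implemented in hardware, and nothing here is taken from a specific system.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits as a 7-bit codeword (list of bits, positions 1..7)."""
    code = [0] * 8  # index 0 unused so that list index equals bit position
    # Data bits occupy the positions that are not powers of two.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    # Each check bit makes the parity of the positions it covers even.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(bits):
    """Return (data, syndrome); a non-zero syndrome is the corrected position."""
    code = [0] + list(bits)
    # XOR of the positions of all set bits is 0 for a valid codeword and
    # equals the position of the flipped bit after a single-bit error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if syndrome:
        code[syndrome] ^= 1  # correct the single-bit error in place
    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return data, syndrome
```

In this sketch the returned syndrome plays the role of the recorded error: a value of zero means the word was read cleanly, and any other value identifies the bit position that was corrected while the data was being read.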
Many systems have focused on the detection and correction of single bit errors; however, multi-bit errors can and do occur. Multi-bit errors, such as double bit errors, are errors in two or more bits occurring within a predefined storage unit, typically a byte. With shrinking geometries of memory circuits resulting from advancement in semiconductor process technology, the importance of multi-bit errors may be increasing relative to single bit errors. Once a single bit error occurs in a portion of memory, the probability that a double bit error will occur in that same portion increases, because the single bit error may indicate that the portion is prone to noise errors or glitches, or will soon suffer a hardware failure.
Commonly-used ECC and other schemes allow for the detection and correction of single bit errors, and the detection of double bit errors in memory data. However, these commonly-used schemes are typically not able to correct any double bit errors that are detected. Thus, if a single bit error is detected, that error is corrected and the memory is monitored for further errors, but if a double bit error is detected, the system logs the error and immediately halts processing to avoid data corruption. After the system is halted, the memory can be removed or replaced, and the system rebooted. Schemes exist for the correction of double-bit or multi-bit errors, but these are not commonly used.
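The distinction drawn above, between a correctable single bit error and a detectable but uncorrectable double bit error, can be illustrated with an extended Hamming(8,4) code, in which one overall parity bit is added to a single-error-correcting code. This is a minimal sketch under stated assumptions: the code choice, the 4-bit data width, the names `secded_encode` and `secded_check`, and the status strings are all hypothetical and chosen for the example, not drawn from any particular memory controller.

```python
def secded_encode(nibble):
    """Encode 4 data bits as 8 bits: index 0 is overall parity, 1..7 Hamming."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of the whole word even
    return code


def secded_check(word):
    """Return (status, data); data is None when the error is uncorrectable."""
    code = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 1 when an odd number of bits has flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single bit error: the syndrome locates it (0 means the parity bit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits have flipped.
        # The error is detected, but its location cannot be determined.
        return "uncorrectable", None
    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, data
```

In a system of the kind described, a "corrected" result would be logged while processing continues, whereas an "uncorrectable" result corresponds to the case in which the error is logged and processing is halted.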
The disadvantage of halting the system and its programs and rebooting the system after a double bit error occurs is that some applications running on the system are deemed “mission critical” and cannot adequately perform their intended function if interrupted. For example, heart monitoring equipment that is controlled by software should not be stopped due to memory errors and resumed only after the system is rebooted, as rebooting heart monitoring software would leave a patient at risk while the reboot takes place.
Accordingly, what is needed is an apparatus and method for reducing the occurrence of double bit memory faults in computer systems while running an operating system, without having to stop processing and reboot the system. The present invention addresses such a need.