Embodiments of the present invention relate in general to computer memory, and more specifically to a method for handling corrected memory errors on kernel text.
Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern computer systems. A memory error is an event that leads to the corruption of one or more bits in the memory. Memory errors can be caused by electrical or magnetic interference (e.g., due to cosmic rays), by problems with the hardware (e.g., a permanently damaged bit), or by corruption along the data path between the memory and the processing elements.
Most enterprise systems employ different mechanisms to recover from different types of memory errors. The recovery mechanism can be at the hardware level or at the software level. At the hardware level, techniques such as error correcting codes (ECCs) are used to recover from single-bit errors and some types of multi-bit errors. Hardware techniques cannot be used to recover from every type of memory error. For example, hardware techniques cannot recover from a memory error if the number of affected bits exceeds the correctable limit of the particular ECC being implemented. Memory errors that are automatically detected and corrected by hardware are referred to as correctable errors (CEs). Memory errors that are detected by hardware but that cannot be corrected by hardware techniques are referred to as uncorrectable errors (UEs). UEs are passed on to the software (e.g., firmware, kernel) through a non-maskable interrupt signaling a non-recoverable hardware error in the system memory. Depending on the location of the UE, software employs different methods in an attempt to recover from the UE. Not all UEs can be recovered at the software level, and a UE that cannot be recovered can lead to a system crash.
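The distinction between CEs and UEs described above can be illustrated with a toy single-error-correcting, double-error-detecting (SECDED) code. The following is a minimal sketch only, assuming an extended Hamming(8,4) code over a 4-bit value; the function names `encode` and `decode` and the status strings are hypothetical and chosen for illustration. Real memory controllers use wider codes implemented in hardware, and some can also correct certain multi-bit errors, which this sketch does not attempt.

```python
# Toy SECDED sketch (assumption: extended Hamming(8,4), not any specific
# hardware ECC). Index 0 holds the overall parity bit; indices 1..7 hold a
# Hamming(7,4) codeword with check bits at positions 1, 2 and 4.

def encode(nibble):
    """Encode a 4-bit value into an 8-bit codeword (list of 0/1 bits)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall (even) parity
    return code

def decode(code):
    """Return (status, value); status is 'OK', 'CE' or 'UE'."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits; it is zero
    # for a valid codeword and equals the flipped position after one flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_violated = sum(code) % 2 == 1

    if syndrome == 0 and not parity_violated:
        status = "OK"
    elif parity_violated:
        # Odd number of flips: treat as a single-bit error and correct it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with intact overall parity: two bits flipped.
        # The error is detected but exceeds the correctable limit.
        return "UE", None

    value = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, value

word = encode(0b1011)
word[5] ^= 1                 # inject a single-bit error
print(decode(word))          # ('CE', 11)
word[2] ^= 1                 # inject a second error in the same word
print(decode(word))          # ('UE', None)
```

In this sketch a single flipped bit is repaired transparently and reported as a CE, whereas two flipped bits in the same word are detected but cannot be repaired, which corresponds to the UE case that hardware must hand off to software.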
A strong correlation has been found to exist between CEs and UEs. A CE on a memory location can increase the probability of a future UE at the same memory location.