The present invention relates generally to the electrical, electronic and computer arts, and, more particularly, to methods, apparatus and systems for memory failure notification.
In high performance computing (HPC), uncorrected errors in the main memory (“memory”) of the computer are one of the main reasons that HPC systems crash or fail. For example, an uncorrected error may cause a crash through unrecoverable corruption of the operating system of the HPC system or of an application running on the HPC system, which may then require the system or application to be restarted. After a crash, the application can sometimes resume from a predefined checkpoint.
A machine check is one way in which system hardware may indicate an internal error. Machine check handlers have been used to signal to the operating system the occurrence of memory parity check errors that are encountered by a memory controller and that cannot be corrected by a memory protection mechanism, such as Error-Correcting Codes (ECC). As is well known by those skilled in the art, a machine check exception occurs when an error cannot be corrected by the hardware, and the exception in turn signals a machine check handler. The memory controller also accounts for corrected and harmless errors, which are errors that do not generate a machine check exception. Such errors are typically tracked: logs of corrected errors, and the monitoring of a corrected error count compared to static thresholds, have been used in proactive HPC system failure avoidance.
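By way of illustration only, the static-threshold monitoring of corrected error counts described above may be sketched as follows. The threshold value, the log format, the module identifiers and the function name are hypothetical and are assumed here solely for the example; they are not drawn from any particular memory controller or operating system interface.

```python
from collections import Counter

# Hypothetical static threshold: corrected errors tolerated per memory module.
CORRECTED_ERROR_THRESHOLD = 3


def modules_over_threshold(error_log, threshold=CORRECTED_ERROR_THRESHOLD):
    """Return the sorted module ids whose corrected-error count exceeds the threshold.

    error_log is an iterable of (module_id, error_kind) pairs, an assumed
    format standing in for whatever log the memory controller exposes.
    """
    # Count only corrected errors; harmless errors are ignored in this sketch.
    counts = Counter(module for module, kind in error_log if kind == "corrected")
    return sorted(module for module, n in counts.items() if n > threshold)


# Hypothetical log: five corrected errors on dimm0, two on dimm1, one harmless.
log = (
    [("dimm0", "corrected")] * 5
    + [("dimm1", "corrected")] * 2
    + [("dimm1", "harmless")]
)
print(modules_over_threshold(log))  # ['dimm0']
```

In this sketch, a module flagged by the comparison would be a candidate for proactive action before an uncorrected error occurs. Because the threshold is fixed in advance, it does not adapt to the error rate actually observed on a given system.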