The present invention relates generally to the electrical, electronic and computer arts, and, more particularly, to memory error protection and management.
In high-performance computing (HPC), typically two or more servers or computers are connected with high-speed interconnects in an HPC cluster. A cluster consists of several servers networked together that act as a single system, where each server in the cluster performs one or more specific tasks. Each of the individual computers or servers in the cluster may be considered a node. The nodes work together to accomplish an overall objective. As such, subtasks are executed on the nodes in parallel to accomplish the overall objective. However, a failure of any one subtask results in a failure of the entire parallel task.
Uncorrected errors in the main memory (“memory”) of a computer are one of the primary reasons HPC systems crash or fail. For example, an uncorrected error may cause a crash through unrecoverable corruption of the operating system of the HPC system or of an application running on the HPC system, which may then require the system or application to be restarted. After such a crash, the application may in some cases resume from a predefined checkpoint.
A machine check is one way in which system hardware may indicate an internal error. Machine check handlers have been used to signal to the operating system the occurrence of memory parity check errors that are encountered by a memory controller and that cannot be corrected by a memory protection mechanism, such as error-correcting codes (ECC). A machine check exception occurs when an error cannot be corrected by the hardware, and the exception in turn signals a machine check handler. The memory controller also accounts for corrected and harmless errors, which are errors that do not generate a machine check exception. Such errors may nonetheless be tracked. Logs of corrected errors, and the monitoring of a corrected error count against static thresholds, have been used in proactive HPC system failure avoidance.
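By way of illustration only, the static-threshold monitoring of corrected error counts described above may be sketched as follows. The sketch is not drawn from any particular system: the record format (a node identifier paired with an error kind), the function names, and the threshold value are all hypothetical and are assumed here solely for purposes of example.

```python
from collections import Counter

# Hypothetical static threshold: the number of corrected errors a node may
# accumulate before it is flagged. The value is an assumption for illustration.
CORRECTED_ERROR_THRESHOLD = 100


def count_corrected_errors(log_entries):
    """Count corrected errors per node.

    log_entries is an iterable of (node_id, error_kind) pairs, an assumed
    record format. Only entries whose kind is "corrected" are counted.
    """
    counts = Counter()
    for node_id, error_kind in log_entries:
        if error_kind == "corrected":
            counts[node_id] += 1
    return counts


def nodes_over_threshold(counts, threshold=CORRECTED_ERROR_THRESHOLD):
    """Return, in sorted order, the nodes whose count exceeds the threshold."""
    return sorted(node for node, n in counts.items() if n > threshold)
```

In such a sketch, a node returned by `nodes_over_threshold` might be drained or serviced before an uncorrected error occurs. Because the threshold is fixed, the same limit applies to every node regardless of workload or error history.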