The present invention relates generally to the electrical, electronic and computer arts, and more particularly relates to memory components and systems.
Frequent system failures caused by an increasing number of uncorrected memory errors that corrupt critical data are a major problem when scaling current high-performance computing (HPC) systems. Mechanisms for fault handling have been deployed at various levels of the software stack, ranging from the application and runtime levels to operating system-level schedulers. Checkpoint/restart techniques have been used as an approach to react to and recover from the occurrence of a system failure. Checkpoint/restart is a facility of the operating system that allows information about an application to be recorded (e.g., in the form of a checkpoint) so that, after an abnormal termination, the application can be restarted from the point at which it was interrupted. A checkpoint is a copy of the system's memory that is periodically saved on disk along with current register settings (e.g., last instruction executed) and any other status indicators. In the event of a system failure, the last checkpoint serves as a recovery point.
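By way of illustration only, the checkpoint/restart behavior described above can be sketched at the application level as follows. This is a minimal sketch under stated assumptions and not an operating-system facility: the function names `save_checkpoint`, `load_checkpoint` and `run`, the `fail_at` parameter used to simulate an abnormal termination, and the use of a pickled dictionary as the saved state are all hypothetical and are introduced solely for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically rename it over path,
    so a failure during the write leaves the previous checkpoint intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every `interval` steps.

    fail_at (hypothetical) simulates an abnormal termination at that step.
    """
    # Restart from the last recovery point if one is available.
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

In this sketch, a run that is interrupted between checkpoints loses only the work performed since the last saved state; calling `run` again with the same path resumes from that recovery point rather than from the beginning.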
Health monitoring capabilities are features commonly employed in commodity and HPC components. Such features have been used to develop predictive failure models and to guide the determination of optimized checkpointing intervals in reactive fault management techniques. Research on the analysis of log files for prediction purposes demonstrates that accurate models for memory failure prediction can be obtained from memory error event history. Proactive schemes exploiting health monitoring capabilities for failure prediction, such as, for example, process-level migration from unhealthy to healthy nodes, have been proposed. The tolerance of a set of scientific applications to the impacts of uncorrected errors has been demonstrated, as have the potential benefits of cooperative fault recovery mechanisms.
It has been projected that standard fault-tolerant methods will not be sufficient to handle the memory error rates expected to affect next-generation HPC systems. As systems scale, mechanisms will be required that avoid critical errors otherwise capable of causing a system failure (e.g., a crash) while at the same time allowing the system to continue running in the presence of tolerable errors. However, while proactive solutions exist, such solutions are highly dependent upon specific system implementations at different layers, and are thus undesirable.