The invention relates to improving the reliability and availability of highly parallel computing systems.
Highly parallel computing systems, with tens to hundreds of thousands of nodes, are potentially subject to a reduced mean-time-to-failure (MTTF) due to a soft error on one of the nodes. This is particularly true in HPC (High Performance Computing) environments running scientific jobs. Such jobs are typically written in such a way that they query how many nodes (or processes) N are available at the beginning of the job and the job then assumes that there are N nodes available for the duration of the run. A failure on one node causes the job to crash. To improve availability such jobs typically perform periodic checkpoints by writing out the state of each node to a stable storage medium such as a disk drive. The state may include the memory contents of the job (or a subset thereof from which the entire memory image may be reconstructed) as well as program counters. If a failure occurs, the application can be rolled-back (restarted) from the previous checkpoint on a potentially different set of hardware with N nodes.
However, on machines with a large number of nodes and a large amount of memory per node, the time to perform such a checkpoint to disk may be large, due to limited I/O bandwidth from the HPC machine to disk drives. Furthermore, the soft error rate is expected to increase due to the large number of transistors on a chip and the shrinking size of such transistors as technology advances.
To cope with such software, processor cores and systems increasingly rely on mechanisms such as Error Correcting Codes (ECC) and instruction retry to turn otherwise non-recoverable soft errors into recoverable soft errors. However, not all soft errors can be recovered in such a manner, especially on very small, simple cores that are increasingly being used in large HPC systems such as BlueGene/Q (BG/Q).
What is needed then is an approach to recover from a large fraction of soft errors without resorting to complete checkpoints. If this can be accomplished effectively, the frequency of checkpoints can be reduced without sacrificing availability.