With the rapid development of high performance computing (HPC) software and hardware, supercomputing has entered the petascale era. Given the size of the high-end clusters common in such supercomputers, fault tolerance has become an essential design factor that HPC system designers must consider for user production environments, because the mean time between failures (MTBF) of the system as a whole drops dramatically in very large scale parallel computing. Currently, checkpoint/restart is the mainstream approach applied by the leading HPC vendors.
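The effect of scale on MTBF can be illustrated with a back-of-envelope model. The sketch below assumes that node failures are independent and exponentially distributed with the same per-node MTBF, in which case the failure rates add and the system MTBF is the node MTBF divided by the node count. The function name and the example figures are hypothetical and chosen only for illustration.

```python
def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Estimate whole-system MTBF from per-node MTBF.

    Assumption (not from the source text): failures are independent and
    exponentially distributed, so per-node failure rates simply add up.
    """
    if node_mtbf_hours <= 0 or node_count <= 0:
        raise ValueError("node_mtbf_hours and node_count must be positive")
    return node_mtbf_hours / node_count


if __name__ == "__main__":
    node_mtbf = 50_000.0  # hypothetical per-node MTBF in hours
    for nodes in (100, 1_000, 10_000, 100_000):
        mtbf = system_mtbf_hours(node_mtbf, nodes)
        print(f"{nodes:>7} nodes -> system MTBF of about {mtbf:.1f} h")
```

Under this simplified model the system MTBF shrinks in direct proportion to the node count, which is why a job spanning the full machine is far more likely to be interrupted than one running on a few nodes.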
For example, the IBM HPC software stack provides a checkpoint/restart function based on container virtualization as a fault-tolerance solution. This mechanism is lightweight compared with other system-level checkpoint methods. IBM MDCR (Metacluster Distributed Checkpoint/Restart) is distributed middleware that works with the IBM parallel environment. It spawns parallel application programs in containers and, when it receives a request from the resource manager, directs those containers to checkpoint or restart at runtime. In particular, IBM checkpoint/restart processing runs transparently and automatically, without any changes to the parallel program code.
In a high-end HPC cluster, however, checkpointing a full-size parallel application while it is running produces a very large volume of statefile data, which can easily slow down or even break down an otherwise robust, high performance global shared file system. This creates a paradox: on the one hand, checkpointing is increasingly unavoidable in ultra-large-scale HPC clusters; on the other hand, for a full-size parallel application, the sheer volume of statefile data makes the checkpoint solution impractical.
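To make the scale of the statefile problem concrete, the following sketch estimates the total statefile volume and the time needed to write it to a shared file system. It assumes that every task writes its full memory image and that the file system sustains a fixed aggregate write bandwidth. All names and numbers are hypothetical illustrations, not measurements of any particular system.

```python
def checkpoint_volume_gib(task_count: int, mem_per_task_gib: float) -> float:
    """Total statefile volume, assuming each task dumps its full memory image."""
    return task_count * mem_per_task_gib


def checkpoint_write_seconds(volume_gib: float, fs_bandwidth_gib_per_s: float) -> float:
    """Time to write the statefiles at a fixed aggregate file system bandwidth."""
    if fs_bandwidth_gib_per_s <= 0:
        raise ValueError("fs_bandwidth_gib_per_s must be positive")
    return volume_gib / fs_bandwidth_gib_per_s


if __name__ == "__main__":
    tasks = 100_000        # hypothetical number of parallel tasks
    mem_per_task = 2.0     # hypothetical memory footprint per task, in GiB
    fs_bandwidth = 100.0   # hypothetical aggregate write bandwidth, in GiB/s

    volume = checkpoint_volume_gib(tasks, mem_per_task)
    seconds = checkpoint_write_seconds(volume, fs_bandwidth)
    print(f"statefile volume: {volume:.0f} GiB")
    print(f"write time: {seconds:.0f} s ({seconds / 60:.1f} min)")
```

With these assumed figures, a single checkpoint would occupy the shared file system for more than half an hour, during which other jobs compete for the same bandwidth.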