High-performance computing (HPC) and cluster computing involve connecting individual computing nodes to create a distributed system capable of solving complex problems. These nodes may be individual desktop computers, servers, processors, or similar machines capable of hosting an individual instance of computation. More specifically, these nodes are constructed out of hardware components including, but not limited to, processors, volatile memory (RAM), magnetic storage drives, mainboards, network interface cards, and the like.
Scalable HPC applications require checkpoint capabilities. In distributed shared memory systems, checkpointing is a fault-tolerance technique that limits how much work a long-running application loses when an error occurs: the application periodically saves its state so that it can resume from the most recent checkpoint rather than from the beginning. Checkpointing also helps preserve system consistency in case of failure. As cluster sizes grow, the mean time between failures (MTBF) decreases, so applications must create checkpoints more frequently. This drives the need for fast checkpoint capabilities.
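The relationship between cluster size, failure rate, and checkpoint frequency can be illustrated with a rough numerical sketch. Nothing below comes from the text above: the sketch assumes that node failures are independent and exponentially distributed, so that the system MTBF is approximately the per-node MTBF divided by the node count, and it borrows Young's first-order approximation for the checkpoint interval, sqrt(2 x checkpoint cost x MTBF), as one common rule of thumb. The function names and the example figures (a per-node MTBF of 100,000 hours and a checkpoint cost of 0.5 hours) are hypothetical.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Approximate MTBF of the whole cluster.

    Assumption: node failures are independent and exponentially
    distributed, so failure rates add and the MTBF scales as 1/N.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_mtbf_hours / node_count


def young_checkpoint_interval_hours(checkpoint_cost_hours: float,
                                    mtbf_hours: float) -> float:
    """Young's first-order approximation of a good checkpoint interval.

    Assumption: the checkpoint cost is small relative to the MTBF.
    """
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


if __name__ == "__main__":
    node_mtbf = 100_000.0   # hypothetical per-node MTBF, in hours
    checkpoint_cost = 0.5   # hypothetical time to write one checkpoint, in hours
    for nodes in (1_000, 10_000, 100_000):
        mtbf = system_mtbf_hours(node_mtbf, nodes)
        interval = young_checkpoint_interval_hours(checkpoint_cost, mtbf)
        print(f"{nodes:>7} nodes: system MTBF {mtbf:8.2f} h, "
              f"checkpoint every {interval:5.2f} h")
```

Under these assumptions, the suggested interval shrinks as nodes are added while the cost of writing each checkpoint stays fixed, so a growing share of run time goes to checkpointing unless the checkpoint itself becomes faster.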