Conventional checkpointing techniques are utilized in various types of computing systems to provide some level of fault tolerance against system failures, e.g., server crashes. In general, a checkpointing process comprises generating a checkpoint image (or snapshot) of the current in-memory state of an application, so that the application can be restarted from that checkpoint in case of a system failure during execution of the application. The checkpointing process is particularly useful for long-running applications that are executed in failure-prone computing systems.
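By way of illustration only, the general checkpoint-and-restart process described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular system: the function names (save_checkpoint, load_checkpoint, run_task), the use of the standard pickle module to serialize the state, the checkpoint interval, and the simulated failure are all hypothetical choices made for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Persist the in-memory state (hypothetical helper).

    The state is written to a temporary file and then renamed into place,
    so a crash during the write does not corrupt the previous checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to non-volatile storage
    os.replace(tmp_path, path)  # atomic rename over the old checkpoint


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_task(total_steps, path, checkpoint_every=10, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at is an assumption of this sketch: it simulates a system
    failure by raising just before the given step executes.
    """
    # Restart from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated system failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["total"]
```

In this sketch, a run of 100 steps that fails at step 57 leaves a checkpoint taken at step 50 on disk, and a second call to run_task with the same path resumes from step 50 rather than from the beginning. Only the work performed since the last checkpoint is repeated.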
For example, in the high-performance computing (HPC) domain, long-running, compute-intensive processing tasks (e.g., training deep learning models) dominate the workloads of server resources within a computing server cluster, and such tasks can take hours, days, or even weeks to execute and deliver results. It is common for a server within the computing system to experience an error at some point during the execution of a relatively long processing task, or for the processing task to be preempted during execution in favor of a higher-priority task. Such errors can range from a software error, a memory failure, or a power failure to a natural disaster. Recovering a computing result by re-executing the program from the beginning up to the point of failure is generally not a good solution, due to the long running time of the processing task and its heavy computing power requirements. Therefore, checkpointing the current program state to non-volatile storage is a better solution for making the system robust and fault tolerant.
Checkpointing in a cloud environment faces many challenges. Such challenges include, but are not limited to, high synchronization overhead, large data movement over the network, and significant use of system resources such as system memory and storage bandwidth. In this regard, a cloud computing environment is poorly suited to single-point checkpointing, wherein all the data is collected and stored in one place. Indeed, Big Data applications can involve terabytes of data, and repeatedly transferring that data to one machine for checkpointing wastes computation power, network bandwidth, and storage.