Many large-scale computing environments, such as high-performance computing (HPC) and cloud computing environments, may incorporate distributed or multi-tier applications and workloads. In other words, more than one instance of a workload may be executing at the same time across multiple applications and/or computing devices (e.g., servers). Crashes or other errors that occur while such distributed workloads are being processed may cause the loss of application state and thus may require large amounts of computational work to be repeated. Accordingly, crashes in large-scale computing environments may be costly and time-consuming.
Some HPC and cloud computing environments support application checkpointing. Typical application checkpointing solutions are purely software-based and allow the computing environment to store periodic snapshots (i.e., checkpoints) of the state of a running application, a virtual machine, or a workload in a non-distributed or single-tier computing environment. Based on the saved checkpoints, a suspended or interrupted application may be resumed or replayed starting from the state of a saved checkpoint, which may allow for quicker or less-expensive crash recovery. However, software checkpointing support may require the checkpointing software to be re-engineered for each supported application and/or operating system. Further, such software-based checkpointing solutions (e.g., hypervisors and virtual machine monitors) are typically dependent on characteristics of the single-tier or non-distributed environment, such as the vendor, the operating system, the type of virtual machine, and the application.
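For illustration only, the software-based checkpointing described above may be sketched in Python for a single, non-distributed workload. The sketch is a minimal example under stated assumptions and is not taken from any particular checkpointing product: the function names (save_checkpoint, load_checkpoint, run_workload), the dictionary used as application state, and the crash_at parameter used to simulate a failure are all hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write a snapshot of `state` to `path` (hypothetical helper).

    The snapshot is written to a temporary file first and then renamed,
    so a crash during the write does not corrupt the previous checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_workload(path, total_steps, checkpoint_every, crash_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    `crash_at` is an assumption of this sketch: it simulates a failure
    at the given step so that recovery can be demonstrated.
    """
    # Resume from the last saved checkpoint, if any.
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        # Store a periodic snapshot of the application state.
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

In this sketch, a run that fails partway through may be restarted by calling run_workload again with the same checkpoint path; only the steps performed after the last saved checkpoint are repeated. The sketch also reflects the limitation noted above: the layout of the saved state is specific to this one application, so a different application would require its own checkpointing logic.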