Many large-scale computing environments, such as high-performance computing and cloud computing environments, may run long-running and highly interdependent processes. Crashes or other errors occurring in the course of such long-running processes may cause the loss of application state and thus may require large amounts of computational work to be repeated. Accordingly, crashes in large-scale computing environments may be quite costly and time-consuming.
Some computing environments support software-based application checkpointing. Application checkpointing allows the computing environment to store periodic snapshots of the state of a running application. After a crash, the application may be resumed or replayed starting from the state of a saved checkpoint, which may allow for quicker or less expensive crash recovery than restarting from the beginning. Typical checkpointing solutions are purely software-based. Thus, checkpointing support may have to be specifically re-engineered for each supported application and/or operating system. Software virtualization solutions such as hypervisors and virtual machine monitors also typically support creating and restoring snapshots of virtual machines, which may provide similar functionality.
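As a purely illustrative sketch, software-based application checkpointing might look like the following Python example. It is not drawn from any particular checkpointing product. The function names (`save_checkpoint`, `load_checkpoint`, `run_sum`), the `crash_at` parameter, the pickle-based file format, and the toy summation workload are all hypothetical choices made for illustration. The example also shows why such support may have to be engineered per application: the application itself decides which state to capture and how to resume from it.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write a snapshot of the application state to `path`.

    The snapshot is written to a temporary file and then renamed, so a
    crash during the write leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic replacement of the old checkpoint


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None


def run_sum(n, path, interval=1000, crash_at=None):
    """Toy long-running job: sum the integers 0..n-1.

    The state is checkpointed every `interval` iterations. `crash_at`
    is a hypothetical hook that simulates a crash at a given iteration.
    """
    # Resume from the saved checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"i": 0, "total": 0}
    while state["i"] < n:
        if crash_at is not None and state["i"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

In this sketch, if a first call to `run_sum` fails partway through, a second call with the same `path` resumes from the most recent checkpoint, so only the iterations since that checkpoint are repeated.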