Checkpointing is a technique for adding fault tolerance to computing systems. It includes, for example, storing a snapshot of the current application state and using that snapshot to restart execution of the application after a failure. The computing system that employs checkpointing may be virtualized, such that a single computer system hosts multiple operating systems, in the form of Virtual Machines (VMs), managed by a hypervisor (e.g., XEN) or other suitable virtual machine monitor. Software checkpointing schemes may be incremental stop (e.g., Copy On Write (COW) or Dirty Bit) or full stop.
In COW, all memory pages of each VM in the computer system are initially marked read-only. The first modification of any page causes a hypervisor trap (i.e., an exception is thrown due to the attempted modification of the page). In servicing the trap, the hypervisor copies the original page into a ring buffer, and it continues to do so for each newly modified page until the next checkpoint is declared. A checkpoint is declared either after a fixed time (e.g., 1 second) or when the ring buffer becomes more than half full. Declaring a checkpoint pauses the VM just long enough to mark all pages read-only and start a new checkpoint ring (R2). The checkpoint can then be saved (to stable storage or to a remote location) by copying the new version of each page recorded in the previous ring (R1), taken either from the VM, if that page is still read-only, or from R2.
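The COW bookkeeping described above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: the names CowVm, write, checkpoint_due, declare_checkpoint and save_checkpoint are hypothetical, pages are modeled as list entries, the hypervisor trap is modeled as an ordinary branch, the ring buffer is modeled as a dictionary keyed by page index, and the fixed-time trigger is omitted.

```python
class CowVm:
    """Toy model of a VM whose pages are write-protected between checkpoints."""

    def __init__(self, pages, ring_capacity):
        self.pages = list(pages)                     # current page contents
        self.read_only = [True] * len(self.pages)    # all pages start read-only
        self.ring = {}                               # page index -> original contents
        self.ring_capacity = ring_capacity           # assumed capacity, in pages

    def write(self, index, value):
        if self.read_only[index]:
            # Simulated trap: first write to this page since the last checkpoint.
            # Preserve the original contents, then let writes through.
            self.ring[index] = self.pages[index]
            self.read_only[index] = False
        self.pages[index] = value

    def checkpoint_due(self):
        # Size-based trigger only: the ring is more than half full.
        return 2 * len(self.ring) > self.ring_capacity

    def declare_checkpoint(self):
        # Brief pause: re-protect every page and start a new ring (R2).
        previous_ring = self.ring                    # this becomes R1
        self.ring = {}
        self.read_only = [True] * len(self.pages)
        return previous_ring


def save_checkpoint(vm, previous_ring):
    """Collect the checkpoint-time version of every page listed in R1."""
    saved = {}
    for index in previous_ring:
        if vm.read_only[index]:
            # Untouched since the checkpoint: VM memory still holds that version.
            saved[index] = vm.pages[index]
        else:
            # Modified since the checkpoint: the version was preserved in R2.
            saved[index] = vm.ring[index]
    return saved
```

In this sketch, save_checkpoint can run while the VM keeps writing, because any page changed after the checkpoint has its checkpoint-time contents preserved in the new ring.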
Dirty Bit checkpointing is similar. All pages of the VM are initially marked clean. Any page that is modified has its hardware dirty bit set. Declaring a checkpoint pauses the VM while all pages marked dirty are copied and all pages are then marked clean again. The VM then executes with no overhead until the next checkpoint.
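A comparable sketch of the Dirty Bit scheme follows. It is illustrative only: the class name DirtyBitVm and its methods are hypothetical, and the hardware dirty bit is modeled as a Boolean flag per page that the simulated write sets without any trap.

```python
class DirtyBitVm:
    """Toy model of a VM whose modified pages are tracked by dirty bits."""

    def __init__(self, pages):
        self.pages = list(pages)                  # current page contents
        self.dirty = [False] * len(self.pages)    # all pages start clean

    def write(self, index, value):
        # No trap and no copy: the (simulated) hardware just sets the bit.
        self.pages[index] = value
        self.dirty[index] = True

    def declare_checkpoint(self):
        # The VM is considered paused for the whole body of this method.
        copied = {i: page for i, page in enumerate(self.pages) if self.dirty[i]}
        self.dirty = [False] * len(self.pages)    # mark all pages clean again
        return copied
```

Unlike the COW scheme, all copying here happens while the VM is paused, so the pause grows with the number of pages dirtied since the previous checkpoint.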
In full stop, the VM is paused and a full checkpoint is taken before execution of the application continues. Compared to COW, there is no overhead (i.e., no COW overhead and no ring buffer of modified pages) before the checkpoint event. There is, however, significant overhead at checkpoint time, because the full checkpoint is taken while the VM is paused.
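For contrast, a full stop checkpoint can be sketched in the same toy style. The class name FullStopVm is hypothetical and pages are again modeled as list entries. Writes carry no bookkeeping at all, while the checkpoint copies every page regardless of how many were modified.

```python
class FullStopVm:
    """Toy model of a VM checkpointed by copying all of its pages."""

    def __init__(self, pages):
        self.pages = list(pages)    # current page contents

    def write(self, index, value):
        # No trap, no dirty tracking, no ring buffer.
        self.pages[index] = value

    def declare_checkpoint(self):
        # The VM is considered paused while every page is copied.
        return list(self.pages)
```

The cost of declare_checkpoint in this sketch is proportional to the total number of pages, which is the checkpoint-time overhead noted above.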