The advantages of virtual machine (VM) technology have become widely recognized. Among these advantages is the ability to run multiple virtual machines on a single host platform. This makes better use of the capacity of the hardware, while still ensuring that each user enjoys the features of a “complete,” isolated computer.
The advantages of various types of checkpointing are also widely recognized. Checkpointing can provide a backup of some aspect of a computer system, and it provides the ability to revert to a previously generated checkpoint to undo changes to some aspect of the computer system or to recover from a failure affecting the computer system. One particularly advantageous use of checkpointing is to capture the state of a long-running computation, so that, if the computation fails at some point, it can be resumed from the checkpointed state, instead of having to restart the computation from the beginning.
Fast and frequent checkpointing of virtual machines is a useful technology for a number of applications: (1) continuous checkpointing allows users to revert their application to almost any previous point in time; (2) reverse debugging based on deterministic replay requires frequent checkpoints to reduce the amount of replay from a previous checkpoint that is needed to execute backwards; (3) fast checkpoints can speed up an application by allowing speculative calculations that can be reverted if necessary; and (4) fast checkpoints enable fault tolerance.
With respect to (4), checkpoints mirror a primary VM with a secondary VM, such that the secondary VM can resume without loss of data if the host running the primary VM is terminated due to hardware failure. A key technique for ensuring that clients observe no data loss is that the primary VM must withhold all network output until it has sent, and received acknowledgement for, all the data of the checkpoint that follows that network output. If it does not do this, it is possible that upon failover the secondary VM will lack data that the primary VM had already acknowledged to clients, causing an irreparable inconsistency. Such an issue is prevented by withholding the network output of the primary VM until the secondary VM has received all the checkpoint data up to that point.
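The output-withholding rule described above can be illustrated with a minimal sketch. The class and method names (`OutputCommitBuffer`, `send`, `take_checkpoint`, `on_checkpoint_ack`) are hypothetical and chosen only for illustration; the sketch assumes that each outgoing packet is tagged with the checkpoint interval (epoch) in which it was produced and that the secondary VM acknowledges checkpoints in order.

```python
from collections import deque


class OutputCommitBuffer:
    """Holds network output of a primary VM until the checkpoint that
    follows it has been acknowledged by the secondary VM (illustrative)."""

    def __init__(self):
        self._pending = deque()  # (epoch, packet) pairs awaiting release
        self._epoch = 0          # epoch in which the VM is currently running
        self.released = []       # packets actually put on the wire

    def send(self, packet):
        # Output is never transmitted directly; it is tagged with the
        # current epoch and buffered.
        self._pending.append((self._epoch, packet))

    def take_checkpoint(self):
        # Closes the current epoch and returns its identifier. The
        # checkpoint data for this epoch would be sent to the secondary here.
        closed = self._epoch
        self._epoch += 1
        return closed

    def on_checkpoint_ack(self, epoch):
        # The secondary now holds all state up to and including `epoch`,
        # so output produced in that epoch or earlier is safe to release.
        while self._pending and self._pending[0][0] <= epoch:
            self.released.append(self._pending.popleft()[1])


buf = OutputCommitBuffer()
buf.send("reply-1")
buf.send("reply-2")
epoch = buf.take_checkpoint()
buf.send("reply-3")            # belongs to the next epoch
buf.on_checkpoint_ack(epoch)   # releases reply-1 and reply-2 only
```

In this sketch, "reply-3" remains buffered until a later checkpoint is taken and acknowledged, which is the source of the added network output latency.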
While withholding the network output of the primary VM ensures correctness, it creates a dependency between the latency of network output for the primary VM and the size and frequency of checkpoints. On the one hand, if the goal is to minimize network output latency, it is best to take checkpoints as frequently as possible. On the other hand, taking checkpoints frequently adds considerable CPU overhead, diverting CPU cycles from the VM to checkpointing-related tasks. Thus, when considering CPU utilization, it is best to take checkpoints as infrequently as possible.
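The opposing pressures can be made concrete with a deliberately simplified model. The model is an assumption introduced here for illustration: it treats the CPU cost of a checkpoint and the time to transfer and acknowledge it as constants, whereas in practice both depend on checkpoint size. The function name `checkpoint_tradeoff` and its parameters are hypothetical.

```python
def checkpoint_tradeoff(interval_ms, cpu_cost_ms, transfer_ms):
    """Return (cpu_overhead_fraction, avg_output_delay_ms) for a given
    checkpoint interval under a simplified, constant-cost model."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    # Fraction of CPU time spent on checkpointing rather than the workload.
    cpu_overhead = cpu_cost_ms / interval_ms
    # Output generated at a uniformly random time within an interval waits,
    # on average, half an interval for the next checkpoint, plus the time
    # to transfer that checkpoint and receive its acknowledgement.
    avg_delay = interval_ms / 2 + transfer_ms
    return cpu_overhead, avg_delay


for interval in (10, 100):
    overhead, delay = checkpoint_tradeoff(interval, cpu_cost_ms=2, transfer_ms=1)
    print(f"interval={interval} ms: overhead={overhead:.0%}, avg delay={delay} ms")
```

Under these assumed numbers, shortening the interval from 100 ms to 10 ms lowers the average output delay from 51 ms to 6 ms but raises the CPU overhead from 2% to 20%.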
The effect of these two components of performance on overall workload performance is subtle. If checkpoints are taken too often, CPU cycles are wasted unnecessarily; if they are not taken often enough, the network output latency is increased unnecessarily. The solution to this problem is not obvious because, in general, it is not possible to tell whether a workload would rather trade CPU for network latency or vice versa. There is no good way to determine the right balance for a generic workload.
In addition, the balance resulting from a given checkpoint frequency is highly workload-dependent. Given this difficulty, a common solution to this problem is to create a fixed-frequency timer that simply takes checkpoints at regular intervals, with the frequency set to an arbitrary fixed value. Such a solution, however, ignores the trade-offs between minimizing network latency and minimizing CPU overhead, and ignores opportunities for optimization that may be workload-specific.
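The fixed-frequency approach can be sketched as follows. This is an illustrative simulation using a simulated clock rather than a real timer; the function name `run_fixed_interval` and the `take_checkpoint` callback are hypothetical.

```python
def run_fixed_interval(total_ms, interval_ms, take_checkpoint):
    """Simulate a fixed-frequency checkpoint timer over `total_ms` of
    simulated time and return the times at which checkpoints were taken."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    times = []
    now = interval_ms
    while now <= total_ms:
        # The timer fires regardless of how much output is waiting or how
        # busy the CPU is; no workload information influences the schedule.
        take_checkpoint(now)
        times.append(now)
        now += interval_ms
    return times


taken = run_fixed_interval(100, 20, lambda t: None)
```

Because the interval is the only input to the schedule, an idle workload and a network-heavy workload are checkpointed identically, which is the limitation noted above.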