As large-scale enterprises continue to adopt virtualization platforms as the foundation of their data centers, virtual machine (VM) fault tolerance has become an increasingly important feature to be provided by virtualization platform providers. Because a single host server in a virtualized data center can support multiple VMs, failure of that host server can bring down a multitude of services that were provided by the different VMs running on the failed host server. As such, virtualization platforms need to provide a mechanism to quickly resurrect a failed VM on a different host server so that the enterprise can maintain the quality of its service.
Currently, fault tolerance for a primary VM is typically achieved by providing a backup VM that runs on a server residing in a different “fault domain” from the server of the primary VM. A fault domain can generally be described as a set of host servers in a data center (or data centers) that share a number of specified attributes and/or characteristics that result in a higher probability of failure of host servers in the fault domain upon a failure of one of the host servers in the fault domain. The attributes and/or characteristics utilized by an enterprise to define its data center fault domains depend upon the type of disasters and the level of recovery that the enterprise desires to achieve. For example, an enterprise may choose to define its fault domains based upon the physical proximity of host servers (storage rack location, geographic locations, etc.), the dependency of such servers on shared hardware (networked storage, power sources, physical connections, etc.) or software technologies (shared file systems, etc.), and the like. A well-constructed fault domain minimizes the correlation of a failure of a VM in one fault domain with the failure of another VM in a different fault domain.
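The selection of a backup host outside the primary host's fault domain can be illustrated with a minimal sketch. The `Host` fields, the `FAULT_DOMAIN_ATTRS` tuple, and the function names below are hypothetical and chosen only for illustration; an enterprise would substitute whichever attributes it uses to define its own fault domains.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Host:
    # Hypothetical attributes; real fault domains may use others.
    name: str
    rack: str
    power_source: str
    datastore: str


# Assumption: two hosts are in the same fault domain if they share
# ANY one of these attributes.
FAULT_DOMAIN_ATTRS = ("rack", "power_source", "datastore")


def shares_fault_domain(a: Host, b: Host) -> bool:
    """Return True if the two hosts share at least one listed attribute."""
    return any(getattr(a, attr) == getattr(b, attr) for attr in FAULT_DOMAIN_ATTRS)


def pick_backup_host(primary: Host, candidates: Iterable[Host]) -> Optional[Host]:
    """Return the first candidate outside the primary's fault domain, or None."""
    for host in candidates:
        if host != primary and not shares_fault_domain(primary, host):
            return host
    return None
```

In this sketch, a candidate on the same rack or the same power source as the primary host is rejected even if every other attribute differs, which reflects the goal of minimizing correlated failures.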
VM fault tolerance may be provided using deterministic replay, checkpointing, or a hybrid of the two, which is disclosed in U.S. patent application Ser. No. 12/259,762, filed on Aug. 28, 2008, the entire contents of which are incorporated by reference herein. With replay techniques, essential portions of a primary VM's instruction stream (e.g., non-deterministic events within the primary VM's instruction stream) are captured in real time (e.g., by a hypervisor layer or virtual machine monitor component of the primary VM) and transmitted to a backup VM (presumably located in a different fault domain) to “replay” the primary VM's execution in a synchronized fashion. If the primary VM fails, the backup VM can then take over without discernible loss of time. While replay techniques provide a robust fault tolerance solution with fast recovery times, they are less viable, for example, when non-deterministic events become more frequent or more difficult to identify within instruction streams, as is the case with virtual machines that support SMP (symmetric multiprocessing) architectures with multiple virtual CPUs.
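The principle behind replay can be illustrated with a toy sketch: if the only non-determinism in a computation comes from logged events, a second instance that consumes the same log reaches the same state. The class and function names below are hypothetical, and the random integers merely stand in for non-deterministic events such as device inputs; this is not the mechanism of any particular hypervisor.

```python
import random
from collections import deque


class RecordingSource:
    """Primary side: produces non-deterministic events and logs each one."""

    def __init__(self, log):
        self.log = log

    def next_event(self):
        value = random.randint(0, 255)  # stand-in for a non-deterministic input
        self.log.append(value)          # captured for transmission to the backup
        return value


class ReplaySource:
    """Backup side: feeds back the logged events in their original order."""

    def __init__(self, log):
        self.pending = deque(log)

    def next_event(self):
        return self.pending.popleft()


def run_guest(source, steps):
    """A deterministic 'guest' whose only non-determinism comes from source."""
    state = 0
    for _ in range(steps):
        state = (state * 31 + source.next_event()) % 1_000_003
    return state


if __name__ == "__main__":
    log = []
    primary_state = run_guest(RecordingSource(log), 100)
    backup_state = run_guest(ReplaySource(log), 100)
    print(primary_state == backup_state)
```

The sketch also hints at the limitation noted above: every non-deterministic event must be identified and logged, which becomes harder when multiple virtual CPUs interleave their accesses to shared memory.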
In contrast to replay techniques, checkpointing-based fault tolerance techniques are more flexible in their capabilities to support a variety of virtual architectures, including SMP-based virtual machines. Techniques for generating and using checkpoints in a virtual computer system are disclosed in U.S. Pat. No. 7,529,897, the entire contents of which are incorporated by reference herein. With checkpointing, the primary VM is periodically stunned (i.e., execution is temporarily halted) during the course of execution (each such stun period referred to as a “checkpoint”) to determine any modifications made to the state of the primary VM since a prior checkpoint. Once such modifications are determined, they are transmitted to the backup VM, which is then able to merge the modifications into its current state, thereby reflecting an accurate state of the primary VM at the time of the checkpoint. Only upon notification of a failure of the primary VM does the backup VM begin running, by loading the stored state of the primary VM into its own execution state. However, due to the potentially large size of the checkpoint information to be transmitted (e.g., multiple gigabytes) and the need to stun the primary VM at periodic checkpoints to transmit such state to the backup VM, the backup VM must be networked to the primary VM with sufficiently high bandwidth such that the stun period is not prolonged by network bandwidth limitations. This constraint currently restricts the ability to locate backup VMs in locations that are geographically distant from the primary VM, or otherwise to connect backup VMs to primary VMs over network connections that lack the bandwidth capacity to transmit checkpoint information effectively.
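The checkpoint cycle and its bandwidth sensitivity can be illustrated with a minimal sketch. The page-granular dirty tracking, the class names, and the `transfer_seconds` helper are assumptions made for illustration only; the figures in the usage note are likewise illustrative rather than measured.

```python
class PrimaryVM:
    """Toy primary VM that tracks which memory pages changed."""

    def __init__(self, num_pages):
        self.pages = [b""] * num_pages
        self.dirty = set()  # pages modified since the prior checkpoint

    def write(self, page_no, data):
        self.pages[page_no] = data
        self.dirty.add(page_no)

    def checkpoint(self):
        """Collect the modifications made since the prior checkpoint.

        In a real system the VM is stunned while this runs.
        """
        delta = {n: self.pages[n] for n in self.dirty}
        self.dirty.clear()
        return delta


class BackupVM:
    """Toy backup VM that merges checkpoint deltas into its stored state."""

    def __init__(self, num_pages):
        self.pages = [b""] * num_pages

    def merge(self, delta):
        for page_no, data in delta.items():
            self.pages[page_no] = data


def transfer_seconds(delta_bytes, bandwidth_bits_per_sec):
    """Time to send a checkpoint, ignoring latency and protocol overhead."""
    return delta_bytes * 8 / bandwidth_bits_per_sec
```

Under these simplifying assumptions, sending 2 gigabytes of checkpoint information takes `transfer_seconds(2 * 10**9, 10**9)`, or 16 seconds, over a 1 Gbit/s link, and 160 seconds over a 100 Mbit/s link. If the primary VM remains stunned for the duration of the transfer, the lower-bandwidth link prolongs the stun period tenfold.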