As the advantages of virtual machine (VM) technology have become widely recognized, more and more companies are running multiple virtual machines on a single host platform as a way of consolidating multiple workloads on the same server; such consolidation improves the utilization of their computing resources and reduces costs. In addition, just as virtual machine technologies create abstracted versions of the underlying hardware to achieve server consolidation, so too can virtual machine technology be used to achieve software fault tolerance of two or more virtual machines running on separate physical host platforms.
Two virtual machines running on two separate physical hosts can form a “fault-tolerant” virtual machine pair that behaves, as far as the outside world is concerned, as a single “logical” virtual machine. Such an organization of two virtual machines protects against a single failure; that is, if one virtual machine fails or its physical host crashes, the other virtual machine takes over and continues executing operations as if nothing untoward had occurred. In such an approach, one virtual machine in the pair is designated the primary virtual machine and the other is designated the secondary virtual machine. Users interact with the logical virtual machine only via the primary virtual machine; the secondary virtual machine is invisible to them.
For the primary virtual machine to fail over to the secondary virtual machine without loss of availability or data, the secondary virtual machine needs to have the same state information that the primary virtual machine had at the time of its failure. To ensure this, the primary virtual machine, during normal operation, sends enough data to the secondary virtual machine that the state of the secondary virtual machine tracks the state of the primary virtual machine as closely as possible. If the secondary virtual machine tracks the state of the primary virtual machine exactly at every instruction boundary, then the secondary virtual machine is said to be in “lockstep” with the primary virtual machine. Unfortunately, lockstep virtual machines severely affect performance during normal operation because the primary virtual machine must wait synchronously for the secondary virtual machine to update its state before an operation can complete successfully.
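The synchronous cost of lockstep can be illustrated with a minimal Python sketch. It is not drawn from any particular implementation: `ToyVM`, `LockstepPair`, and the integer "state" are hypothetical stand-ins, and an in-process call takes the place of the network round trip to the secondary host.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical stand-in for a VM: its entire state is one integer."""
    state: int = 0

    def apply(self, event: int) -> None:
        # Deterministic transition: the same event sequence always
        # produces the same state.
        self.state = self.state * 31 + event


@dataclass
class LockstepPair:
    """Assumed model of a lockstep pair (illustrative only)."""
    primary: ToyVM = field(default_factory=ToyVM)
    secondary: ToyVM = field(default_factory=ToyVM)
    sync_waits: int = 0  # how many times the primary had to wait

    def execute(self, event: int) -> int:
        self.primary.apply(event)
        # Synchronous step: the primary does not return until the
        # secondary has applied the same event. In a real system this
        # would be a round trip to another physical host.
        self.secondary.apply(event)
        self.sync_waits += 1
        return self.primary.state

    def fail_over(self) -> ToyVM:
        # The secondary already holds the primary's state, so it can
        # take over without any catch-up work.
        return self.secondary


pair = LockstepPair()
for event in (3, 1, 4):
    pair.execute(event)
```

Every call to `execute` pays for one synchronous wait, which is the performance penalty described above; in exchange, `fail_over` has nothing left to do.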
To achieve better performance during normal operation than a lockstep approach, the secondary virtual machine's state can be allowed to lag behind the primary virtual machine's state, at the cost of potential disruption of service because of longer takeover times. This approach is sometimes called “virtual lockstep.” A drawback of this approach is that, upon failure, the secondary virtual machine is not available immediately: before it becomes available, it must catch up to the state the primary virtual machine had at the time it failed.
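The trade-off can be sketched in Python as follows. This is only an illustration under stated assumptions: `ReplayVM`, `LaggingPair`, and the `pending` queue are hypothetical names, and the sketch assumes every logged event has already reached the secondary's host by the time the primary fails.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ReplayVM:
    """Hypothetical stand-in for a VM: its entire state is one integer."""
    state: int = 0

    def apply(self, event: int) -> None:
        # Deterministic transition, so replaying the same events
        # reproduces the same state.
        self.state = self.state * 31 + event


@dataclass
class LaggingPair:
    """Assumed model of a pair whose secondary may lag (illustrative only)."""
    primary: ReplayVM = field(default_factory=ReplayVM)
    secondary: ReplayVM = field(default_factory=ReplayVM)
    # Events the primary has executed but the secondary has not yet replayed.
    pending: deque = field(default_factory=deque)

    def execute(self, event: int) -> int:
        self.primary.apply(event)
        self.pending.append(event)  # log the event; do not wait for replay
        return self.primary.state

    def replay_some(self, max_events: int) -> int:
        """Let the secondary replay up to max_events logged events."""
        replayed = 0
        while self.pending and replayed < max_events:
            self.secondary.apply(self.pending.popleft())
            replayed += 1
        return replayed

    def fail_over(self) -> int:
        """Drain the backlog before takeover; return how many events it held."""
        backlog = len(self.pending)
        self.replay_some(backlog)
        return backlog


pair = LaggingPair()
for event in (3, 1, 4, 1, 5):
    pair.execute(event)           # the primary never blocks here
pair.replay_some(2)               # the secondary has replayed only two events
lag_before_failure = len(pair.pending)
state_before_catch_up = pair.secondary.state
catch_up_work = pair.fail_over()  # takeover is delayed by the backlog
```

In this model the backlog size at the moment of failure stands in for the takeover delay: the further the secondary is allowed to lag, the less the primary waits during normal operation and the longer the catch-up before the secondary becomes available.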