The disclosure generally relates to virtual machines and, more specifically, to techniques for virtual machine management.
In computer systems, the use of virtual machines (VMs) is increasingly common, with an individual VM being provided to handle anything from an individual program or process up to a complete operating system (OS). Individual processors may host one or more VMs, with a processor software layer (referred to as a VM monitor (VMM) or hypervisor) that supports the VMs. While it is generally beneficial for VMs to be isolated, inter-communication between VMs is required in many situations. In fault-tolerant systems (typically high-importance systems, e.g., server architectures or alarm systems), back-up provision is made such that, when a component fails, a replacement can be switched in to allow operation of the system to continue with minimal interruption. In a fault-tolerant system that includes multiple VMs, a back-up provision may include additional processing capacity (in some instances on a connected but physically separate machine) within which a replacement VM can be instantiated in the event of failure.
In general, to minimize delays, a replacement VM should be able to take over the operations of a failing VM as quickly as possible. As such, a mechanism should be provided such that a replacement VM is aware of the point in a program or process where the failure occurred, so that the replacement VM can resume operation from that point. One option is to run a replacement machine in parallel with an original machine, with the replacement machine receiving the same input data as the original machine. However, implementing parallel redundant machines is costly, because processing power is duplicated to maintain operation of the replacement machine.
U.S. Patent Application Publication No. 2008/0189468 (Schmidt) and U.S. Pat. No. 7,213,246 (van Rietschote) describe systems of multiple VMs that utilize an alternate strategy. In operation, for a given original VM, a description of the VM and current VM state data are periodically gathered and stored in order to allow for creation of a replacement VM on failure of the original VM. U.S. Patent Application Publication No. 2008/0155208 (Hiltgen) describes a similar system and addresses security issues for handling captured state data. Systems that store a description of a VM and current VM state data have a lower processing overhead than systems that run a parallel VM, but are slower to transition in the event of failure, as it is first necessary to instantiate a replacement VM before the replacement VM can take over operations for an original VM.
A VM mirror is a way of running a VM such that, if a failure occurs, the failing VM can be restarted almost instantly on a second machine. State data is continually exchanged between a primary VM and a secondary machine through a process known as checkpointing, in which the state of the primary VM is periodically captured and transferred to the secondary machine so that it is available in the event of a failure of the primary VM. An example of a checkpointing VM system is described in U.S. Patent Application Publication No. 2010/0107158 (Chen).
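The checkpointing idea described above can be illustrated with a minimal sketch. It is not taken from any of the cited publications; the class names PrimaryVM and SecondaryHost, the dictionary used as VM state, and the step-count interval are all hypothetical and chosen only for illustration.

```python
import copy


class PrimaryVM:
    """Hypothetical stand-in for a VM; its 'state' is a plain dictionary."""

    def __init__(self):
        self.state = {"counter": 0}

    def step(self):
        # One unit of work that mutates the VM state.
        self.state["counter"] += 1


class SecondaryHost:
    """Hypothetical secondary machine that keeps the last received checkpoint."""

    def __init__(self):
        self.last_checkpoint = None

    def receive_checkpoint(self, state):
        # Store an independent copy so later primary changes do not leak in.
        self.last_checkpoint = copy.deepcopy(state)

    def take_over(self):
        # Build a replacement VM from the last checkpoint.
        vm = PrimaryVM()
        vm.state = copy.deepcopy(self.last_checkpoint)
        return vm


def run_with_checkpoints(primary, secondary, steps, interval):
    """Run the primary and checkpoint every `interval` steps (an assumed policy)."""
    for i in range(1, steps + 1):
        primary.step()
        if i % interval == 0:
            secondary.receive_checkpoint(primary.state)


primary = PrimaryVM()
secondary = SecondaryHost()
run_with_checkpoints(primary, secondary, steps=7, interval=3)

# Simulated failure of the primary after step 7: the replacement resumes
# from the checkpoint taken at step 6, so one step of work is lost.
replacement = secondary.take_over()
```

In this sketch the replacement resumes from the state captured at the most recent checkpoint, not from the exact point of failure, which is why the checkpoint interval bounds how much work must be repeated.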
In the event of failure, a secondary VM, which is a mirror of a primary VM at a last checkpoint before failure, can take over operations from the last checkpoint before failure. As should be recognized, the shorter the interval between checkpoints, the closer the state of a secondary VM is to the state of a primary VM. However, as there is a processing overhead associated with the checkpoint operation, a balance has to be struck between overhead and frequency of checkpointing. An additional issue with a checkpointing system is that, in order to avoid duplication of external network traffic generated by a primary VM and by its respective secondary VM (which, after a failure, re-executes from the last checkpoint), any external network data packets generated by the primary VM should be buffered until a subsequent checkpoint has passed. Unfortunately, buffering external network data packets introduces operation delays, especially when a relatively long checkpoint interval is used.
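The packet buffering described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the mechanism of any cited system: the class BufferedPrimary and its methods send, checkpoint, and fail are hypothetical names, and packets are represented as plain strings.

```python
class BufferedPrimary:
    """Hypothetical primary VM that holds outbound packets until a checkpoint."""

    def __init__(self):
        self.pending = []   # packets generated since the last checkpoint
        self.released = []  # packets actually placed on the external network

    def send(self, packet):
        # Outbound traffic is buffered rather than sent immediately.
        self.pending.append(packet)

    def checkpoint(self):
        # Assumption: the state transfer to the secondary completes here.
        # Only then is it safe to release the buffered packets.
        self.released.extend(self.pending)
        self.pending.clear()

    def fail(self):
        # On failure, unreleased packets are dropped; a secondary resuming
        # from the last checkpoint would generate them again itself.
        dropped = list(self.pending)
        self.pending.clear()
        return dropped


vm = BufferedPrimary()
vm.send("a")
vm.send("b")
released_before_checkpoint = list(vm.released)  # nothing is visible externally yet

vm.checkpoint()  # "a" and "b" are released together
vm.send("c")     # generated after the checkpoint, still buffered
dropped = vm.fail()
```

Because a packet waits in the buffer until the next checkpoint, its worst-case added latency in this sketch is roughly one checkpoint interval, which is the delay the paragraph above refers to.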