There are many reasons for migrating a running virtual machine (VM) from one system to another in a network or cluster of processing nodes. These reasons may include: (a) balancing computing load across nodes—if one node is out of resources while other nodes have free resources, then VMs can be moved among nodes to balance the computing load; (b) individual nodes of a cluster can be shut down for maintenance without shutting down VMs running on the node—the VMs can be migrated to other nodes in the cluster; and (c) new nodes can be immediately utilized as they are added to the cluster—currently running VMs can be migrated from nodes that are over-utilized to newly added nodes that have free resources. In addition, it may be necessary to add or remove resources from a server—this need not be related to requirements of the hardware itself, but rather it may be needed to meet the requirements of a particular user/customer. A particular user, for example, may request (and perhaps pay for) more memory, more CPU time, etc., all of which may necessitate migration of a VM to a different server.
During migration of a VM, the time the migrated VM is unavailable should be minimized. If the VM is unavailable for more than a relatively short time, service level agreements with clients that depend on services exported by the VM may be unmet. In addition, the migration should be transparent to clients of the VM. In further addition, the time the VM is dependent on a state stored on a source machine should also be minimized because, as long as the VM is dependent on the source machine, the VM is less fault-tolerant than before it was migrated.
The ESX Server product from VMware, Inc. of Palo Alto, Calif., provides a mechanism for checkpointing an entire state of a VM. When a VM is suspended, all of its state (including its memory) is written to a file on disk. A VM can then be migrated by suspending the VM on one server, and resuming it via shared storage on another server. Writing out the saved state, especially the memory, to disk and then reading it back in again on the new server can take a relatively large amount of time, especially for VMs with large memories. A 512 Mbyte VM, for example, takes about 20-30 seconds to suspend and then resume again. This may be an issue as a delay as short as ten seconds may be noticeable to a user or in violation of a service level agreement.
Nelson et al., in “Fast Transparent Migration for Virtual Machines,” published April 2005, proposed a system for transferring a VM from one physical machine to another physical machine by “pre-copying” virtual machine memory. Further, Sapuntzakis et al., in “Optimizing the Migration of Virtual Computers,” published December, 2002, proposed a mechanism, referred to as a “capsule,” for moving the state of a running computer across a network, including the state of its disks, memory, CPU registers, and I/O devices. As the capsule state is a hardware state, according to Sapuntzakis et al., it includes the entire operating system as well as applications and running processes. Each of these proposals for migrating a virtual machine intends to do so quickly and efficiently while minimizing any perceived disruption to a user.
Some VM migration techniques consume large amounts of disk bandwidth, especially if the suspend-and-resume operations must be done quickly. This can be problematic if the reason a VM is being migrated is because the machine (server) that it is running on is low on available disk bandwidth. Powering down, checkpointing, and restoring are time-intensive operations, in the context of a running system, due to the large amounts of data that must be stored, transferred, and reinitiated.