Live migration of a virtual machine (VM) refers to the transfer of a running VM over the network from one physical machine to another. Within a local area network (LAN), live VM migration mainly involves the transfer of the VM's CPU and memory state, assuming that the VM uses network attached storage, which does not require migration. Some of the key metrics to measure the performance of VM migration are as follows.
Total migration time is the time from the start of migration at the source to its completion at the target.
Downtime is the duration for which a VM's execution is suspended during migration.
Network traffic overhead is the additional network traffic due to VM migration.
Application degradation is the adverse performance impact of VM migration on applications running anywhere in the cluster.
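The four metrics above can be made concrete with a minimal sketch. The record layout, field names, and the choice of relative throughput loss as the degradation measure are illustrative assumptions, not part of the described technology.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MigrationRecord:
    """Hypothetical per-VM measurements; all field names are assumptions."""
    start_s: float    # migration begins at the source
    suspend_s: float  # VM execution is suspended
    resume_s: float   # VM execution resumes at the target
    end_s: float      # migration completes at the target
    bytes_sent: int   # bytes transferred on behalf of this migration


def total_migration_time(r: MigrationRecord) -> float:
    # Time from the start of migration at the source to completion at the target.
    return r.end_s - r.start_s


def downtime(r: MigrationRecord) -> float:
    # Duration for which the VM's execution is suspended.
    return r.resume_s - r.suspend_s


def network_traffic_overhead(records: Iterable[MigrationRecord]) -> int:
    # Additional bytes placed on the network by all migrations combined.
    return sum(r.bytes_sent for r in records)


def application_degradation(baseline_tput: float, tput_during: float) -> float:
    # One possible measure (an assumption): the fraction of application
    # throughput lost while migration is in progress.
    return (baseline_tput - tput_during) / baseline_tput
```

For a gang migration, the per-VM records would be aggregated, for example by summing `bytes_sent` across all migrating VMs as `network_traffic_overhead` does.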
The present invention relates to gang migration [8], i.e., the simultaneous live migration of multiple VMs that run on multiple physical machines in a cluster. The cluster, for example, may be assumed to have a high-bandwidth, low-delay interconnect such as Gigabit Ethernet [10], 10 GigE [9], InfiniBand [15], or the like. Datacenter administrators may need to perform gang migration to handle resource re-allocation for peak workloads, imminent failures, cluster maintenance, or powering down of several physical machines to save energy.
The present technology specifically focuses on reducing the network traffic overhead due to gang migration. Users and service providers of a virtualized infrastructure have many reasons to perform live VM migration, such as routine maintenance, load balancing, scaling to meet performance demands during peak hours, and consolidation to save energy during non-peak hours by using fewer servers. Since gang migration can transfer hundreds of gigabytes of data over the network, it can overload the core links and switches of the datacenter network. Gang migration can also adversely affect performance at the network edges, where the migration traffic competes with the bandwidth requirements of applications within the VMs. Reducing the network traffic overhead can also indirectly reduce the total time for migrating multiple VMs and the application degradation, depending upon how the traffic reduction is achieved.
Virtual machines (VMs) [100] have emerged as one of the critical building blocks of modern cloud infrastructures due to cost savings, elasticity, and ease of administration, which motivates the development of new techniques to improve the performance, robustness, and security of live VM migration. Virtualization technologies [118, 58, 79] have been rapidly adopted in large Infrastructure-as-a-Service (IaaS) platforms [46, 107, 111, 112] that offer cloud computing services on a utility-like model. Live migration of VMs [116, 5, 13] is a key feature and selling point for virtualization technologies.
Live VM migration mechanisms must move active VMs as quickly as possible and with minimal impact on the applications and the cluster infrastructure. These requirements translate into reducing the total migration time, downtime, application degradation, and cluster resource overheads such as network traffic, computation, memory, and storage overheads. Even though a large body of work in both industry and academia has advanced these goals, several challenges related to performance, robustness, and security remain to be addressed.
First, while the migration of a single VM has been well studied [74, 5, 18, 58, 129], the simultaneous migration of multiple VMs has not been thoroughly investigated. Second, the failure of the participating nodes during live VM migration and the resulting loss of VM state have not been investigated, even though high-availability solutions [130, 108] exist for steady-state VM operation.
Prior efforts to reduce the data transmitted during VM migration have focused on the live and non-live migration of a single VM [74, 5, 13, 133, 58, 129, 134, 95, 81, 123, 122, 135, 92, 94], live migration of multiple VMs running on the same physical machine [8], live migration of a virtual cluster across a wide-area network (WAN) [91], or non-live migration of multiple VM images across a WAN [57]. Numerous cluster job schedulers exist, such as [136, 107, 137, 138, 139, 63, 109], among many others, as well as virtual machine management systems, such as VMware's DRS [117], XenEnterprise [140], Usher [68], Virtual Machine Management Pack [141], and CoD [142], that let administrators control job and VM placement based on cluster load or specific policies such as affinity or anti-affinity rules.