Live migration is the process whereby a running application is moved between different physical machines without disconnecting a client or the application. Applications to which live migration may be applied include virtual machines, containers or other processes or tasks run by an operating system. A virtual machine is an emulation of a computer system, such as an operating system or application environment. A container, or lightweight virtualization, is an operating-system-level virtualization environment for running multiple isolated user-space instances on a single host. Each container is isolated from the others and has access to a subset of the resources of the operating system.
In live migration, the memory, storage, and network connectivity of the application are transferred from the original host machine to the destination host machine. Live migration allows an administrator to shut down a physical server for maintenance or upgrades without subjecting the system's users to downtime. It also facilitates proactive maintenance since, if an imminent failure is suspected, the potential problem can be resolved before disruption of service occurs. Live migration can also be used for load balancing, in which work is shared among computers in order to optimize the utilization of available CPU resources.
To be effective, live migration must cause minimal disruption to continued execution. Two main techniques are used to achieve this aim, namely pre-copying and post-copying.
In pre-copy memory migration, data is copied from source memory to destination memory while the application is still running on the source. Any memory pages that change (become ‘dirty’) during this process are re-copied. This first stage is referred to as the “warm-up” phase. After the warm-up phase, the “stop and copy” phase is performed: the application is stopped on the source host and the remaining dirty pages are copied to the destination host. The application is therefore unavailable between being stopped on the source host and being resumed on the destination host.
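The two phases of pre-copy migration can be illustrated with a minimal simulation. This is only a sketch under stated assumptions: memory is modelled as a dictionary of pages, and the function name `pre_copy_migrate` and the parameters `write_rounds`, `max_rounds` and `dirty_threshold` are hypothetical names introduced for illustration, not part of any particular hypervisor.

```python
def pre_copy_migrate(source, write_rounds, max_rounds=5, dirty_threshold=1):
    """Simulate pre-copy migration of a page table (dict: page id -> content).

    write_rounds[i] holds the writes the application makes on the source
    while warm-up round i is being copied (an assumption of this model).
    Returns (destination, warm_up_rounds, pages_copied_during_downtime).
    """
    destination = {}
    dirty = set(source)  # initially every page still has to be copied
    rounds = 0
    writes = iter(write_rounds)

    # Warm-up phase: the application keeps running on the source.
    while len(dirty) > dirty_threshold and rounds < max_rounds:
        to_copy, dirty = dirty, set()
        for page in to_copy:
            destination[page] = source[page]
        # Pages written during this round become dirty and must be re-copied.
        for page, value in next(writes, {}).items():
            source[page] = value
            dirty.add(page)
        rounds += 1

    # Stop-and-copy phase: the application is stopped, so no new writes occur.
    for page in dirty:
        destination[page] = source[page]
    return destination, rounds, len(dirty)


memory = {0: "a", 1: "b", 2: "c", 3: "d"}
dest, rounds, downtime_pages = pre_copy_migrate(memory, [{1: "B", 2: "C"}, {2: "CC"}])
print(dest == memory, rounds, downtime_pages)
```

In this model the number of pages copied during the stop-and-copy phase is a rough proxy for the downtime: the fewer pages remain dirty when the application is stopped, the shorter the interruption.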
In post-copying, the application is suspended at the source host and a minimal subset of its execution state is transferred to the destination host. The application is then resumed at the destination host. Concurrently, the source host pushes the remaining memory pages to the destination memory, in a process known as pre-paging. If the application at the destination host tries to access a memory page that has not yet been transferred, it generates a so-called “network fault”, and the destination host then fetches the page from the memory of the source host. Too many network faults can degrade the performance of the migrated application.
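The interplay between pre-paging and network faults in post-copy migration can be sketched in the same way. The class `PostCopyDestination` and its methods are hypothetical names used only for this illustration, and the source memory is assumed to be directly readable as a dictionary, which stands in for a fetch over the network.

```python
class PostCopyDestination:
    """Toy model of the destination host during post-copy migration."""

    def __init__(self, source_pages, minimal_state):
        self.source = source_pages  # stands in for the remote source memory
        # Only the minimal execution state is present when the application resumes.
        self.local = {page: source_pages[page] for page in minimal_state}
        self.network_faults = 0

    def read(self, page):
        """Application access: fetch from the source if the page is missing."""
        if page not in self.local:
            self.network_faults += 1
            self.local[page] = self.source[page]
        return self.local[page]

    def pre_page(self, count):
        """Source pushes up to `count` pages that have not been transferred yet."""
        pending = [page for page in self.source if page not in self.local]
        pushed = pending[:count]
        for page in pushed:
            self.local[page] = self.source[page]
        return len(pushed)

    def done(self):
        return len(self.local) == len(self.source)


dest = PostCopyDestination({0: "a", 1: "b", 2: "c", 3: "d"}, minimal_state=[0])
dest.read(2)        # page 2 is not local yet, so this counts as a network fault
dest.pre_page(1)    # the source pushes one pending page in the background
print(dest.network_faults, dest.done())
```

The faster pre-paging fills the destination memory relative to the application's access pattern, the fewer reads fall through to the source, which is why the order in which pages are pushed matters in practice.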
A number of attempts have been made to improve the performance of pre-copy and post-copy live migration, and of combinations of the two techniques. An early proposal was made by Clark et al. (“Live Migration of Virtual Machines”, 2nd Symposium on Networked Systems Design and Implementation, 2005). This paper proposes a pre-copy process but identifies a “writable working set”, a set of memory pages that are updated so frequently that they are not worth copying in the pre-copy phase. These memory pages are copied only in the downtime period.
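The idea of deferring frequently updated pages can be illustrated as follows. This is not the procedure of Clark et al.; it is a simple frequency heuristic assumed for illustration, in which a page belongs to the writable working set if it was dirtied in at least a given fraction of the observed intervals. The names `writable_working_set`, `plan_copy`, `dirty_log` and `min_fraction` are hypothetical.

```python
from collections import Counter


def writable_working_set(dirty_log, min_fraction=0.5):
    """Return pages dirtied in at least `min_fraction` of the observed intervals.

    dirty_log is a list of sets, one per interval, of the pages written
    during that interval (an assumed input format).
    """
    counts = Counter(page for interval in dirty_log for page in set(interval))
    cutoff = min_fraction * len(dirty_log)
    return {page for page, n in counts.items() if n >= cutoff}


def plan_copy(all_pages, dirty_log, min_fraction=0.5):
    """Split pages into those sent during pre-copy and those left for downtime."""
    wws = writable_working_set(dirty_log, min_fraction)
    pre_copy = sorted(set(all_pages) - wws)
    downtime = sorted(wws)
    return pre_copy, downtime


log = [{1, 2}, {2}, {2, 3}, {2}]
pre_copy_pages, downtime_pages = plan_copy(range(5), log)
print(pre_copy_pages, downtime_pages)
```

Here page 2 is written in every interval, so copying it during warm-up would be wasted effort; it is left for the downtime period, while the occasionally written pages 1 and 3 are still pre-copied.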
In US2014/0089393, a tracking tool is used to track modifications made to the data by the application at the source host. When the source host is shut down, these modifications are passed to the destination host. In U.S. Pat. No. 8,689,211, a recovery point is defined for use in case of migration failure.
These methods either impose a downtime (e.g., pre-copy) or are subject to possible corruption if a failure occurs during the process (e.g., post-copy, and combinations of pre-copy and post-copy). U.S. Pat. No. 8,671,238 discloses a method of overcoming these problems by using a shared data-store that stores a file containing the instance memory and that is accessible to both source and destination hosts if direct communication between the hosts fails. It does not, however, suit all cases. In particular, for highly time-critical processing instances, an approach that does not require a third component (i.e., a shared data-store) is more suitable and efficient.