1. Field of the Invention
The present invention relates generally to the data processing field and, more specifically, to a method and system for moving an application executing on a virtual machine running on one physical machine to another virtual machine running on a different physical machine.
2. Description of the Related Art
There are a host of reasons for which the live migration of an application running on a virtual machine is desirable. The term “migration” means that an application executing on a first virtual machine running on a first physical machine is moved to a second virtual machine running on a different physical machine. The physical machines may be connected to one another over a local area network (LAN) or a wide-area network (WAN). The term “live migration” means that the migration takes place while the application is running on the first virtual machine.
Live migration may be triggered, for example, by a planned or unplanned maintenance of a data center, by a consolidation, load balancing or optimization of resources in a data center, or by an external catastrophic condition. Migration may take place as a result of a human decision or due to a systems management service decision independent of the application, and should not affect the behavior of the application. The only effect of live migration should be some responsiveness delays, and even these delays should be minimized.
Migration can take place at many levels: the virtual machine, the operating system, the language runtime, or even the application. Migration at the level of the virtual machine is the most general, because the migration mechanism can be unaware of the guest operating system, of the programming language or of any other architectural feature of the application being migrated. Migration transfers the virtual memory, the external storage (disk) and network connections from a source machine to a target machine. The present application is concerned with the transfer of the virtual memory.
The most efficient known techniques for the transfer of virtual memory involve a two-phase process: a “pre-copy” phase and a “demand-paging” phase. During the pre-copy phase, selected pages are copied from the source machine to the target machine. Since the transfer must appear to occur as of a single instant of time, any pre-copied pages which have been modified (or “dirtied”) after having been pre-copied and before the pre-copy phase has ended must be re-sent. After some number of pages has been pre-copied, the application is halted in the source machine, a start message is sent to the target machine identifying which pages have been pre-copied and which pages have not yet been sent, and the demand-paging phase begins. In the demand-paging phase, the source machine continues to send the remaining pages while the application runs on the target machine with the pages sent so far. If the application references an as-yet-unsent page, it takes a page fault, and the target machine sends a demand page request to the source machine and waits for that particular page to arrive.
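By way of illustration only, the two-phase transfer described above can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not an implementation of any particular migration system: pages are modeled as dictionary entries, the names `pre_copy`, `demand_paging`, `writes` and `accesses` are hypothetical, no further writes are assumed to occur during the re-send pass, and the source is assumed to send one background page per application access.

```python
def pre_copy(source, target, selected, writes):
    """Pre-copy phase: copy `selected` pages while the application still runs
    on the source. `writes` (hypothetical) maps a step index to a
    (page, data) write that the application performs after that step's copy."""
    sent, dirty = set(), set()
    for step, page in enumerate(selected):
        target[page] = source[page]
        sent.add(page)
        dirty.discard(page)  # the copy just made is current
        if step in writes:
            written_page, data = writes[step]
            source[written_page] = data
            if written_page in sent:
                dirty.add(written_page)  # target copy is now stale
    # Re-send pages dirtied after being pre-copied (assumes no further writes).
    resent = 0
    for page in sorted(dirty):
        target[page] = source[page]
        resent += 1
    # The start message tells the target which pages it has and which it lacks.
    start_message = {"sent": sent, "unsent": set(source) - sent}
    return start_message, resent


def demand_paging(source, target, start_message, accesses):
    """Demand-paging phase: the application runs on the target. A reference to
    an unsent page causes a page fault and a demand request to the source."""
    remaining = sorted(start_message["unsent"])
    faults = 0
    for page in accesses:
        if page not in target:
            faults += 1                  # page fault: target waits for this page
            target[page] = source[page]  # demand request served by the source
            remaining.remove(page)
        if remaining:                    # assumed: one background page per access
            nxt = remaining.pop(0)
            target[nxt] = source[nxt]
    for page in remaining:               # source finishes sending what is left
        target[page] = source[page]
    return faults


source = {0: "a", 1: "b", 2: "c", 3: "d"}
target = {}
# Pre-copy pages 0 and 1; after page 1 is copied, the application rewrites page 0.
msg, resent = pre_copy(source, target, selected=[0, 1], writes={1: (0, "A")})
faults = demand_paging(source, target, msg, accesses=[3, 2])
```

In this example, page 0 is dirtied after being pre-copied and is therefore re-sent, and the reference to page 3 takes a page fault because that page had not yet arrived at the target.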
It would be desirable to reduce the time required to perform the live migration process from a source machine to a target machine. In particular, it would be desirable to optimize the total migration time, i.e., the time from the beginning of the pre-copy phase until the end of the demand-paging phase, and to minimize the disruption time, i.e., the time during which the application cannot run for reasons caused by the migration: namely, when the source machine is halted and the target machine has not yet received the start message, or when the target machine is waiting due to a page fault. Total migration time is affected both by disruption time and by the prolongation of the pre-copy phase due to the need to re-send some pages. It is desirable to minimize total migration time because, during the migration, resources in both source and target machines must be reserved on behalf of the migrating application, and the source machine may not yet be freed up for other purposes. It is desirable to minimize disruption time because during disruption periods the application cannot make progress, and queues of service requests build up.
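The two quantities defined above can be made concrete with a short calculation. The following is a minimal sketch, assuming that the relevant events have been timestamped; the `MigrationTimeline` record and its field names are hypothetical and are introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MigrationTimeline:
    """Hypothetical record of timestamps (in seconds) for one migration."""
    pre_copy_start: float
    source_halted: float
    start_message_received: float
    demand_paging_end: float
    # Duration of each wait on the target caused by a page fault.
    page_fault_waits: List[float] = field(default_factory=list)


def total_migration_time(t: MigrationTimeline) -> float:
    # From the beginning of the pre-copy phase to the end of demand paging.
    return t.demand_paging_end - t.pre_copy_start


def disruption_time(t: MigrationTimeline) -> float:
    # Time the application cannot run: the gap between halting the source and
    # the target receiving the start message, plus all page-fault waits.
    handover_gap = t.start_message_received - t.source_halted
    return handover_gap + sum(t.page_fault_waits)


timeline = MigrationTimeline(
    pre_copy_start=0.0,
    source_halted=10.0,
    start_message_received=10.5,
    demand_paging_end=30.0,
    page_fault_waits=[0.25, 0.25],
)
total = total_migration_time(timeline)   # 30.0 seconds
disruption = disruption_time(timeline)   # 0.5 + 0.5 = 1.0 second
```

Under these assumed timestamps, the application is disrupted for 1.0 second of a 30.0 second migration; a longer pre-copy phase would increase the former figure's denominator without necessarily reducing the disruption itself.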