Some existing systems migrate virtual machines (VMs) from a source host computing device to a destination host computing device. For example, the vMotion process from VMware, Inc. moves live, running VMs from one host to another without any perceptible service interruption. However, the existing process closes disks and releases locks at the source host computing device, and then reopens the disks and acquires locks at the destination host computing device. During that handoff, the VM ‘downtime’ or switchover time (e.g., the time a VM is not executing guest instructions during vMotion) is noticeable to customers because their workloads are stalled for the duration of the disk ownership handoff.
With some existing methods, the end-to-end switchover typically takes less than one second. However, some systems have many more disks per VM, to the point where a single VM may have over 100 disks. Closing and opening 100 disks during the downtime is problematic, at least because it can extend switchover times to 2-5 seconds or greater.
In some examples, the disks are file extents on a VM file system (VMFS) or network file system (NFS), with disk open operations involving little more than opening the flat files and taking locks. However, with the advent of virtual volumes (VVOLs) and virtual storage area networks (vSANs), object-backed disks are now supported for live migration. With VVOL and vSAN, opening a disk is far more complex. For example, the host calls out to an external entity, such as a vendor provider (VP), to request that the particular object be bound to the host. A number of other calls flow back and forth between the host and the VP to prepare and complete the binding process. Only after that communication finishes can locks on the disk be acquired. The disk open is then declared to have completed successfully. Opening a single VVOL or vSAN disk, then, may take a full second or greater, thereby increasing the downtime and reducing switchover performance. Moreover, in this example, the switchover performance is now dependent on the performance of code from the VP (e.g., to release and bind locks).
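The ordering described above (bind the object to the host, then acquire locks, then declare the open complete) can be sketched as follows. This is a minimal illustration only: `VendorProvider`, `LockTable`, and `open_object_backed_disk` are hypothetical names invented for this sketch and do not correspond to any actual VVOL, vSAN, or VP interface, and the single `bind` call stands in for what may be several round trips in practice.

```python
from dataclasses import dataclass, field


@dataclass
class VendorProvider:
    """Hypothetical stand-in for an external vendor provider (VP)."""
    bound: set = field(default_factory=set)

    def bind(self, object_id: str, host: str) -> None:
        # Assumption: one call models the whole bind exchange with the VP.
        self.bound.add((object_id, host))

    def is_bound(self, object_id: str, host: str) -> bool:
        return (object_id, host) in self.bound


@dataclass
class LockTable:
    """Hypothetical lock table allowing one owning host per disk object."""
    owners: dict = field(default_factory=dict)

    def acquire(self, object_id: str, host: str) -> None:
        owner = self.owners.get(object_id)
        if owner is not None and owner != host:
            raise RuntimeError(f"{object_id} is locked by {owner}")
        self.owners[object_id] = host

    def release(self, object_id: str, host: str) -> None:
        # Only the current owner can release the lock.
        if self.owners.get(object_id) == host:
            del self.owners[object_id]


def open_object_backed_disk(object_id, host, vp, locks):
    """Open one object-backed disk and return the steps in the order taken."""
    steps = []
    vp.bind(object_id, host)            # 1. ask the VP to bind the object
    steps.append("bind")
    if not vp.is_bound(object_id, host):
        raise RuntimeError("bind did not complete")
    locks.acquire(object_id, host)      # 2. locks only after binding finishes
    steps.append("lock")
    steps.append("open-complete")       # 3. the open is declared successful
    return steps
```

In this sketch, an open attempted by the destination host fails while the source host still holds the lock, which is why the close-and-release at the source and the bind-and-lock at the destination both fall inside the switchover window.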
Some existing methods of optimizing disk handoff during switchover have involved prepopulating disk lookup information at the destination host and/or using multiple threads to open disks concurrently. However, there is no guarantee that any number of concurrent requests will be handled in parallel.
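The limitation of the multithreaded approach can be illustrated with a small sketch. It assumes a hypothetical backend, `SerializingBackend`, that handles one open request at a time; the class, its counters, and the short sleep that stands in for per-disk work are all inventions of this example rather than a description of any real storage system. The host submits opens from several threads, yet the backend never has more than one request in flight.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SerializingBackend:
    """Hypothetical storage backend that handles one open request at a time."""

    def __init__(self):
        self._gate = threading.Lock()    # serializes request handling
        self._stats = threading.Lock()   # protects the counters below
        self._in_flight = 0
        self.peak_in_flight = 0

    def open_disk(self, disk_id):
        with self._gate:
            with self._stats:
                self._in_flight += 1
                self.peak_in_flight = max(self.peak_in_flight, self._in_flight)
            time.sleep(0.001)            # assumption: stand-in for open work
            with self._stats:
                self._in_flight -= 1
        return disk_id


def open_disks_concurrently(backend, disk_ids, workers):
    """Issue disk opens from a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(backend.open_disk, disk_ids))


if __name__ == "__main__":
    backend = SerializingBackend()
    open_disks_concurrently(backend, range(20), workers=8)
    # Eight threads issued requests, but the backend processed them singly.
    print("peak requests handled at once:", backend.peak_in_flight)
```

Under this assumption, adding host-side threads does not shorten the total open time, because the requests are still processed one after another.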
Even with the existing methods of disk handoff and live migration, it is increasingly difficult to migrate larger and more complicated systems from a source VM to a destination VM without increasing VM downtime. Further, with some of the existing systems, the disks are maintained by VPs, which creates uncontrollable or unknowable VM downtimes because of the partner code run by the VPs. This can create visible processing delays during live migration that are unacceptable to users.