Virtual machine (VM) suspend/resume is a feature of virtualized computer systems that allows administrators to save the running state of a VM and later restore the VM to the exact running state it was in when it was suspended. One benefit of resuming a suspended VM is that the VM does not have to go through a complete boot cycle; as a result, the VM can be brought online quickly, with little or no disruption to users.
The VM suspend/resume feature allows administrators to make efficient use of the server resources that support running VMs. Inactive, but otherwise live, VMs can be suspended to prevent them from consuming server resources. The server resources allocated to the suspended VMs can then be re-allocated to active VMs that may benefit from the extra resources. In a virtual desktop environment, such as Virtual Desktop Infrastructure (VDI), which is commercially available from VMware, Inc., the resource savings can be enormous because studies have shown that many users stay logged into their remote desktops even after they have disconnected from their remote desktop sessions.
The process of suspending a VM is also referred to as checkpointing, which is described in U.S. Pat. No. 6,795,966, incorporated by reference herein in its entirety. During the VM suspend process, a file (known as a checkpoint file) is created on a storage device, typically a disk array, and the state of the VM, including its memory and CPU state, is stored in the file. During VM resume, this same file is loaded into memory to restore the state of the VM. With a shared storage device, it is possible to resume the VM on a different host than where it was suspended.
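The checkpoint round trip described above can be illustrated with a minimal Python sketch. It is not the method of the referenced patent or of any hypervisor: the `VMState` class, the `suspend` and `resume` functions, the register names, and the length-prefixed file layout are all hypothetical, chosen only to show state being written to a file on storage and later loaded back into memory.

```python
import json
import os
import tempfile
from dataclasses import dataclass


@dataclass
class VMState:
    """Hypothetical stand-in for the state captured at suspend time."""
    cpu_registers: dict  # illustrative register names mapped to integer values
    memory: bytes        # raw contents of the VM's allocated memory


def suspend(state: VMState, checkpoint_path: str) -> None:
    """Store the CPU and memory state in a checkpoint file."""
    header = json.dumps(state.cpu_registers).encode("utf-8")
    with open(checkpoint_path, "wb") as f:
        # Length-prefix the CPU state so resume() knows where memory begins.
        f.write(len(header).to_bytes(8, "big"))
        f.write(header)
        f.write(state.memory)


def resume(checkpoint_path: str) -> VMState:
    """Load the same checkpoint file back into memory."""
    with open(checkpoint_path, "rb") as f:
        header_len = int.from_bytes(f.read(8), "big")
        cpu_registers = json.loads(f.read(header_len).decode("utf-8"))
        memory = f.read()
    return VMState(cpu_registers=cpu_registers, memory=memory)


# Round trip through a temporary directory standing in for the storage device.
with tempfile.TemporaryDirectory() as storage:
    path = os.path.join(storage, "vm1.ckpt")
    original = VMState({"rip": 4096, "rsp": 8192}, b"\x00" * 1024)
    suspend(original, path)
    restored = resume(path)
    print(restored == original)
```

Because the checkpoint is an ordinary file, any host that can read the same path could call `resume`, which mirrors the shared-storage case mentioned above.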
The VM suspend/resume process described above works well for the occasional suspend/resume of VMs, but does not scale when many VMs are suspended or resumed at the same time. When a large number of VMs are suspended at the same time, known as a “suspend storm,” the process can take a long time to complete, and consequently the benefits gained from freeing up hardware resources through the VM suspensions are delayed. For example, if 100 VMs each having 4 GB of allocated memory are suspended at the same time, 100 × 4 GB (400 GB) of data would be written to the storage device. The same applies to a “resume storm,” in which many users request connections to their VMs at about the same time. In the above example of 100 VMs, if the users of those VMs were to request connections at about the same time, the VM resume process would require 400 GB of data to be read from the storage device and loaded into memory, inevitably delaying many of the requested connections.
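The arithmetic in the example above can be written out as a short Python sketch. The function names and the 1 GB/s throughput figure are assumptions introduced for illustration; the text gives no throughput for the storage device.

```python
def storm_io_gb(num_vms: int, memory_gb_per_vm: float) -> float:
    """Total checkpoint data moved when every VM is suspended or resumed."""
    return num_vms * memory_gb_per_vm


def storm_duration_s(total_gb: float, throughput_gb_per_s: float) -> float:
    """Lower bound on storm duration if storage throughput is the only limit."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return total_gb / throughput_gb_per_s


total_gb = storm_io_gb(100, 4)              # 100 VMs x 4 GB each = 400 GB
seconds = storm_duration_s(total_gb, 1.0)   # assumed 1 GB/s: 400.0 seconds
print(total_gb, seconds)
```

Under that assumed throughput, the last user in a resume storm would wait several minutes, which is the delay the paragraph above describes.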
While the impact of a suspend storm can be mitigated to an extent by scheduling the VM suspensions in a staggered fashion to spread the load on the storage device, a resume storm cannot be staggered, because users expect to access their VMs shortly after they have requested access. As a result, the storage device becomes a bottleneck when a large number of VMs are resumed at about the same time.
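One way to picture staggered suspension is the Python sketch below. The text does not specify a scheduling policy, so the batching scheme and the `batch_size` and `interval_s` parameters are hypothetical.

```python
def staggered_schedule(vm_ids, batch_size, interval_s):
    """Map each VM to a start offset in seconds, one batch per interval."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return {
        vm: (index // batch_size) * interval_s
        for index, vm in enumerate(vm_ids)
    }


# Five VMs, two suspensions at a time, 30 seconds between batches.
schedule = staggered_schedule(["vm1", "vm2", "vm3", "vm4", "vm5"], 2, 30)
print(schedule)
```

No comparable schedule is available for resumes, since delaying a resume means delaying a user who is already waiting for the VM.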