Operating system reboots disrupt applications by causing downtime and by destroying the derived and cached state that they maintain in virtual memory, thereby degrading their performance. Rebooting an operating system involves shutting down the running operating system and immediately starting it again. There are several reasons for rebooting an operating system. For example, hardware maintenance and upgrades typically require the operating system to be offline before the hardware can be modified. More frequently, a reboot is required to apply code and configuration updates that the operating system cannot adopt without restarting.
Rebooting the operating system disrupts the applications running on the system, which must close client connections, commit their state to storage, and shut down. During the restart, those applications must then restore their state, rebuild memory caches, and resume accepting client connections. These disruptions are magnified in a virtualized environment because the reboot affects not only the applications operating on a host partition, but also the applications running on the hosted virtual machines.
During a reboot, applications running on a virtual machine are offline for the time required to shut down the virtual machine, shut down the host, run the firmware Power-On Self-Test (POST), start up the host, start up the virtual machine, and start up the application. In some cases, the duration of this outage may be on the order of thirty minutes or more. If a Service Level Agreement (SLA) requires a specific availability for the application, the downtime caused by host operating system reboots will consume at least a portion of the SLA's downtime budget. This leaves less of the SLA downtime budget for unplanned outages, which are unpredictable in both frequency and duration.
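The effect on the downtime budget can be illustrated with a short calculation. The sketch below is only an illustration: the 99.9% availability target, the 30-day period, and the function names are assumptions chosen for the example and do not come from any particular SLA. Only the thirty-minute reboot figure is taken from the text above.

```python
# Illustrative sketch: how much of an SLA downtime budget one planned
# host reboot consumes. All inputs below are assumed example values.

def downtime_budget_minutes(availability: float, period_days: int) -> float:
    """Total downtime (in minutes) permitted over the period."""
    total_minutes = period_days * 24 * 60
    return (1.0 - availability) * total_minutes


def remaining_budget_minutes(availability: float, period_days: int,
                             planned_outage_minutes: float) -> float:
    """Budget left for unplanned outages after one planned outage."""
    budget = downtime_budget_minutes(availability, period_days)
    return budget - planned_outage_minutes


if __name__ == "__main__":
    availability = 0.999    # assumed "three nines" target
    period_days = 30        # assumed measurement period
    reboot_minutes = 30.0   # order-of-magnitude outage cited in the text

    budget = downtime_budget_minutes(availability, period_days)
    left = remaining_budget_minutes(availability, period_days, reboot_minutes)
    print(f"budget: {budget:.1f} min, left after one reboot: {left:.1f} min")
```

Under these assumed values, a single thirty-minute reboot would use up most of the period's allowance, which is the concern the paragraph above describes.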
To mitigate the impact of host-caused reboots on virtual machines, most small-scale virtualization platforms have implemented live migration, which enables virtual machines to move seamlessly from one server to another in order to avoid a host's planned reboot. The downsides of live migration are that it adds significant complexity to overall system management, places a burden on networking resources, and extends the time required to apply updates. Rebooting a group of servers requires migrating every virtual machine at least once. Moreover, unless every server hosting virtual machines to be migrated is paired with an empty server, the migration of virtual machines becomes a tile-shuffling puzzle, and server updating can become a serial operation.
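The serialization effect can be seen in a deliberately simplified model. The sketch below assumes that every loaded server carries the same load, that one empty server can absorb exactly one loaded server's virtual machines, and that a vacated server becomes an empty spare once it has been updated. These assumptions and the function name are hypothetical and ignore real placement constraints.

```python
# Simplified model (assumptions stated above): count the migration rounds
# needed to update a group of servers by live migration.
import math


def migration_rounds(loaded_servers: int, empty_servers: int) -> int:
    """Rounds of migrate-then-update needed to update every loaded server.

    In each round, at most `empty_servers` loaded servers can be vacated,
    updated, and returned to the pool as new spares.
    """
    if loaded_servers < 0 or empty_servers < 0:
        raise ValueError("server counts must be non-negative")
    if loaded_servers == 0:
        return 0
    if empty_servers == 0:
        raise ValueError("no empty server available to receive migrations")
    return math.ceil(loaded_servers / empty_servers)


if __name__ == "__main__":
    for spares in (1, 5, 20):
        rounds = migration_rounds(20, spares)
        print(f"20 loaded servers, {spares} spare(s): {rounds} round(s)")
```

With a single spare, the model needs one round per loaded server, so the update is fully serial. Only when each loaded server has its own empty partner does the update finish in a single round.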
Virtual machine suspend-update-resume (VM-SUR) is an alternative, based on existing virtual machine technology, to shutting down virtual machines. With this approach, the host OS suspends the virtual machines, saves their state (including RAM and virtual CPU) to disk, restarts the server into the updated host OS, and then resumes the virtual machines. This allows virtual machines to retain their in-memory caches and avoids virtual machine shutdown and restart. The drawback of VM-SUR is that the RAM of all virtual machines hosted on a server must be written to and then read back from local storage as part of the host OS update, during which time the virtual machines are suspended. Using approximate numbers that reflect contemporary cloud hardware, saving and restoring 100 GB of RAM to local storage with a throughput of 100 MB/s would take about thirty minutes in total, or roughly 1,000 seconds in each direction. That disruption is no better than the one caused by a typical shutdown/restart, and although the virtual machines retain their caches, the downtime would be long enough to cause a visible outage for single-instance virtual machines.
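The estimate above can be reproduced with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 1,000 MB), the same throughput for the save and the restore, and no time spent on the reboot itself. The function names are illustrative only.

```python
# Back-of-the-envelope estimate of VM-SUR suspension time.
# Assumes decimal units and equal save/restore throughput.

def transfer_seconds(ram_gb: float, throughput_mb_s: float) -> float:
    """Seconds needed to move `ram_gb` of RAM one way at the given rate."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    return (ram_gb * 1000.0) / throughput_mb_s


def suspend_resume_minutes(ram_gb: float, throughput_mb_s: float) -> float:
    """Minutes spent saving RAM to storage and then restoring it."""
    one_way = transfer_seconds(ram_gb, throughput_mb_s)
    return (2 * one_way) / 60.0


if __name__ == "__main__":
    minutes = suspend_resume_minutes(ram_gb=100, throughput_mb_s=100)
    print(f"save + restore: {minutes:.1f} minutes")
```

For the figures quoted in the text, each direction takes 1,000 seconds, so the virtual machines remain suspended for a little over half an hour before counting the host restart.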