Rebooting an operating system involves shutting down the running operating system and then immediately starting it again. There are several reasons for rebooting an operating system. For example, hardware maintenance and upgrades typically require the operating system to be offline before the hardware can be modified. More frequently, a reboot is required to apply code and configuration updates that the operating system cannot adopt without restarting.
Rebooting the operating system disrupts the applications running on the system, which must close client connections, commit their state to storage, and shut down. During the restart, those applications must then restore their state, rebuild memory caches, and resume accepting client connections. These disruptions are magnified in a virtualized environment because the reboot affects not only the applications operating on a host partition, but also the applications running on the hosted virtual machines.
During a reboot, applications running on a virtual machine will be offline for the time required to shut down the virtual machine, shut down the host, run the firmware Power-On Self-Test (POST), start up the host, start up the virtual machine, and start up the application. In some cases, the duration of this outage may be on the order of thirty minutes or more. If a Service Level Agreement (SLA) requires a specific availability for the application, the downtime caused by host operating system reboots will consume at least a portion of the SLA's downtime budget. This leaves less time in the SLA downtime budget for unplanned outages, which are unpredictable in both frequency and duration.
In a cloud environment, services typically require at least two virtual machines running on separate host servers to meet a compute-availability SLA. Using multiple, distributed virtual machines allows the cloud platform to update a first server hosting a first virtual machine while a second virtual machine continues to run on a second server. The second server may be updated after the first virtual machine is running again on the updated first server. In this arrangement, reboot downtime results in reduced capacity rather than a complete outage, although the virtual machines on the rebooted servers still lose their in-memory caches. Additionally, if only two virtual machines are used to support a service, then there is a risk of a complete outage during an update. For example, while one server is being updated, the virtual machine on that host is unavailable and, if the server hosting the other virtual machine fails during the update, then the other virtual machine will also be unavailable.
The end-to-end update of a cluster, which may include approximately one thousand servers, takes 12-24 hours depending upon the topology of deployed services, any server failures caused by the hardware reset during the reboot, and the length of time it takes to shut down virtual machines. Unless all of the servers are updated concurrently, which is likely to violate customer SLAs, the cluster's configuration is inconsistent for the duration of the end-to-end update, and servers that have not yet been updated remain exposed to the security and reliability issues that the update fixes.
While scale-out PaaS services have reduced capacity during the update, services that have tiers consisting of a single virtual machine (which includes the vast majority of IaaS-based tiers) experience a complete outage. A 99.9% yearly availability SLA permits roughly 8.76 hours of downtime per year. Using thirty minutes as the time a virtual machine is offline during an update, updating once per month consumes six of those hours and allows for only about 2.75 hours of unplanned downtime over a year. Given the run rate of software and hardware incidents and the variability of update times and of unplanned-outage mean time to detect (MTTD) and mean time to resolve (MTTR), it is unlikely that the platform can meet that SLA for a very high percentage of customers with a monthly update.
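The downtime arithmetic above can be reproduced with a short sketch. The function names are hypothetical and chosen only for illustration; the sketch assumes a 365-day year and treats every update as a fixed thirty-minute outage, as the text does.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored (assumption)


def downtime_budget_hours(availability: float) -> float:
    """Total yearly downtime permitted by a yearly availability target."""
    return HOURS_PER_YEAR * (1.0 - availability)


def unplanned_budget_hours(availability: float,
                           updates_per_year: int,
                           minutes_per_update: float) -> float:
    """Yearly downtime left for unplanned outages after planned updates."""
    planned_hours = updates_per_year * minutes_per_update / 60.0
    return downtime_budget_hours(availability) - planned_hours


total = downtime_budget_hours(0.999)
remaining = unplanned_budget_hours(0.999, updates_per_year=12,
                                   minutes_per_update=30)
print(f"total budget: {total:.2f} h, left for unplanned: {remaining:.2f} h")
```

Under these assumptions the total budget is about 8.76 hours and about 2.76 hours remain for unplanned outages, in line with the figure quoted above.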
To mitigate the impact of host-caused reboots on virtual machines, most small-scale virtualization platforms have implemented live migration, which enables virtual machines to seamlessly move from one server to another in order to avoid a host's planned reboot. The downsides of live migration are that it adds significant complexity to overall system management, places a burden on networking resources, and extends the time required to apply updates. Rebooting a group of servers requires migrating every virtual machine at least once. And unless an empty server is paired with every one hosting virtual machines that will be migrated, the migration of virtual machines becomes a tile shuffle game and server updating can become a serial operation.
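The serialization effect can be illustrated with a deliberately simplified model. It is only a sketch under stated assumptions: every server hosts an equal load, the already-empty spare servers are updated first, and each wave migrates the virtual machines from as many loaded servers as there are updated empty ones, after which the vacated servers are updated and become the spares for the next wave. The function name `migration_waves` is hypothetical.

```python
import math


def migration_waves(loaded_servers: int, spare_servers: int) -> int:
    """Number of sequential migrate-then-update waves in the simplified model.

    Each wave can vacate at most `spare_servers` loaded servers, because
    that is how many updated, empty destinations are available.
    """
    if loaded_servers < 0:
        raise ValueError("loaded_servers must be non-negative")
    if spare_servers <= 0:
        raise ValueError("at least one spare server is needed to migrate")
    return math.ceil(loaded_servers / spare_servers)


# One spare per loaded server: everything moves in a single wave.
print(migration_waves(loaded_servers=10, spare_servers=10))
# A single spare: servers are vacated and updated one at a time.
print(migration_waves(loaded_servers=10, spare_servers=1))
```

In this model a one-to-one pairing of empty and loaded servers needs a single wave, while a single spare forces ten waves for ten servers, which is the serial case the paragraph describes.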
Virtual machine suspend-update-resume (VM-SUR) is an alternative, based on existing virtual machine technology, to shutting down virtual machines. With this approach, the host OS suspends the virtual machines, saves their state (including RAM and virtual CPU) to disk, restarts the server into the updated host OS, and then resumes the virtual machines. This allows virtual machines to retain their in-memory caches and avoids virtual machine shutdown and restart. The drawback of VM-SUR is that the RAM of all virtual machines hosted on a server must be written to and then read back from local storage as part of the host OS update, during which time the virtual machines are suspended. Using approximate numbers that reflect contemporary cloud hardware, the save and restore of 100 GB of RAM to local storage that has a throughput of 100 MB/s would take about thirty minutes. That disruption is no better than that caused by a typical shutdown/restart, and while virtual machines retain their caches, the downtime would be long enough to cause a visible outage for single-instance virtual machines.
Because the surrounding application state may change drastically in the interim, the host would still be obligated to give virtual machines the opportunity to finish in-flight work and prepare gracefully, which adds to the downtime. Even assuming a 10× throughput improvement for local storage, the update would still take at least three minutes, well beyond most client timeouts.
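The VM-SUR estimates can be checked with a minimal sketch. It assumes decimal units (1 GB = 1000 MB), equal write and read throughput, and no overhead beyond the raw transfer; the function name `save_restore_seconds` is hypothetical.

```python
def save_restore_seconds(ram_gb: float, throughput_mb_per_s: float) -> float:
    """Time to write all VM RAM to local storage and read it back."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    ram_mb = ram_gb * 1000  # decimal gigabytes (assumption)
    one_way_seconds = ram_mb / throughput_mb_per_s
    return 2 * one_way_seconds  # one pass to save, one pass to restore


baseline = save_restore_seconds(ram_gb=100, throughput_mb_per_s=100)
faster = save_restore_seconds(ram_gb=100, throughput_mb_per_s=1000)
print(f"100 MB/s:  {baseline / 60:.1f} minutes")
print(f"1000 MB/s: {faster / 60:.1f} minutes")
```

With these inputs the transfer alone takes 2000 seconds (about 33 minutes) at 100 MB/s and 200 seconds (about 3.3 minutes) at ten times that throughput, before any time is added for virtual machines to finish in-flight work.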