A virtual machine (VM) is a software abstraction of a physical computing system capable of running one or more applications under the control of a guest operating system, where the guest operating system interacts with an emulated hardware platform, also referred to as a virtual hardware platform. One or more VMs and a virtual hardware platform are executed on a physical host device, such as a server-class computer. VMs are frequently employed in data centers, cloud computing platforms, and other distributed computing systems, and are executed on the physical host devices of such systems. Typically, these host devices are logically grouped or “clustered” together as a single logical construct. Thus, the aggregated computing and memory resources of the cluster that are available for running VMs can be provisioned flexibly and dynamically to the various VMs being executed.
However, there are also drawbacks to organizing host devices in clusters when executing VMs. For example, when cluster utilization is high, i.e., when the computing, memory, and/or networking resources of a cluster are fully or nearly fully utilized, VM availability can be compromised and VM latency can increase significantly. While the performance of VMs in a cluster with high utilization can be improved by a system administrator manually adding host devices to the cluster and/or migrating VMs across clusters (e.g., to a less utilized cluster), such customizations are generally not scalable across the plurality of clusters included in a typical distributed computing environment and require VMs to be powered down. Further, performing such manual customizations in real time in response to dynamic workloads in a cluster is generally impracticable. Instead, manual customization of clusters is typically performed on a periodic basis, e.g., daily or weekly.
In addition, to maximize VM availability, clusters of host devices often include reserved failover capacity, i.e., host devices in the cluster that remain idle during normal operation and are therefore available for executing VMs whenever a host device in the cluster fails. Such reserved failover capacity can make up a significant portion of the resources of a cluster, but is infrequently utilized. For example, in a distributed computing system that includes 50 clusters, where each cluster includes 10 host devices and has a failover capacity of 20%, the capacity equivalent to 100 host devices is unused in the system until a failure occurs. Because failures are relatively infrequent, the majority of this reserved failover capacity remains idle most of the time, thereby incurring both capital and operational costs for little benefit.
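The arithmetic behind the foregoing example can be sketched as follows. This is only an illustrative calculation, assuming that every cluster has the same number of host devices and the same failover percentage; the function name `reserved_failover_hosts` and its parameters are hypothetical and are not part of any described system.

```python
def reserved_failover_hosts(num_clusters: int,
                            hosts_per_cluster: int,
                            failover_percent: int) -> float:
    """Return the host-device equivalents held idle as failover capacity.

    Assumes uniform clusters: each has the same host count and reserves
    the same percentage of its capacity for failover.
    """
    if not 0 <= failover_percent <= 100:
        raise ValueError("failover_percent must be between 0 and 100")
    total_hosts = num_clusters * hosts_per_cluster
    return total_hosts * failover_percent / 100


# Values from the example: 50 clusters, 10 host devices each, 20% reserved.
idle_hosts = reserved_failover_hosts(50, 10, 20)
print(idle_hosts)  # 100.0
```

With the example values, 500 host devices in total yield 100 host-device equivalents that remain idle until a failure occurs.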