A high availability system keeps applications running in virtual machines available despite failures. In the event of a host failure, the affected virtual machines are automatically restarted on other hosts with spare capacity. Additionally, if an operating system (OS)-related failure occurs within a virtual machine, the failure is detected and the affected virtual machine is restarted on the same host. The high availability system may include a distributed monitoring solution that continuously monitors all hosts and detects host failures.
The high availability system may leverage a cluster of hosts, which aggregates the computing resources of the hosts into a resource pool. Hosts in the cluster are monitored and, in the event of a failure, the virtual machines on the failed host are restarted on alternate hosts in the cluster. The computing resources in the cluster are managed as if they resided on a single host. Thus, when a virtual machine is restarted, it may be given resources from other hosts in the cluster rather than being tied to the specific host that failed.
The high availability system includes an agent on every host of the cluster. The agents exchange heartbeat messages to monitor the liveness of the hosts in the cluster. A loss of heartbeat messages from a host may indicate that the host has failed. When a host failure is detected, the virtual machines running on that host are failed over. For example, the virtual machines are restarted on the alternate host with the most available unreserved capacity, e.g., available central processing unit (CPU) and memory resources.
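The heartbeat check and failover placement described above can be illustrated with a minimal sketch. The names `Host`, `VM`, `detect_failed`, and `fail_over`, the timeout value, and the capacity units are hypothetical and chosen only for illustration; an actual agent would exchange heartbeats over the network and may use a different placement policy.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    cpu: int   # CPU demand (hypothetical unit, e.g. MHz)
    mem: int   # memory demand (hypothetical unit, e.g. MB)


@dataclass
class Host:
    name: str
    cpu_free: int          # unreserved CPU capacity
    mem_free: int          # unreserved memory capacity
    last_heartbeat: float  # time the last heartbeat was received
    vms: list = field(default_factory=list)


def detect_failed(hosts, now, timeout):
    """Treat a host as failed when its last heartbeat is older than `timeout`."""
    return [h for h in hosts if now - h.last_heartbeat > timeout]


def fail_over(hosts, failed):
    """Restart each VM of a failed host on the surviving host that has the
    most unreserved capacity and can still fit the VM."""
    failed_names = {h.name for h in failed}
    alive = [h for h in hosts if h.name not in failed_names]
    placements = {}
    for dead in failed:
        for vm in list(dead.vms):
            fits = [h for h in alive
                    if h.cpu_free >= vm.cpu and h.mem_free >= vm.mem]
            if not fits:
                continue  # no surviving host can hold this VM; it stays down
            target = max(fits, key=lambda h: (h.cpu_free, h.mem_free))
            target.cpu_free -= vm.cpu
            target.mem_free -= vm.mem
            target.vms.append(vm)
            dead.vms.remove(vm)
            placements[vm.name] = target.name
    return placements


# Example: host "c" stopped sending heartbeats at t=90; it is now t=101.
a = Host("a", cpu_free=3000, mem_free=8192, last_heartbeat=100.0)
b = Host("b", cpu_free=2500, mem_free=8192, last_heartbeat=100.0)
c = Host("c", cpu_free=0, mem_free=0, last_heartbeat=90.0,
         vms=[VM("vm1", 1000, 2048), VM("vm2", 1000, 2048)])
hosts = [a, b, c]

failed = detect_failed(hosts, now=101.0, timeout=5.0)
placements = fail_over(hosts, failed)
print(placements)  # {'vm1': 'a', 'vm2': 'b'}
```

In this sketch, CPU is compared first and memory only breaks ties, which is one of several reasonable ways to rank "most available unreserved capacity".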
The high availability system ensures that sufficient spare computing resources are available in the resource pool at all times so that virtual machines can be restarted on different hosts in the event of a host failure. These spare computing resources are reserved in advance and are always kept unused. For example, a user may specify that the high availability system needs enough spare computing resources to handle the failure of a certain number of hosts. In one example, the user may specify that computing resources sufficient to fail over the virtual machines of two failed hosts are needed. In this case, the spare computing resources in the resource pool needed for failover of two hosts are not used. Admission control may then be used to prevent the use of the spare computing resources. Keeping this capacity idle results in inefficient hardware and power utilization.
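The effect of reserving failover capacity can be illustrated with a minimal admission-control sketch. The text does not state how the reserve is sized, so the sketch assumes, as a conservative choice, that the reserve equals the combined capacity of the N largest hosts. The function names and the single capacity dimension are hypothetical simplifications.

```python
def reserved_capacity(host_capacities, failures_to_tolerate):
    """Capacity held back for failover. Assumption: reserve the combined
    capacity of the N largest hosts, which covers the worst case."""
    largest = sorted(host_capacities, reverse=True)[:failures_to_tolerate]
    return sum(largest)


def admit(host_capacities, used, demand, failures_to_tolerate):
    """Admit a new virtual machine only if it fits without touching the reserve."""
    total = sum(host_capacities)
    usable = total - reserved_capacity(host_capacities, failures_to_tolerate)
    return used + demand <= usable


def usable_fraction(host_capacities, failures_to_tolerate):
    """Fraction of the resource pool that workloads may actually use."""
    total = sum(host_capacities)
    return (total - reserved_capacity(host_capacities, failures_to_tolerate)) / total


# Example: four hosts with 16 units of capacity each, tolerating two host failures.
capacities = [16, 16, 16, 16]
print(admit(capacities, used=24, demand=8, failures_to_tolerate=2))  # True
print(admit(capacities, used=24, demand=9, failures_to_tolerate=2))  # False
print(usable_fraction(capacities, failures_to_tolerate=2))           # 0.5
```

Under these assumptions, half of the pool sits idle in a four-host cluster that tolerates two failures, which illustrates the utilization cost noted above.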