A computer “cluster” typically refers to a group of linked computers (also referred to herein as “hosts”) that are deployed in an aggregate, and a so-called “high availability” cluster is one in which redundant computing resources are provided in case of hardware failure.
In a virtual machine environment, each host in a cluster can support multiple virtual machines. In a high availability cluster in such a virtual machine environment, when a host fails, each of the virtual machines running on the host is re-instantiated on another host in the cluster that has sufficient resources to support that virtual machine (such re-instantiation being referred to as “failover”). Current methods of detecting host failure and performing failover depend upon a software agent running on each host in the cluster. These agents communicate with each other through a common network (typically, a private network that differs from a network utilized by the virtual machines to provide services) to coordinate activity. Such coordination includes selecting one or more “primary” agents having the responsibility of: (a) synchronizing cluster state and configuration information across the cluster, (b) monitoring the condition of hosts in the cluster (e.g., by receiving TCP messages from the hosts that indicate “liveness”), and (c) directing the initiation of failover upon detecting a failure.
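By way of illustration only, the monitoring and failover responsibilities of a primary agent described above may be sketched as follows. The sketch is a simplified, in-memory model and not an actual agent implementation: the names `VM`, `Host`, and `PrimaryAgent`, the 15-second timeout, the use of memory as the only resource considered, and the first-fit placement rule are all assumptions made for the example, and liveness messages are modeled as timestamped method calls rather than as TCP messages on a network.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    memory_mb: int  # assumption: memory is the only resource tracked


@dataclass
class Host:
    name: str
    capacity_mb: int
    vms: list = field(default_factory=list)
    last_heartbeat: float = 0.0  # time of the most recent liveness message

    def free_mb(self) -> int:
        return self.capacity_mb - sum(vm.memory_mb for vm in self.vms)


class PrimaryAgent:
    """Hypothetical primary agent: tracks liveness and directs failover."""

    def __init__(self, hosts, timeout_s: float = 15.0):
        self.hosts = {h.name: h for h in hosts}
        self.timeout_s = timeout_s  # assumed value, chosen for illustration

    def record_heartbeat(self, host_name: str, now: float) -> None:
        # Stands in for receipt of a liveness message from a host.
        self.hosts[host_name].last_heartbeat = now

    def failed_hosts(self, now: float):
        # A host is treated as failed once it has been silent too long.
        return [h for h in self.hosts.values()
                if now - h.last_heartbeat > self.timeout_s]

    def fail_over(self, now: float):
        failed = self.failed_hosts(now)
        failed_names = {h.name for h in failed}
        survivors = [h for h in self.hosts.values()
                     if h.name not in failed_names]
        placements, unplaced = {}, []
        for host in failed:
            for vm in list(host.vms):
                # First-fit: pick the first surviving host with enough room.
                target = next((s for s in survivors
                               if s.free_mb() >= vm.memory_mb), None)
                if target is None:
                    unplaced.append(vm.name)
                    continue
                host.vms.remove(vm)
                target.vms.append(vm)
                placements[vm.name] = target.name
        return placements, unplaced
```

In this sketch, a call such as `agent.fail_over(now)` returns a mapping from each re-instantiated virtual machine to its new host, together with a list of any virtual machines for which no surviving host had sufficient free memory. A practical agent would additionally need to synchronize cluster state among agents and to distinguish host failure from network isolation, neither of which is modeled here.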