A conventional cluster system typically includes a centralized availability manager that monitors the nodes in the cluster system to ensure that the nodes are operational. FIG. 1A illustrates one conventional cluster system adopting such a scheme. The system 100 includes a centralized availability manager 110 and two bare nodes 120 and 130, coupled to each other via a network 140. Each of the bare nodes 120 and 130 is implemented on a physical computing machine. Multiple virtual machines (e.g., virtual machines 125) are emulated on each of the bare nodes 120 and 130. Each virtual machine in the system 100 can be considered a node as well.
In the cluster system 100, the centralized availability manager 110 healthchecks each node (which may be a virtual machine or a bare node) to ensure that the software running on that virtual machine or bare node is operational and no software or hardware fault has occurred. With thousands of nodes in some conventional cluster systems, and hundreds of virtual machines per bare node, the centralized availability manager 110 can consume more bandwidth than is available, just for healthchecking operations. In one implementation, the healthchecking bandwidth requirements are given by the equation: B = (N*M*D)/P, where B is bandwidth required, N is the number of bare nodes, M is the number of virtual nodes, D is the amount of data transferred per healthcheck, and P is the periodicity of the healthcheck. Using the above equation, the bandwidth requirements for even small scale cluster systems, which include merely thousands of nodes, can be significant, as shown in the graph 190 illustrated in FIG. 1B.
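As a worked illustration of the equation B = (N*M*D)/P, the following minimal sketch evaluates it for one set of inputs. The function name, the units (bytes and seconds), and the sample values are assumptions chosen for illustration only and are not taken from FIG. 1B. The sketch also assumes that M is counted per bare node, which is how the product N*M yields a total node count.

```python
def healthcheck_bandwidth(n_bare_nodes, m_virtual_nodes, data_per_check, period):
    """Evaluate B = (N * M * D) / P.

    n_bare_nodes    -- N, number of bare nodes
    m_virtual_nodes -- M, number of virtual nodes (assumed here: per bare node)
    data_per_check  -- D, data transferred per healthcheck (assumed unit: bytes)
    period          -- P, periodicity of the healthcheck (assumed unit: seconds)

    Returns B in bytes per second under the assumed units.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    return (n_bare_nodes * m_virtual_nodes * data_per_check) / period


if __name__ == "__main__":
    # Hypothetical inputs: 1000 bare nodes, 100 virtual nodes each,
    # 1024 bytes per healthcheck, one healthcheck per second.
    b = healthcheck_bandwidth(1000, 100, 1024, 1.0)
    print(f"Required bandwidth: {b / 1e6:.1f} MB/s")
```

Under these hypothetical inputs the equation gives 102,400,000 bytes per second. Because B is inversely proportional to P, doubling the healthcheck period halves the required bandwidth, at the cost of slower fault detection.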