Current High Availability (HA) cluster server products use heartbeating methods to monitor the health and status of each host computer in a cluster of host computers. These heartbeating methods may involve the utilization of both network heartbeats and storage heartbeats to identify failures as well as suppress false alerts.
One particular heartbeating method of interest uses a single elected host computer as a master node to monitor all other host computers that function as slave nodes. In this method, the slave nodes send network heartbeat messages to the master node periodically, e.g., every 1 second, and the master node replies to the messages. If a network heartbeat message is missing from any slave node, the master node determines the state of that slave node via an Internet Control Message Protocol (ICMP) ping check and a storage heartbeat check. If both checks fail, then the master node triggers a failover for that slave node. When a network heartbeat reply is missing from the master node, an election will take place and a new master node will take control of the ping check and the storage heartbeat check with respect to the old master node. If these checks fail, the new master node starts a failover operation for the old master node.
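The master node's decision for a single slave node, as described above, can be illustrated with a minimal sketch. The names `Verdict` and `evaluate_slave` are hypothetical and introduced only for illustration. The intermediate outcome for a slave node that misses a heartbeat but passes at least one check is an assumption, since the description above states only that failover is triggered when both checks fail.

```python
from enum import Enum


class Verdict(Enum):
    """Hypothetical outcomes of the master node's evaluation of one slave node."""
    HEALTHY = "healthy"                        # network heartbeat arrived on time
    ALIVE_NO_HEARTBEAT = "alive_no_heartbeat"  # assumed label: heartbeat missing, but a check passed
    FAILOVER = "failover"                      # heartbeat missing and both checks failed


def evaluate_slave(heartbeat_received: bool,
                   ping_ok: bool,
                   storage_heartbeat_ok: bool) -> Verdict:
    """Apply the master-side rule: fail over only if both follow-up checks fail."""
    if heartbeat_received:
        # The follow-up checks are consulted only when a heartbeat is missing.
        return Verdict.HEALTHY
    if not ping_ok and not storage_heartbeat_ok:
        return Verdict.FAILOVER
    # Assumption: a passing ping or storage heartbeat suppresses the failover.
    return Verdict.ALIVE_NO_HEARTBEAT
```

In this sketch, a slave node whose heartbeat is missing but whose storage heartbeat is still present is not failed over, which corresponds to the suppression of false alerts mentioned earlier.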
The above heartbeating method works well for cases of slave node failures. However, there are concerns with respect to this heartbeating method for cases of master node failures and network isolations or partitions.
For master node failures, the above heartbeating method does not respond as quickly as desired. By design, cluster-level actions, such as failovers, are triggered by the master node. Thus, in case of master node failure, a new master node must be elected before any action can be taken, which causes significant delay (e.g., 30 seconds). In addition, since only the master node maintains the overall inventory of the whole cluster while slave nodes only know about the local inventory, the new master node must collect inventory information from all the slave nodes before any action can be taken, which causes additional delay (e.g., about 5-10 seconds). Typically, responses to master node failures are approximately 30-45 seconds slower than responses to slave node failures. Moreover, if the new master node fails to quickly collect the latest inventory information, the new master node is unable to determine slave nodes that have just failed, which may cause inconsistencies or false alerts.
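The two delay components named above can be summed in a short sketch. The function name `extra_master_failover_delay` is hypothetical, and the default values are simply the example figures given above (30 seconds for the election and 5-10 seconds for inventory collection), not measured data.

```python
def extra_master_failover_delay(election_s: float = 30.0,
                                inventory_s: tuple = (5.0, 10.0)) -> tuple:
    """Return the (low, high) added delay, in seconds, for a master node failure.

    Both defaults are the example figures from the text, used as assumptions.
    """
    low, high = inventory_s
    # The election must finish before inventory collection can begin,
    # so the two delays add rather than overlap.
    return (election_s + low, election_s + high)
```

With the example figures, the sketch yields an added delay of 35 to 40 seconds, which falls within the 30-45 second range stated above.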
Similarly, for network partition/isolation cases, the above heartbeating method does not respond as quickly as desired. In network partition cases, nodes are separated into groups and some slave nodes cannot communicate with the master node. Thus, a new master node must be elected in each partition that lacks the master node before any action can be taken in that partition, which causes significant delay. In network isolation cases, none of the nodes can communicate with any other node. Thus, every node must take time to detect the isolation, which may include a silent election and several network communication timeouts. Typically, responses to network partitions or isolations are approximately 45 seconds slower than responses to slave node failures.
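The partition case can be illustrated with a minimal sketch that identifies which groups of nodes are left without the master node and therefore must hold an election before acting. The function name `partitions_needing_election` and the node identifiers are hypothetical, and the election procedure itself is not modeled.

```python
def partitions_needing_election(partitions: list, master: str) -> list:
    """Return the partitions (sets of node names) that do not contain the master node."""
    # A partition that still contains the master node keeps its existing master;
    # every other partition has no node able to trigger cluster-level actions.
    return [group for group in partitions if master not in group]


# Example: master "a" ends up in the first of three partitions.
groups = [{"a", "b"}, {"c", "d"}, {"e"}]
stranded = partitions_needing_election(groups, master="a")
```

Here `stranded` contains the two groups `{"c", "d"}` and `{"e"}`, each of which would incur the election delay described above.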
Another concern with the above heartbeating method is that the method is not well suited for supporting a large scale cluster, because only one master node monitors heartbeats for all the slave nodes in the cluster. As the number of slave nodes increases, the heartbeat processing load on the master node increases and the performance of the method degrades accordingly. Thus, there is a practical limitation to the size of the cluster that can be supported by the method.
Throughout the description, similar reference numbers may be used to identify similar elements.