Adapter and node liveness determination lies at the heart of any highly available distributed cluster system. To provide high-availability services, a cluster system should be able to determine which nodes, networks, and network adapters in the system are working. Failure in any such component should be detected, reported to higher-level software subsystems and, if possible, recovered from by the cluster software and applications.
Determination of node, network, and network adapter liveness is often made through the use of daemon processes running in each node of the distributed system. Daemons run distributed protocols and exchange liveness messages that are forced through the different network paths in the system. If no such liveness messages are received within a predetermined interval, the other nodes assume that the sending node or network adapter has failed (“died”).
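The timeout rule described above can be illustrated with a minimal Python sketch. It is not taken from any particular cluster product: the class name `LivenessMonitor`, its methods, and the peer names are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LivenessMonitor:
    """Hypothetical tracker of the last liveness message seen from each peer."""

    timeout: float  # seconds of silence after which a peer is presumed failed
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_message(self, peer: str, now: float) -> None:
        # Called whenever a liveness message from `peer` arrives.
        self.last_seen[peer] = now

    def failed_peers(self, now: float) -> List[str]:
        # A peer is presumed failed once it has been silent longer than `timeout`.
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = LivenessMonitor(timeout=3.0)
monitor.record_message("node-a", now=0.0)
monitor.record_message("node-b", now=0.0)
monitor.record_message("node-a", now=2.0)  # node-b stays silent
print(monitor.failed_peers(now=4.0))
```

In this sketch only `node-b` is reported at time 4.0, because `node-a` was heard from within the timeout. Note that the monitor cannot tell a dead peer from one whose messages were merely delayed or lost.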
In a high-availability cluster, precise determination of adapter, network, and node events is crucial, since a cluster recovery manager subsystem will react to such events in an attempt to give the appearance to the end-user that cluster resources are still available. For example, if a node in the cluster fails, the cluster manager transfers any resources being hosted or controlled by the failed node to another node that is still functioning. In such cases, if a given node is detected as down, then the correct behavior of the cluster depends on that node actually being down. Otherwise, two nodes in the cluster will both try to control the same resource. Such concurrent control of a resource may have devastating effects for the cluster, especially if the resource is a disk, in which case the result may be a corrupted file system.
Because the detection of failed nodes or network adapters is based on missing periodically sent liveness messages, the time it takes to detect a failure is related to how many liveness messages are allowed to be missed before a node is declared as being down. Detecting a failure quickly requires lowering the threshold for missed messages, but this approach has a downside. If the network has a short-lived outage, or the sending node's daemon cannot be scheduled for some period, a node may fail to send its liveness messages, possibly resulting in the remote node erroneously declaring the sending node as down (a “false node down” situation). Such occurrences have a negative impact on the cluster, since they force the cluster manager to recover from the perceived failure by moving resources to another node. In this regard, it should be fully appreciated that the shifting of resources can be both time consuming and consumptive of resources in its own right.
To alleviate the problem, the threshold for missed messages is usually set high enough that “short term outages” do not result in false “node down” indications, at the cost of a longer period between a failure and its detection by the remote node. During that period, the cluster is not providing services to its external users.
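The trade-off between detection speed and false “node down” indications can be made concrete with a small calculation. The sketch below uses a deliberately simplified model, assuming that a node is declared down after a fixed number of consecutive missed messages sent at a fixed interval, and ignoring jitter and message phase. The function names and the numbers are illustrative assumptions only.

```python
def detection_time(interval: float, missed_threshold: int) -> float:
    """Approximate silence (seconds) required before a peer is declared down."""
    if interval <= 0 or missed_threshold < 1:
        raise ValueError("interval must be positive and threshold at least 1")
    return interval * missed_threshold


def causes_false_node_down(outage: float, interval: float,
                           missed_threshold: int) -> bool:
    """True if a transient outage lasts long enough to trigger a down declaration."""
    return outage >= detection_time(interval, missed_threshold)


# Illustrative values: messages every 1 s, a transient outage of 3 s.
for threshold in (2, 5):
    print(
        threshold,
        detection_time(1.0, threshold),
        causes_false_node_down(3.0, 1.0, threshold),
    )
```

Under these assumptions, a threshold of 2 detects a real failure after about 2 seconds but treats the 3-second outage as a node failure, while a threshold of 5 tolerates the outage but leaves a real failure undetected for about 5 seconds.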