A high availability (“HA”) system is a system that is resilient to failures of its components. Typically, this is achieved through redundancy: if one component fails, a redundant component takes over the tasks of the failed component.
HA devices, such as edge nodes, may be grouped into clusters. The nodes in a cluster may work as a team to provide services even if some of the nodes fail. As long as at least one of the nodes in a cluster remains active, the cluster may provide the services configured on the nodes. Examples of the services may include load balancing, traffic forwarding, data packet processing, VPN services, DNS services, and the like.
Nodes in a cluster may operate in either an active mode or a standby mode. If a node in a cluster fails, then, if possible, a surviving node assumes an active role and provides the services that were configured on the failed node.
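The active/standby takeover described above can be illustrated with a minimal sketch. The names `Mode`, `Node`, and `fail_over`, and the policy of preferring a standby node over an already-active one, are hypothetical and chosen only for illustration; an actual cluster may select the successor differently.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    FAILED = "failed"


@dataclass
class Node:
    name: str
    mode: Mode
    services: set = field(default_factory=set)


def fail_over(cluster, failed_name):
    """Mark a node failed and move its services to a surviving node, if any.

    Returns the node that took over, or None if no node survives.
    """
    failed = next(n for n in cluster if n.name == failed_name)
    failed.mode = Mode.FAILED

    survivors = [n for n in cluster if n.mode is not Mode.FAILED]
    if not survivors:
        return None  # no surviving node: the cluster can no longer serve

    # Hypothetical policy: prefer a standby node, else an already-active one.
    survivors.sort(key=lambda n: n.mode is not Mode.STANDBY)
    successor = survivors[0]
    successor.mode = Mode.ACTIVE
    successor.services |= failed.services
    failed.services = set()
    return successor
```

For example, in a two-node cluster where `edge-1` is active with VPN and DNS services and `edge-2` is on standby, calling `fail_over(cluster, "edge-1")` would leave `edge-2` active and carrying both services.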
Unfortunately, detecting node failures in a cluster is often inefficient and difficult. Typically, HA nodes in a cluster communicate with each other via Bidirectional Forwarding Detection (“BFD”) channels. However, because a BFD channel may be configured with an aggressive timer, relying on communications exchanged via the BFD channel may lead to false detections of failures. For example, when no response is received to three consecutive packets sent to a node, an aggressive timer may flag the node as failed even if the node is still healthy. This may happen because BFD traffic is usually carried alongside user traffic over the same channel, so responses from the node may be lost to congestion caused by high-volume user traffic rather than to any failure of the node. As a result, a failure to detect BFD control packets from the node in time may trigger a failover even though the node is still healthy.
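The false-detection scenario can be made concrete with a small sketch. It is not a BFD implementation: `detect_failure` and `detection_time_ms` are hypothetical helpers, the 50 ms interval is an assumed example value, and the threshold of three consecutive missed responses follows the example in the text.

```python
def detect_failure(replies, detect_mult=3):
    """Return the index of the probe at which the peer is declared down, or None.

    replies: iterable of booleans, True if a response to that probe arrived in time.
    detect_mult: consecutive missed responses that trigger a failure
                 (3 follows the example in the text).
    """
    missed = 0
    for i, got_reply in enumerate(replies):
        missed = 0 if got_reply else missed + 1
        if missed >= detect_mult:
            return i
    return None


def detection_time_ms(tx_interval_ms, detect_mult=3):
    """Approximate time to declare failure: interval times the miss threshold."""
    return tx_interval_ms * detect_mult


# A healthy node whose responses to probes 3, 4, and 5 are dropped by congestion.
congested = [True, True, True, False, False, False, True, True]
false_alarm_at = detect_failure(congested)  # node is flagged despite being healthy

# Assumed example: a 50 ms interval gives a short detection window.
window_ms = detection_time_ms(50)
```

With a short interval, even a brief burst of congestion spans the whole detection window, which is why an aggressive timer is prone to declaring a healthy node failed.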