Some networks deploy a pair of network nodes (e.g., network appliances such as switches or gateways), one active and one standby, to provide high availability (HA), so that the services the nodes provide remain available even when a hardware or software failure renders a single node unavailable. One node operates in active mode: it forwards traffic and provides other network services. The other node operates in standby mode, waiting to take over should the active node fail. The active and standby nodes maintain a heartbeat with each other, e.g., by relying on a bidirectional forwarding detection (BFD) protocol.
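The heartbeat-based takeover described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `StandbyMonitor`, the `timeout_s` parameter, and the use of plain timestamps are hypothetical, and the sketch does not model BFD itself.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Hypothetical standby-side monitor: tracks heartbeats from the active peer."""
    timeout_s: float               # assumed detection time, not a value from the text
    last_heartbeat_s: float = 0.0  # timestamp of the most recent heartbeat
    mode: str = "standby"

    def on_heartbeat(self, now_s: float) -> None:
        # Record the arrival time of a heartbeat from the active node.
        self.last_heartbeat_s = now_s

    def tick(self, now_s: float) -> str:
        # If no heartbeat arrived within the timeout, the standby declares itself active.
        if self.mode == "standby" and now_s - self.last_heartbeat_s > self.timeout_s:
            self.mode = "active"
        return self.mode


monitor = StandbyMonitor(timeout_s=3.0)
monitor.on_heartbeat(1.0)
print(monitor.tick(2.0))  # still within the timeout
print(monitor.tick(4.5))  # 3.5 s of silence exceeds the 3.0 s timeout
```

Note that the monitor cannot distinguish a failed peer from a lost heartbeat; it sees only silence.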
A split-brain condition occurs when both nodes become active and serve the same set of users, whereas those users should be served by one node in active mode while the other node remains in standby mode. The condition can occur when the standby node does not receive a heartbeat from the active node within a specified time and therefore declares itself active while the original active node is still active. The condition can also occur when the active node goes down, the standby node declares itself active, and the previously active node later becomes functional again. When both nodes are in active mode, one node has to go back to standby mode while the other remains active. The nodes are typically assigned a rank or priority, and the node with the lower rank is forced back to standby mode. The ranks are configured by the user and do not change.
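A minimal sketch of the static-rank resolution described above is shown below. The function name `resolve_by_rank`, the node labels, and the convention that a larger integer means a higher rank are assumptions made for illustration.

```python
from typing import Dict


def resolve_by_rank(rank_a: int, rank_b: int) -> Dict[str, str]:
    """Return the modes of nodes A and B after both were found active.

    Assumption: a larger integer means a higher rank. Ranks are fixed by the
    user, so the outcome never depends on which node was serving traffic.
    """
    if rank_a == rank_b:
        raise ValueError("static ranks must differ to resolve the split-brain")
    if rank_a > rank_b:
        return {"A": "active", "B": "standby"}
    return {"A": "standby", "B": "active"}


print(resolve_by_rank(rank_a=2, rank_b=1))
```

Because the function takes only the ranks as input, it has no way to prefer the node that was already serving users.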
To see the problem with this approach, assume that before the split-brain condition node A has a higher rank than node B. Node A is therefore active and node B is standby. If node B does not receive the heartbeat from node A, node B also becomes active, so both nodes are active. If node A subsequently fails, node B continues as the active node. After node A recovers, node A may become active and node B may be forced to standby because node A is assigned the higher rank. In other words, the higher-ranked node takes over and becomes the active node no matter which node was active prior to the split-brain condition. The desired healing procedure is for node B to remain active (because it is already active when node A recovers) and for node A to go back to standby. The preemptive flipping of the active node from node B to node A is unnecessary in this case: even though the user assigns different ranks to the two nodes, the nodes may be equally equipped in terms of processing, memory, and other resources.
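The sequence of events in this example can be traced with a short simulation. It is a sketch under assumptions: the `Node` type, the numeric ranks, and the choice that a recovered node A comes back up in active mode are illustrative, not taken from any particular product.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    name: str
    rank: int  # assumption: larger value means higher rank
    mode: str  # "active", "standby", or "down"


def heal_by_rank(a: Node, b: Node) -> None:
    # Static-rank healing: if both nodes are active, demote the lower-ranked one.
    if a.mode == "active" and b.mode == "active":
        loser = a if a.rank < b.rank else b
        loser.mode = "standby"


def run_scenario() -> List[Tuple[str, str, str]]:
    a = Node("A", rank=2, mode="active")
    b = Node("B", rank=1, mode="standby")
    history = []

    b.mode = "active"  # B misses the heartbeat and declares itself active
    history.append(("heartbeat lost", a.mode, b.mode))

    a.mode = "down"    # A fails; B keeps serving
    history.append(("A fails", a.mode, b.mode))

    a.mode = "active"  # assumed: A recovers and comes up active
    heal_by_rank(a, b)
    history.append(("A recovers", a.mode, b.mode))
    return history


for event, mode_a, mode_b in run_scenario():
    print(f"{event}: A={mode_a}, B={mode_b}")
```

The last step shows the preemption: node B, which was serving users, is demoted only because of its configured rank.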
Another approach to resolving the split-brain condition is to use a third entity. When no priority is assigned to the nodes, the two nodes may end up in a tie, with both wanting to go active. The tie is broken by the third entity (e.g., the management plane and/or the control plane of the network). Relying on a third entity raises other issues. Because the third entity may be located in the management plane and/or the control plane of the network, messages from the network nodes to the third entity may get lost, the third entity may hold an incorrect view of the state of the active and standby nodes, and so on.
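How a lost message can mislead the third entity can be illustrated with a small sketch. Everything here is hypothetical: the `Arbiter` class, the `delivered` flag used to simulate message loss, and the tie-break policy of keeping the node last reported as active are assumptions chosen only to make the failure mode concrete.

```python
from typing import Dict, Optional


class Arbiter:
    """Hypothetical third entity that records the last state each node reported."""

    def __init__(self) -> None:
        self.view: Dict[str, str] = {}

    def report(self, node: str, mode: str, delivered: bool = True) -> None:
        # A lost message never reaches the arbiter, so its view goes stale.
        if delivered:
            self.view[node] = mode

    def break_tie(self, x: str, y: str) -> Optional[str]:
        # Assumed policy: keep the node that the arbiter believes is active.
        believed_active = [n for n in (x, y) if self.view.get(n) == "active"]
        return believed_active[0] if len(believed_active) == 1 else None


arbiter = Arbiter()
arbiter.report("A", "active")
arbiter.report("B", "standby")
# A fails and B takes over, but both state updates are lost in transit.
arbiter.report("A", "down", delivered=False)
arbiter.report("B", "active", delivered=False)
# A recovers and both nodes want to be active; the arbiter decides on stale data.
print(arbiter.break_tie("A", "B"))
```

With the stale view, the arbiter keeps node A active even though node B was the one serving users.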