Certain computing systems, such as network gateways, routers, and switches, are intended to provide services continuously, without interruption. Such computing systems are often configured as high-availability clusters that include two or more nodes that are collectively capable of providing high availability of services. For example, in a typical configuration, a high-availability cluster may include one or more active nodes that actively perform computing tasks associated with the services provided by the high-availability cluster and one or more standby nodes to which computing tasks may fail over in the event of an active-node failure.
In general, if a standby node detects that an active node has failed, the standby node will begin performing the computing tasks that were assigned to the failed active node. In a typical high-availability cluster, the detection of node failures is made possible by a heartbeat mechanism in which the nodes of the high-availability cluster periodically exchange heartbeat messages that indicate their health statuses. In this way, a standby node may detect that an active node has failed by detecting when expected heartbeat messages are not received from the active node.
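The heartbeat-based detection described above can be sketched as a simple timeout check. This is a minimal illustration only, not the mechanism of any particular cluster product: the `HeartbeatMonitor` class, its method names, the peer name `active-1`, and the 3-second timeout are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer (hypothetical helper)."""

    timeout: float  # seconds of silence before a peer is presumed failed (assumed value)
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, peer: str, now: float) -> None:
        # Remember when the most recent heartbeat from this peer arrived.
        self.last_seen[peer] = now

    def presumed_failed(self, peer: str, now: float) -> bool:
        # Only silence after at least one received heartbeat counts as a failure;
        # a peer that has never been heard from is not judged here.
        last = self.last_seen.get(peer)
        return last is not None and (now - last) > self.timeout


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record_heartbeat("active-1", now=0.0)
print(monitor.presumed_failed("active-1", now=2.0))  # False: still within the timeout
print(monitor.presumed_failed("active-1", now=5.0))  # True: expected heartbeats are missing
```

Note that the monitor can only observe the absence of heartbeat messages; it cannot tell from silence alone why the messages stopped arriving.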
Unfortunately, in some situations an active node and a standby node of a high-availability cluster may become isolated from one another by a partitioning event such that the active node and the standby node are both healthy but unable to exchange heartbeat messages. These situations may lead to a scenario (commonly known as a “split-brain” scenario) in which a standby node of a high-availability cluster mistakenly determines that an active node has failed and attempts to perform computing tasks that are similar or identical to those the active node is still performing, potentially resulting in data corruption and/or service unavailability. As such, the instant disclosure identifies and addresses a need for improved systems and methods for preventing split-brain scenarios in high-availability clusters.
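The split-brain scenario described above can be illustrated with a toy simulation. The sketch below assumes a deliberately naive failover rule (any missed heartbeat is treated as an active-node failure); the `Node` class, the `simulate` function, and the node names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str  # "active" or "standby"
    healthy: bool = True


def missed_heartbeat(sender: Node, link_up: bool) -> bool:
    """The receiver misses a heartbeat if the sender is down OR the link is cut."""
    return not (sender.healthy and link_up)


def simulate(link_up: bool, active_healthy: bool) -> list:
    active = Node("node-a", "active", healthy=active_healthy)
    standby = Node("node-b", "standby")
    # Naive failover rule (assumed): a missed heartbeat means the active node failed.
    if missed_heartbeat(active, link_up):
        standby.role = "active"
    # Names of healthy nodes that currently believe they are active.
    return [n.name for n in (active, standby) if n.healthy and n.role == "active"]


print(simulate(link_up=True, active_healthy=True))    # ['node-a']: normal operation
print(simulate(link_up=True, active_healthy=False))   # ['node-b']: legitimate failover
print(simulate(link_up=False, active_healthy=True))   # ['node-a', 'node-b']: split-brain
```

In the third case both nodes are healthy, yet the standby node cannot distinguish the partition from a genuine failure, so two nodes end up performing the same tasks at once.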