Computer clusters often include multiple nodes that are configured to collectively perform one or more computing tasks, such as providing access to various services (such as applications or databases). Unfortunately, nodes within a computer cluster may occasionally experience communication failures that prevent the nodes from collectively performing such computing tasks. For example, a node within a computer cluster may experience a network-interface-controller (“NIC”) failure that prevents the node from communicating with other nodes within the computer cluster. This communication failure may lead to a scenario (commonly known as a “split-brain” scenario) in which multiple nodes within the computer cluster attempt to individually perform similar or identical computing tasks (such as writing data to and/or reading data from a shared resource) without communicating with one another, potentially resulting in data corruption and/or application unavailability.
To help resolve split-brain scenarios, nodes within a computer cluster may employ failure-detection technology that monitors the health and communication capabilities of each node. Conventional failure-detection technologies are typically divided into two main techniques: (1) link-based failure-detection techniques and (2) probe-based failure-detection techniques. In link-based failure-detection techniques, a node's network-interface driver may monitor the link state of the node's network interface and immediately notify the node when this link state changes. In contrast, in probe-based failure-detection techniques, a node may periodically (e.g., every 16 or 32 seconds) probe or test various communication paths within the computer cluster to ensure that they are active.
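The difference in detection latency between the two techniques can be illustrated with a minimal sketch. The function names, the time units (seconds), and the assumption that probes fire on a fixed schedule starting at time zero are hypothetical and chosen only for illustration; they are not drawn from any particular cluster product.

```python
import math


def link_detection_time(failure_time: float) -> float:
    """Link-based detection: the driver reports the link-state change
    as soon as it happens, so detection coincides with the failure."""
    return failure_time


def probe_detection_time(failure_time: float, probe_interval: float) -> float:
    """Probe-based detection: the failure is noticed only at the first
    scheduled probe at or after the failure. Probes are assumed
    (hypothetically) to fire at 0, probe_interval, 2 * probe_interval, ..."""
    ticks = math.ceil(failure_time / probe_interval)
    return ticks * probe_interval


if __name__ == "__main__":
    failure = 5.0  # seconds; illustrative value
    print(link_detection_time(failure))
    print(probe_detection_time(failure, 16.0))
```

Under these assumptions, a failure that occurs shortly after a probe may go unnoticed by a probing node for nearly a full probe interval, whereas the node with the failed link learns of it at once.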
Upon detecting a communication failure using either technique, a node may attempt to contact a coordination point server in an effort to secure responsibility for performing the computing tasks originally collectively performed by all nodes within the computer cluster. The coordination point server may then select a subcluster of nodes within the computer cluster that is to assume responsibility for performing the computing tasks based on a number of factors, including which subcluster includes the node that was the first to contact the coordination point server subsequent to the communication failure.
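The arbitration described above may be sketched as follows. This is a simplified illustration that models only the first-to-contact factor; an actual coordination point server may weigh additional factors. The class name `CoordinationPoint`, the method `request`, and the node labels are hypothetical.

```python
from typing import FrozenSet, Optional


class CoordinationPoint:
    """Hypothetical arbiter that grants responsibility to the subcluster
    containing the first node to make contact after a failure."""

    def __init__(self) -> None:
        self.winner: Optional[FrozenSet[str]] = None

    def request(self, node: str, subcluster: FrozenSet[str]) -> bool:
        # The first request decides the race; later requests succeed
        # only if the requesting node belongs to the winning subcluster.
        if self.winner is None:
            self.winner = subcluster
        return node in self.winner


if __name__ == "__main__":
    cp = CoordinationPoint()
    print(cp.request("A", frozenset({"A"})))       # A contacts first
    print(cp.request("B", frozenset({"B", "C"})))  # B arrives later
```

In this sketch, node A secures responsibility for its single-node subcluster, and the later request from node B is refused even though B's subcluster contains more nodes.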
Unfortunately, the above process (often referred to as an arbitration event or “fencing race”) may be heavily biased in favor of nodes that receive link-based failure notifications. For example, since a node typically receives immediate notification upon experiencing a link-based failure (such as failure of the node's NIC), this node may identify such a communication failure several seconds before all other nodes within the computer cluster. These other nodes may remain unaware of the node's NIC failure until they detect it via a periodic (e.g., 16- or 32-second-interval) probe-based failure-detection technique. Since a coordination point server generally limits responsibility for performing computing tasks to a subcluster that includes the node that is the first to contact the coordination point server, these other nodes may be unfairly disadvantaged in their efforts to secure responsibility for performing the computing tasks. As such, the instant disclosure identifies a need for efficiently and effectively resolving split-brain scenarios in computer clusters by immediately notifying each node within a computer cluster of communication failures.
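The bias in the fencing race can be made concrete with a small, hedged simulation. It assumes, purely for illustration, that each node contacts the coordination point server at the moment it detects the failure, that probes fire at multiples of the probe interval, and that network delay is negligible. The function name `fencing_race` and the node labels are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def fencing_race(
    failure_time: float,
    probe_interval: float,
    failed_node: str,
    nodes: List[str],
) -> Tuple[str, Dict[str, float]]:
    """Return the first node to contact the arbiter and every node's
    contact time, under the simplifying assumptions stated above."""
    contact: Dict[str, float] = {}
    for node in nodes:
        if node == failed_node:
            # Link-based notification: detected at the moment of failure.
            contact[node] = failure_time
        else:
            # Probe-based detection: first probe at or after the failure.
            contact[node] = math.ceil(failure_time / probe_interval) * probe_interval
    first = min(contact, key=contact.get)
    return first, contact


if __name__ == "__main__":
    first, times = fencing_race(5.0, 16.0, "A", ["A", "B", "C"])
    print(first, times)
```

With a NIC failure on node A at 5 seconds and a 16-second probe interval, node A contacts the arbiter at 5 seconds while nodes B and C do so at 16 seconds, giving the node with the failed NIC an 11-second head start in this sketch.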