Clustering involves multiple computer systems, also referred to as nodes, cooperating with each other in order to provide increased availability and/or load balancing. A cluster is usually viewed as a single entity by clients that use the cluster's services. In a system that is designed to provide high availability, a cluster can continue providing a particular application service to clients even if one of the nodes included in that cluster fails. In other words, programs executing on the cluster can continue to function despite a problem with one node in the cluster.
Clusters usually include resources that are shared between several of the nodes. The most common shared resource in a cluster is some form of shared data resource, such as one or more storage devices. Using a shared data resource gives different nodes in the cluster access to the same data, a feature that is critical for most clustered applications. Although a storage device is perhaps the most common example of both a shared resource and a shared data resource, a variety of other types of shared devices (e.g., SAN (Storage Area Network) switches) will be well known to those having ordinary skill in the art.
Nodes within a cluster can be connected to each other by a network (e.g., a local area network (LAN) or wide area network (WAN) such as the Internet). Such a network can also facilitate communication between the cluster and its clients. The cluster nodes communicate with each other in order to coordinate their activities (e.g., to coordinate the migration of a service from one node to another during cluster maintenance, to coordinate updates to cluster membership information when nodes enter or leave the cluster, and the like). Nodes also communicate with each other in order to detect whether one of the nodes has failed. For example, nodes can exchange ‘heartbeat’ messages, which indicate that each node is currently available to the cluster and functioning properly. Often, this cluster-specific communication is implemented using a private network that links the cluster nodes (as opposed to using the same network that is also used to communicate with clients). If the failure of a node is detected (e.g., due to that node failing to send a heartbeat message within a given interval), the application services provided by that node may be relocated to another node.
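The heartbeat-based failure detection described above can be sketched as follows. This is a minimal illustration only, not the mechanism of any particular clustering product: the class name `HeartbeatMonitor`, the method names, and the 3-second timeout are all assumptions introduced for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks when a heartbeat was last received from each peer node (hypothetical helper)."""

    timeout: float  # seconds of silence after which a peer is suspected; assumed value
    last_seen: dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        """Note that a heartbeat message from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def suspected_failed(self, now: float) -> list[str]:
        """Return the peers whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items() if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record_heartbeat("node-a", now=0.0)
monitor.record_heartbeat("node-b", now=0.0)
monitor.record_heartbeat("node-a", now=2.5)  # node-b stays silent

suspects = monitor.suspected_failed(now=4.0)
print(suspects)
```

Note that a missed heartbeat only shows that no message arrived in the interval. On its own, it does not distinguish a node that has failed from a node that is still running but cannot communicate.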
If communication between cluster nodes is disrupted, a cluster can effectively be partitioned into two or more sub-clusters. Such a communication disruption can arise for a variety of reasons, including failure of a device within the communication network itself and an excessive processing load on one node interfering with that node's ability to send heartbeat messages. As a result of the communication disruption, a node can communicate with nodes in the same sub-cluster but cannot communicate with nodes in other sub-clusters. Accordingly, the node may erroneously determine that a node in another sub-cluster is no longer present in the cluster or is no longer functioning properly, and vice versa. If these two nodes provide the same service within the cluster, there is a possibility that both will assume that the other has failed, and that each node will consequently assume that it has exclusive access to a shared resource (e.g., a storage volume) used to provide that service. If both nodes modify the shared resource, a ‘split-brain’ situation may result. Such a split-brain situation arises when two nodes, each unaware of the other, both use a shared resource (e.g., both may be executing the same application and consequently modifying data in a shared storage device) and cause data corruption or otherwise inconsistent cluster operation to occur.
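The way a communication disruption divides a cluster into sub-clusters can be illustrated by grouping nodes according to which peers they can still reach. The sketch below assumes an idealized global view of the working links, which no single node has during a real partition. The function name `sub_clusters` and the node names are hypothetical.

```python
def sub_clusters(nodes, links):
    """Group nodes into sets that can still reach each other.

    nodes: iterable of node names.
    links: iterable of (a, b) pairs, each a working bidirectional link.
    Returns a list of sorted node-name lists, one per sub-cluster.
    """
    nodes = list(nodes)
    neighbors = {node: set() for node in nodes}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    groups = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first walk over the working links reachable from `start`.
        group = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


cluster = ["n1", "n2", "n3", "n4"]
healthy = sub_clusters(cluster, [("n1", "n2"), ("n2", "n3"), ("n3", "n4")])
# The link between n2 and n3 fails, splitting the cluster in two.
partitioned = sub_clusters(cluster, [("n1", "n2"), ("n3", "n4")])
print(healthy)
print(partitioned)
```

In the partitioned case, a node in one group receives no heartbeats from the other group, which is the condition under which each side may wrongly conclude that the other has failed.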
In order to prevent split-brain situations, cluster software is often configured to shut down all but one sub-cluster upon detection of a cluster partition. Such clustering software determines whether a node has failed or whether that node is simply not communicating properly. If the latter case exists, a cluster partition has occurred and all but one of the resulting sub-clusters are shut down in order to prevent them from being able to access shared resources. While these schemes prevent split-brain situations, it is desirable to have techniques that prevent split-brain situations while also maintaining better utilization of cluster nodes after detection of a cluster partition.
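The "shut down all but one sub-cluster" policy can be sketched as below. The text does not specify how the surviving sub-cluster is chosen, so the rule used here (largest sub-cluster wins, ties broken by the lowest node name) is purely an assumption for illustration, as is the function name `choose_survivor`. The sketch also assumes a global view of all sub-clusters, which a real partitioned cluster would not have.

```python
def choose_survivor(groups):
    """Pick one sub-cluster to keep running and list the rest for shutdown.

    groups: non-empty list of sorted node-name lists, one per sub-cluster.
    Selection rule (an assumption, not taken from the text): prefer the
    largest sub-cluster, breaking ties by the lexicographically smallest list.
    Returns (survivor, shut_down).
    """
    if not groups:
        raise ValueError("at least one sub-cluster is required")
    survivor = min(groups, key=lambda group: (-len(group), group))
    shut_down = [group for group in groups if group is not survivor]
    return survivor, shut_down


survivor, shut_down = choose_survivor([["n1"], ["n2", "n3"], ["n4"]])
idle_nodes = sum(len(group) for group in shut_down)
print(survivor, shut_down, idle_nodes)
```

In this example two of the four nodes are shut down even though they are still functioning, which illustrates the reduced utilization that such schemes accept in exchange for preventing split-brain situations.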