Clustered computing systems can be used to reliably store data and host applications. In a clustered computing system, two or more cluster nodes are connected in a distributed architecture, where the cluster nodes operate in a coordinated manner to respond to requests from host systems (e.g., end-user computing devices, external distributed computing systems, etc.). In a clustered computing system used to implement a distributed storage system, each cluster node may be a storage controller that is used to manage physical or logical storage volumes. The cluster nodes are communicatively coupled to form a cluster, and each node in the cluster can communicate with the other nodes in the cluster.
A clustered architecture allows convenient scaling through the addition of more cluster nodes, which are capable of communicating with each other. In some cases, a storage cluster may present a single system image of stored data to hosts and administrators. For example, from the perspective of a host system that accesses a distributed storage area network (“SAN”) environment, a given cluster may appear as a single logical unit, which can be identified by a logical unit number (“LUN”). The host system is not aware of the existence of logical sub-units of the distributed LUN (e.g., the nodes in the cluster that is represented as the LUN). Thus, if a data operation in the SAN involves access to a particular logical sub-unit, this data operation and the actual location of the data are transparent to the host system.
However, as the number of nodes increases, maintaining a consistent single cluster configuration across the cluster nodes becomes a challenge, since management and control operations may be performed on multiple nodes simultaneously. Thus, the consistency of the cluster configuration may be negatively impacted if communications between cluster nodes experience intermittent failures.
For example, communication problems within a cluster can cause nodes that have lost connectivity with the rest of the cluster to become unresponsive. This can result in one or more partitions of the cluster. A partition is a condition in which not all of the nodes in a cluster can communicate with one another. For example, a node in a subset (i.e., a partition) of the nodes may only be able to communicate with other nodes in the subset. Nodes within a partition can maintain connectivity with one another, but the nodes within a first partition cannot be accessed via nodes of a second partition. This can result in a “split brain” configuration, in which one or more nodes in a cluster may operate autonomously from other nodes rather than all cluster nodes operating in a coordinated manner.
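As an illustration only, a partition can be modeled as a group of nodes that remain mutually reachable over the links that still work. The following minimal Python sketch makes that concrete. The function name `find_partitions`, the node names, and the representation of connectivity as a list of working links are assumptions introduced for this example, as is the simplification that reachability is transitive.

```python
from collections import defaultdict


def find_partitions(nodes, links):
    """Group nodes into partitions: sets of nodes that can still reach one another.

    `links` is a list of (node_a, node_b) pairs for connections that currently work.
    Assumption for this sketch: reachability is transitive across working links.
    """
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    partitions = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collects every node reachable from `start`.
        group = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        partitions.append(group)
    return partitions


nodes = ["n1", "n2", "n3", "n4"]
# Hypothetical failure: no working link remains between {n1, n2} and {n3, n4}.
links = [("n1", "n2"), ("n3", "n4")]
partitions = find_partitions(nodes, links)
```

With the hypothetical links shown, the four nodes fall into two partitions, and no node in one partition can be reached through a node in the other.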
In some cases, a quorum-based fencing algorithm can be used to reduce the negative impacts associated with a split-brain configuration. For example, a “quorum” of nodes (i.e., a partition having the largest number of nodes in communication with one another) can be selected as the “active” partition. Out-of-quorum nodes can be fenced by preventing host systems or other devices from accessing resources (e.g., logical storage units) that are owned or managed by the out-of-quorum nodes. But using a quorum-based procedure for selecting an active partition may not properly account for certain in-use resources being owned or managed by an out-of-quorum node. For example, from the perspective of a host system, the fenced node whose access is blocked (i.e., the out-of-quorum node) may still have active logical interfaces that have been used by the host system to access a logical storage unit. The host system therefore expects to be able to access data from one or more logical storage units owned by the fenced node. Thus, existing quorum-based solutions may reduce the availability of cluster-managed resources to host systems.
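The availability problem described above can be seen in a small Python sketch of quorum-based fencing. This is a simplified illustration, not any particular product's algorithm. The names `select_quorum` and `fenced_resources`, the `owners` mapping, and the tie-breaking rule (the first largest partition wins) are assumptions made for the example.

```python
def select_quorum(partitions):
    """Return the partition with the most nodes.

    Assumption for this sketch: on a tie, the first largest partition listed wins.
    """
    return max(partitions, key=len)


def fenced_resources(partitions, owners):
    """Return the resources that hosts can no longer access after fencing.

    `owners` maps a resource name (e.g., a logical storage unit) to the node
    that owns or manages it. Resources owned by out-of-quorum nodes are blocked.
    """
    quorum = select_quorum(partitions)
    return {resource for resource, node in owners.items() if node not in quorum}


# Hypothetical cluster state: node n4 has been cut off from the other three nodes.
partitions = [{"n1", "n2", "n3"}, {"n4"}]
owners = {"lun0": "n1", "lun1": "n4"}
blocked = fenced_resources(partitions, owners)
```

In this hypothetical state, `lun1` is blocked solely because its owner `n4` is outside the largest partition. The selection never considers whether a host system is actively using `lun1`, which is the limitation noted above.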
With respect to these and other considerations, improvements are desirable for managing the availability of resources in a distributed storage environment or other clustered computing system that has experienced a partition or other intra-cluster communication disruption.