Computing clusters may provide various advantages over non-clustered computing systems. For example, clusters may provide services with high availability, enable load balancing for large computational workloads, and/or provide more computational power with less powerful constituent nodes.
A computing cluster may include one or more resources that are shared between nodes within the cluster, such as one or more storage devices. Thus, multiple nodes within the cluster may access the same data. By working in concert, the nodes of a cluster may safely access and alter data for clustered applications. The nodes of a cluster may also work in concert to reconfigure the cluster itself. For example, if a node in a cluster fails, other nodes in the cluster may attempt to take on the responsibilities of the failed node.
However, under some conditions a cluster may become partitioned into two or more sub-clusters that can no longer communicate with one another yet still have access to shared cluster resources. In this case, each sub-cluster may attempt to take over the responsibilities of the entire cluster without coordinating with the other sub-clusters. This may lead to undesirable results, such as corruption of shared cluster resources. To handle such “split-brain” scenarios, a conventional computing cluster may be configured to make a fencing decision that (1) selects one of the sub-clusters to take over the responsibilities of the entire cluster and (2) isolates the other sub-clusters from the shared cluster resources.
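The conventional fencing decision described above can be illustrated with a minimal Python sketch. It models a partition as a list of sub-clusters, each a set of node names, and applies one hypothetical policy: the largest sub-cluster survives and every other sub-cluster is fenced, with ties broken by the smallest node name. The policy and the name `make_fencing_decision` are illustrative assumptions only. Real clusters may arbitrate in other ways, for example through coordinator disks or quorum servers.

```python
from typing import FrozenSet, List, Tuple

SubCluster = FrozenSet[str]


def make_fencing_decision(
    sub_clusters: List[SubCluster],
) -> Tuple[SubCluster, List[SubCluster]]:
    """Select one surviving sub-cluster and fence (isolate) all others.

    Hypothetical policy: the sub-cluster with the most nodes wins; ties
    are broken by the lexicographically smallest node name it contains.
    """
    if not sub_clusters or any(len(sc) == 0 for sc in sub_clusters):
        raise ValueError("need at least one non-empty sub-cluster")
    # Sort key: more nodes first (negated size), then smallest member name.
    winner = min(sub_clusters, key=lambda sc: (-len(sc), min(sc)))
    # Every other sub-cluster is isolated from the shared resources.
    fenced = [sc for sc in sub_clusters if sc != winner]
    return winner, fenced


# A five-node cluster split into two sub-clusters.
partition = [frozenset({"n1", "n2"}), frozenset({"n3", "n4", "n5"})]
winner, fenced = make_fencing_decision(partition)
print(sorted(winner))                 # ['n3', 'n4', 'n5']
print([sorted(sc) for sc in fenced])  # [['n1', 'n2']]
```

Whatever policy is used, the decision must be deterministic, so that every sub-cluster would agree on the outcome if it could observe the same partition.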
Unfortunately, some clustered computing systems may include multiple clusters and, in some cases, clusters that depend on other clusters to operate (e.g., a cluster of virtual machines may depend on the cluster of hypervisors on which it runs). If each cluster within such a system makes its fencing decisions independently, those decisions may be incompatible; for example, together they may cause the complete loss of high availability for applications running inside dependent clusters. Accordingly, the instant disclosure identifies a need for efficiently and effectively managing sub-clusters within dependent clustered computing systems after partition events.
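The following Python sketch shows how two independent fencing decisions can be incompatible. It assumes a hypothetical three-host hypervisor cluster and a four-VM cluster running on it, and applies the same illustrative "largest sub-cluster wins" policy to each cluster separately. The host names, the VM placement, and the helper `surviving_vms_with_host` are assumptions made up for this example.

```python
from typing import Dict, FrozenSet, List

SubCluster = FrozenSet[str]


def largest_wins(sub_clusters: List[SubCluster]) -> SubCluster:
    """Hypothetical independent fencing policy: the largest sub-cluster
    survives; ties are broken by the smallest member name."""
    return min(sub_clusters, key=lambda sc: (-len(sc), min(sc)))


def surviving_vms_with_host(
    vm_winner: SubCluster, host_winner: SubCluster, placement: Dict[str, str]
) -> SubCluster:
    """VMs kept by the VM cluster's decision whose host was also kept."""
    return frozenset(vm for vm in vm_winner if placement[vm] in host_winner)


# The hypervisor cluster (h1..h3) partitions into {h1} and {h2, h3}.
host_partition = [frozenset({"h1"}), frozenset({"h2", "h3"})]
# Hypothetical placement: vm1..vm3 run on h1, vm4 runs on h2, so the
# VM cluster partitions along the same boundary.
placement = {"vm1": "h1", "vm2": "h1", "vm3": "h1", "vm4": "h2"}
vm_partition = [frozenset({"vm1", "vm2", "vm3"}), frozenset({"vm4"})]

# Each cluster decides on its own, without regard to the other.
host_winner = largest_wins(host_partition)  # keeps {h2, h3}, fences h1
vm_winner = largest_wins(vm_partition)      # keeps {vm1, vm2, vm3}, fences vm4

alive = surviving_vms_with_host(vm_winner, host_winner, placement)
print(sorted(alive))  # []
print(bool(alive))    # False: the two decisions are incompatible
```

Each decision is reasonable in isolation. However, the VM cluster keeps only VMs whose host the hypervisor cluster has fenced, and it has fenced the one VM whose host survives, so no VM in the dependent cluster remains available.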