Redundant storage systems may include multiple storage devices connected by communications links in a network. The storage systems may be accessed by nodes in a cluster, which serve data in the storage devices to client devices. Nodes within the cluster may be organized into sites. Nodes within a site use a voting mechanism to select certain nodes as responsible for maintaining the site. For example, an elected node, or primary node, may monitor and distribute resources within the cluster. Each site may select one node as a primary node.
Conventional voting systems generally select primary nodes based on a majority vote of the nodes in the site. The nodes may be weighted, such that different nodes hold different numbers of votes.
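The majority rule with weighted votes may be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular system: the function name `has_majority`, the node names, and the weights are hypothetical, and a strict majority (more than half of all votes) is assumed.

```python
def has_majority(votes_by_node, reachable):
    """Return True if the reachable nodes hold a strict majority of all votes.

    votes_by_node: dict mapping node name -> number of votes (its weight).
    reachable: iterable of node names that can currently communicate.
    """
    total = sum(votes_by_node.values())
    held = sum(votes_by_node[n] for n in set(reachable))
    # Assumed rule: strict majority, i.e. more than half of all votes.
    return 2 * held > total


# Hypothetical weights: node "a" carries two votes, the others one each.
weights = {"a": 2, "b": 1, "c": 1, "d": 1}
print(has_majority(weights, ["a", "b"]))  # 3 of 5 votes -> True
print(has_majority(weights, ["c", "d"]))  # 2 of 5 votes -> False
```

With these weights, two nodes may outvote two others because one of them carries an extra vote.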
The conventional majority voting scheme handles a number of failure scenarios. For example, a cluster may include five nodes, in which a first physical site has three nodes and a second physical site has two nodes. Because the first site has three nodes, the first site is the primary site of the cluster. If the two nodes of the second physical site lose communications with the first physical site, then only a minority of the nodes of the cluster is lost. Under an unweighted voting scheme, the three remaining nodes still constitute a majority within the cluster, so decisions that require a majority of the nodes may still be made. Thus, the three remaining nodes may coordinate to redistribute resources previously served by the two disconnected nodes, and may remain functional even though two nodes have failed.
If, however, the first physical site, which is the primary site, fails, then the second physical site will also fail. Because the second physical site is not the primary site, the second physical site cannot recover from the loss of communications with the first physical site. Thus, a failure of three nodes in the cluster of five nodes causes the entire cluster to fail. But for the majority voting scheme, the two remaining nodes, which make up a minority of the cluster, could otherwise continue to serve resources to clients.
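The two five-node scenarios above may be sketched as follows. This is a minimal illustration, assuming one vote per node and a strict-majority rule; the function name `surviving_nodes` and the site and node names are hypothetical.

```python
def surviving_nodes(sites, failed_site):
    """Return the nodes that keep serving after one site is lost.

    sites: dict mapping site name -> list of node names (one vote each).
    failed_site: name of the site that fails or becomes unreachable.
    Under the assumed majority scheme, the survivors continue only if
    they hold more than half of all votes; otherwise they shut down too.
    """
    total = sum(len(nodes) for nodes in sites.values())
    remaining = [
        node
        for site, nodes in sites.items()
        if site != failed_site
        for node in nodes
    ]
    return remaining if 2 * len(remaining) > total else []


cluster = {"site1": ["n1", "n2", "n3"], "site2": ["n4", "n5"]}
print(surviving_nodes(cluster, "site2"))  # ['n1', 'n2', 'n3']
print(surviving_nodes(cluster, "site1"))  # []
```

Losing the two-node site leaves three working nodes, whereas losing the three-node site leaves none, even though two nodes are still physically able to serve data.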
In some conventional solutions, the cluster voting systems described above may incorporate a common quorum device to communicate, update, and exchange votes. In such an arrangement, each node is responsible for communicating with the quorum device to cast votes and read status updates regarding the state of the cluster. Such a quorum device is also a single point of failure for large clusters.
It is undesirable for even a large failure affecting the majority of the nodes to result in a failure of the cluster. For example, in a cluster having a first physical site with 51 nodes and a second physical site with 49 nodes, a failure of the 51 nodes of the first physical site would cause a shutdown of the remaining 49 nodes, which could otherwise continue to function. The remaining nodes are unable to function because they do not have information regarding communications paths in the cluster. Instead, only the primary node elected by the majority has information regarding resources within the cluster. Thus, a more flexible management system is necessary to properly utilize resources in redundant storage systems.
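The arithmetic behind the 51/49 example may be checked with a short sketch, assuming unweighted votes and a strict-majority rule; the function name `majority_threshold` is hypothetical.

```python
def majority_threshold(total_nodes):
    """Smallest number of unweighted nodes that forms a strict majority."""
    return total_nodes // 2 + 1


total = 51 + 49                     # nodes across both physical sites
needed = majority_threshold(total)  # votes required to keep operating
print(needed)                       # 51
print(49 >= needed)                 # False: the 49 survivors fall short
```

The 49 surviving nodes miss the threshold by two votes, so nearly half of the cluster's capacity is idled by the loss of the other site.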