In computing architecture, a cluster is a set of interconnected nodes, such as computers. A cluster is arranged to provide facilities such as load balancing across many simultaneous computational tasks, high availability of resources such as data or application programs, or parallel processing. High availability (HA) clusters are arranged to ensure the availability of the services provided by the cluster and may be implemented with redundant nodes that normally remain inactive. These inactive nodes are then used to provide the cluster services in place of any failed active nodes. The cluster management system for a given cluster is arranged to use the inherent redundancy in the cluster to eliminate single points of failure.
One problem that must be managed for clusters is partitioning. Partitioning occurs when one or more nodes fail or when the communication links between nodes fail. This can result in the cluster being effectively divided into two or more separate sub-clusters, or partitions. If not properly managed, cluster partitions might compete for cluster resources such as communication or data resources, which may result in data corruption or service degradation. Thus, when partitioning is detected, each cluster partition is arranged to determine whether it should provide the cluster's services or shut down in favor of another partition.
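As an illustration of how node or link failures divide a cluster, the following minimal sketch treats the cluster as a graph and groups the surviving nodes into partitions, where a partition is a set of nodes that can still reach one another. The function name `find_partitions`, the node labels, and the representation of links as pairs are assumptions made for this example only and do not describe any particular cluster management system.

```python
from collections import defaultdict


def find_partitions(nodes, links):
    """Group surviving nodes into partitions (connected components).

    nodes: set of surviving node names (hypothetical labels).
    links: iterable of (a, b) pairs for working communication links.
    """
    # Build an adjacency map, ignoring links that touch a failed node.
    adjacency = defaultdict(set)
    for a, b in links:
        if a in nodes and b in nodes:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    partitions = []
    # Visit nodes in sorted order so the result order is deterministic.
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first search collects every node reachable from start.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        partitions.append(component)
    return partitions
```

For example, a four-node chain a, b, c, d that loses the link between b and c yields two partitions: calling `find_partitions({"a", "b", "c", "d"}, [("a", "b"), ("c", "d")])` groups a with b and c with d.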
One method for determining which partition should survive such reorganization is a quorum system in which a single partition is selected to provide the cluster's services. In some quorum systems, nodes are assigned different weights and the sum of the node weights for each partition is used to determine which partition provides the cluster's services. One problem with such quorum systems is that partitions may occur with equal numbers of nodes or equal sums of node weights. In this situation, unless an additional management system is provided to break the tie, the cluster partitions may compete with each other, resulting in data corruption or service degradation. The additional management system increases the time taken to reorganize the cluster and the processing burden on each node.
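The tie problem described above can be made concrete with a simplified sketch of a weighted quorum decision. It assumes that node weights are combined by summation and that the partition with the strictly greatest total weight survives; the function name `select_surviving_partition`, the weight values, and the global view of all partitions are hypothetical simplifications for illustration, not a description of any specific quorum system.

```python
def select_surviving_partition(partitions, weights):
    """Pick the partition with the strictly greatest total node weight.

    partitions: non-empty list of sets of node names.
    weights: dict mapping each node name to its assigned weight.
    Returns the index of the winning partition, or None when two or
    more partitions tie for the greatest total (no clear winner).
    """
    # Total weight of each partition (assumed here to be a sum).
    totals = [sum(weights[node] for node in part) for part in partitions]
    best = max(totals)
    winners = [i for i, total in enumerate(totals) if total == best]
    # A unique maximum identifies the survivor; otherwise it is a tie.
    return winners[0] if len(winners) == 1 else None
```

With weights of 3 for node a and 1 for each of b, c and d, the partitions {a, b} and {c, d} have totals of 4 and 2, so the first partition is selected. With all weights equal to 1, both totals are 2 and the function returns `None`, which corresponds to the tie-break situation that a plain quorum comparison cannot resolve on its own.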