A cluster is a group of interconnected processing devices, for instance a group of computers or servers, which can share data and other system resources. Each device in the cluster, also referred to as a node or a cluster member, can be configured to run one or more shared applications, resulting in a network of nodes that has increased reliability over single-node networks with respect to these applications. A cluster manager, instances of which run on each cluster member, is used to control the cluster, with the aim of ensuring that the cluster remains operational to the largest extent possible whilst preventing situations that could jeopardize the integrity of shared data.
Single-instance cluster applications run on only one cluster member at a time. To make this type of application highly available, the cluster manager provides a mechanism for starting the application on another cluster member in the event that the current member can no longer run the application. Multi-instance applications can run on multiple cluster members at the same time. A multi-instance application, by definition, is highly available because the failure of one cluster member does not affect the instances of the application running on other members.
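The failover mechanism for a single-instance application can be sketched as below. This is only an illustration under stated assumptions: the function name `select_host`, the node names, and the policy of choosing the first healthy member in configured order are all hypothetical and are not taken from any particular cluster manager.

```python
from typing import Optional


def select_host(current_host: str, healthy_members: list[str]) -> Optional[str]:
    """Pick the member that should run a single-instance application.

    Illustrative only: a real cluster manager would also check resource
    availability and the application's placement rules.
    """
    if current_host in healthy_members:
        # The current member can still run the application: no failover.
        return current_host
    # Assumed policy: fail over to the first healthy member in configured order.
    return healthy_members[0] if healthy_members else None
```

For example, `select_host("node1", ["node2", "node3"])` returns `"node2"`, whereas an empty list of healthy members yields `None` because there is nowhere to restart the application.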
One problem with known cluster arrangements is that, when a cluster partition occurs, for instance due to inter-node communication link failures, multiple sub-groups of nodes can be formed, each attempting to reform a new cluster having the same external identity as the original cluster. This can have serious consequences for the integrity of shared data, for instance when more than one sub-group attempts to run the same single-instance application.
To ensure data integrity, cluster managers have been developed that operate a voting scheme to determine which sub-group will form the new cluster and to prevent the remaining sub-groups from forming clusters. Votes are allocated to each sub-group based on the number of nodes in the sub-group. The number of votes required for forming the new cluster, referred to as obtaining ‘quorum’, is generally required to be at least half of the original votes available, such that the cluster can be reformed only by the largest sub-group(s). To cope with the situation in which two sub-groups have equal votes, referred to as the ‘split-brain’ scenario, an arbitration device is provided, an example of which is a quorum server connected to all nodes in the cluster. The quorum server acts as a virtual cluster member having one vote. Therefore, following a cluster partition into two equally-sized sub-groups, the quorum server allocates its vote to one of the sub-groups, allowing that sub-group to achieve quorum and reform the cluster, while the other sub-group is denied quorum and cannot start a cluster.
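The voting rule described above can be sketched in Python as follows. This is a minimal illustration rather than any particular cluster manager's implementation: it assumes one vote per node, and the function name `can_reform_cluster` and the flag `has_quorum_server_vote` are hypothetical. It treats a sub-group holding more than half of the original votes as having quorum outright, and a sub-group holding exactly half as needing the single vote of the quorum server.

```python
def can_reform_cluster(subgroup_nodes: int, original_nodes: int,
                       has_quorum_server_vote: bool = False) -> bool:
    """Decide whether a sub-group may reform the cluster (illustrative only).

    Assumes one vote per node and a quorum server holding one extra vote.
    """
    if 2 * subgroup_nodes > original_nodes:
        # Clear majority of the original votes: quorum without arbitration.
        return True
    if 2 * subgroup_nodes == original_nodes:
        # Split-brain tie: only the sub-group granted the quorum server's
        # vote may reform the cluster.
        return has_quorum_server_vote
    # Fewer than half of the original votes: quorum is denied.
    return False
```

In a four-node cluster split into two pairs, `can_reform_cluster(2, 4, has_quorum_server_vote=True)` is `True` for the sub-group that receives the arbitration vote and `can_reform_cluster(2, 4)` is `False` for the other.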
However, even in arrangements having a quorum server, a sub-group can reform the cluster only if it is made up of at least half of the nodes that were present in the original cluster. This requirement ensures that a group of nodes that becomes separated from both the remaining nodes in the cluster and the quorum server cannot reform the cluster unless it has a clear majority of the nodes, thus preventing more than one sub-group from reforming the cluster.
Accordingly, in conventional cluster arrangements, the cluster can be prevented from continuing in cases where such prevention is not necessary, for instance when the cluster is partitioned, as a result of one or more failures, into more than two sub-groups. Therefore, in conventional systems, high cluster availability may be compromised to preserve data integrity.
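As a worked illustration of this limitation, the sketch below applies the conventional requirement that a sub-group contain at least half of the original nodes to a few example partitions. The function name `eligible_subgroups` and the partition sizes are hypothetical, and the sketch assumes that every original node ends up in exactly one sub-group.

```python
def eligible_subgroups(sizes: list[int]) -> list[int]:
    """Return the sizes of sub-groups holding at least half of the original nodes.

    Illustrative only: assumes one vote per node and that the listed
    sub-groups together account for every node of the original cluster.
    """
    original = sum(sizes)
    return [size for size in sizes if 2 * size >= original]


print(eligible_subgroups([2, 2]))     # [2, 2]: tie, resolved by the quorum server
print(eligible_subgroups([2, 1, 1]))  # [2]: one sub-group holds half of four nodes
print(eligible_subgroups([2, 2, 1]))  # []: no sub-group holds half of five nodes
```

In the last case a five-node cluster splits into sub-groups of two, two and one nodes. No sub-group reaches the threshold, so under the conventional rule the cluster cannot continue.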