1. Field of the Invention
The present invention relates to a method or apparatus for selecting a cluster in a group of nodes, and more particularly to assigning, identifying, and selecting a subgroup in a group of nodes.
2. Description of the Prior Art
Computer systems that need to be highly reliable, both in terms of service availability and data integrity, are commonly implemented using a cluster architecture. A cluster is made up of a group of interconnected computers (nodes) running cluster software which enables the group to behave like a single computer. The nodes communicate with each other via a set of network connections referred to as a cluster interconnect. A cluster will generally have shared data storage devices connected to the nodes via a shared storage bus. The cluster software running on each node is arranged so that, in the event of failure of any node in the cluster, the functions and services provided by the cluster are unaffected.
Failures can occur in the nodes themselves or in the cluster interconnect. In the event of a failure in the cluster interconnect, the cluster becomes split into subgroups of nodes, each unable to communicate with the other subgroups. In such circumstances, the cluster software is arranged to spontaneously reorganize the subgroups to form one or more new candidate clusters. The largest candidate cluster is self-selected to continue to provide the cluster functions and services. Each node knows the total number of nodes in the system, and this data is used by each candidate cluster to determine whether the number of nodes it contains makes it the largest cluster. However, if two candidate clusters are the same size, this method can result in more than one candidate cluster considering itself to be the largest. In this case, more than one cluster can access the cluster data set and compromise the integrity of that data.
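The tie described above can be illustrated with a minimal sketch. The decision rule is an assumption made for illustration only: a candidate cluster claims to be the largest when the nodes it cannot reach could not form a strictly larger group. The names `considers_itself_largest` and `claimants` are hypothetical and do not come from any particular cluster software.

```python
def considers_itself_largest(candidate_size: int, total_nodes: int) -> bool:
    """Assumed rule: claim the cluster role if the unreachable nodes
    could not form a strictly larger group than this candidate."""
    unreachable = total_nodes - candidate_size
    return candidate_size >= unreachable


def claimants(partition: list[set[str]]) -> list[set[str]]:
    """Return every candidate cluster that would self-select.

    Each candidate decides using only its own size and the known
    total node count, since it cannot communicate with the others.
    """
    total_nodes = sum(len(group) for group in partition)
    return [
        group for group in partition
        if considers_itself_largest(len(group), total_nodes)
    ]


# Uneven split of four nodes: exactly one candidate self-selects.
uneven = claimants([{"n1", "n2", "n3"}, {"n4"}])

# Even split of four nodes: both candidates self-select.
even = claimants([{"n1", "n2"}, {"n3", "n4"}])
```

Under this assumed rule the uneven split yields a single claimant, while the even split yields two, each of which would go on to access the shared data set.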
To deal with this problem, some systems use a predetermined hardware element, such as a disk drive, as a tie breaker. The chosen hardware element is connected to the shared storage bus and is thus connected to all nodes in the cluster. In the event of a failure in the cluster interconnect, the candidate cluster that first acquires access to the hardware element during the reorganization of nodes forms the cluster. In other words, given subgroups of the same size, the subgroup that is first in communication with the specified hardware element is chosen to continue as the cluster. However, using a hardware element in this way can increase the overall hardware cost of the cluster system. Also, accessing the hardware element increases network activity and processing complexity during the node reorganization process.
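The first-to-acquire behavior of such a tie breaker can be sketched as follows. This is a simplified, in-process model under stated assumptions: the class `TieBreakerDevice` and its `try_acquire` method are hypothetical, and a `threading.Lock` stands in for whatever atomic reservation mechanism a real shared device would provide.

```python
import threading
from typing import Optional


class TieBreakerDevice:
    """Hypothetical stand-in for a shared hardware element (e.g. a disk)
    that is reachable by every node over the shared storage bus."""

    def __init__(self) -> None:
        self._guard = threading.Lock()      # models atomic access to the device
        self._owner: Optional[str] = None   # candidate that acquired it first

    def try_acquire(self, candidate_id: str) -> bool:
        """Return True only for the candidate that reached the device first."""
        with self._guard:
            if self._owner is None:
                self._owner = candidate_id
            return self._owner == candidate_id

    @property
    def owner(self) -> Optional[str]:
        return self._owner


# Two equal-sized candidate clusters race for the device.
device = TieBreakerDevice()
first_result = device.try_acquire("candidate-A")
second_result = device.try_acquire("candidate-B")
```

In this sketch only the first caller is told to continue as the cluster; the second must stand down. The extra device and the extra access step correspond to the added hardware cost and reorganization overhead noted above.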