The invention relates generally to computer network servers, and more particularly to computer servers arranged in a server cluster.
A server cluster ordinarily comprises a group of at least two independent servers connected by one or more networks and utilized as a single system. The clustering of servers provides a number of benefits over independent servers. One important benefit is that cluster software, which is run on each of the servers in a cluster, automatically detects application failures or the failure of another server in the cluster. Upon detection of such failures, failed applications and the like can be terminated and restarted on a surviving server.
Other benefits of clusters include the ability for administrators to inspect the status of cluster resources, and accordingly balance workloads among different servers in the cluster to improve performance. Such manageability also provides administrators with the ability to update one server in a cluster without taking important data and applications offline for the duration of the maintenance activity. As can be appreciated, server clusters are used in critical database management, file and intranet data sharing, messaging, general business applications and the like.
When operating a server cluster, the cluster operational data (i.e., state) of any prior incarnation of a cluster needs to be known to the subsequent incarnation of a cluster, otherwise critical data may be lost. For example, if a bank's financial transaction data are recorded in one cluster, but a new cluster starts up without the previous cluster's operational data, the financial transactions may be lost. To avoid this, prior clustering technology required that each server (node) possess its own replica of the cluster operational data on a private storage thereof, and that a majority of possible nodes (along with their private storage devices) of a cluster be operational in order to start and maintain a cluster. This ensured that at least one node in any given set of nodes in a cluster was common to any previous cluster and thus the cluster had at least one copy of the correct cluster operational data. Further, the majority (quorum) requirement ensures that only one incarnation of the cluster exists at any point in time, e.g., two non-communicating subsets of the cluster membership cannot form two different instances of the cluster at the same time.
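The majority requirement described above can be illustrated with a minimal sketch. The function name and parameters are hypothetical and chosen only for illustration; the point is that a strict majority test cannot be satisfied by two disjoint subsets of the same membership at once.

```python
def has_node_quorum(operational_nodes: int, total_possible_nodes: int) -> bool:
    """Return True if the operational nodes form a strict majority.

    Hypothetical helper: a strict majority (more than half) is what
    prevents two non-communicating subsets from both forming a cluster.
    """
    if total_possible_nodes <= 0:
        raise ValueError("total_possible_nodes must be positive")
    if not 0 <= operational_nodes <= total_possible_nodes:
        raise ValueError("operational_nodes out of range")
    return operational_nodes > total_possible_nodes // 2


# A 5-node cluster split 3/2: only the 3-node side may form the cluster.
# A 4-node cluster split 2/2: neither side has a majority.
```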
However, requiring a quorum of nodes in order to have a cluster has the drawback that a majority of the possible nodes of a cluster has to be operational in order to have a cluster. A recent improvement described in U.S. patent application Ser. No. 08/963,050, now U.S. Pat. No. 6,279,032 issued on Aug. 21, 2001 entitled “Method and System for Quorum Resource Arbitration in a Server Cluster,” assigned to the same assignee and hereby incorporated by reference herein in its entirety, provides the cluster operational data on a single quorum resource, typically a storage device, for whose exclusive possession cluster nodes arbitrate. Because the correct cluster operational data is on the quorum resource, a cluster may be formed as long as a node of that cluster has exclusive possession of the quorum resource. Also, this ensures that only one unique incarnation of a cluster can exist at any given time, since only one node can exclusively possess the quorum resource. The single quorum resource solution increases cluster availability, since at a minimum, only one node and the quorum resource are needed to have an operational cluster.
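Exclusive possession of a single quorum resource can be sketched as follows. This is an in-process stand-in, not the arbitration mechanism of the referenced patent: the class `QuorumResource` and its methods `try_reserve` and `release` are hypothetical names, and a real quorum resource would typically be a storage device reserved through a device-level mechanism rather than a Python lock.

```python
import threading
from typing import Optional


class QuorumResource:
    """Toy model of a resource that at most one node can possess."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._owner: Optional[str] = None

    def try_reserve(self, node_id: str) -> bool:
        """Attempt to take exclusive possession; True if node_id now owns it."""
        with self._lock:
            if self._owner is None:
                self._owner = node_id
            return self._owner == node_id

    def release(self, node_id: str) -> None:
        """Give up possession, but only if node_id is the current owner."""
        with self._lock:
            if self._owner == node_id:
                self._owner = None

    @property
    def owner(self) -> Optional[str]:
        with self._lock:
            return self._owner
```

Under this model, whichever node's `try_reserve` call succeeds may form the cluster, and every other node's attempt fails until the owner releases the resource.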
Another improvement is described in U.S. patent application Ser. No. 09/277,450, now U.S. Pat. No. 6,401,120 issued on Jun. 4, 2002 entitled “Method and System for Consistent Cluster Operational Data in a Server Cluster Using a Quorum of Replicas,” assigned to the same assignee and hereby incorporated by reference herein in its entirety. In this improvement, the quorum resource is not limited to a single resource, but rather is comprised of multiple replica members, and a cluster may be formed and continue to operate as long as one server node possesses a quorum (majority) of the replica members. In addition to increasing availability by requiring only one operational node to have a cluster, this increases reliability, since the quorum resource is replicated on a number of devices, whereby a single (e.g., disk) failure will not shut down the cluster.
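The quorum-of-replicas condition can be expressed as a short check. The function name and the use of string identifiers for replica members are assumptions made for illustration only.

```python
def owns_replica_quorum(owned: set[str], all_replicas: set[str]) -> bool:
    """Return True if a node possesses a strict majority of replica members.

    Hypothetical helper: `owned` is the set of replica members the node
    currently possesses, `all_replicas` is the full replica set.
    """
    if not all_replicas:
        raise ValueError("all_replicas must not be empty")
    return len(owned & all_replicas) > len(all_replicas) // 2


# With three replica members, losing one disk still leaves a quorum of two.
```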
In clustering technology, a problem sometimes arises when cluster nodes lose their ability to communicate with other cluster nodes, e.g., due to a communications failure or some other type of failure such as the crash of a node. When this occurs, the original cluster is partitioned into two or more subgroups of nodes, in which each subgroup cannot communicate with each other subgroup. Because there is no ability to communicate, a subgroup has no knowledge of the existence of other subgroups, e.g., whether a non-communicating node (or nodes) is a failed node or is operational but is in a subgroup that is isolated by a communications break. When a cluster is partitioned by the loss of communication with one or more nodes, the nodes in each operational subgroup run a protocol to determine which nodes are part of that subgroup.
In order to allow the cluster to continue operating following such a partitioning, one, but only one, of the subgroups needs to survive to represent the cluster, while other subgroups (if any) should halt operation and then attempt to rejoin the surviving subgroup. Formerly, this required a majority of the original (pre-partitioned) number of nodes, so it was simple for a subgroup to essentially count its nodes and determine whether it had enough to continue as the cluster. However, when using exclusive possession of a quorum resource as a tie-breaking mechanism to determine representation of the cluster, a majority of nodes is not a requirement. As a result, one or more subgroups may be capable of representing the cluster. For example, if the partition was caused by the failure of some nodes, only one subgroup may be operational, and thus that subgroup should attempt to represent the cluster. At other times, a partitioning may result in multiple subgroups remaining operational, each of which is capable of representing the cluster, even though only one subgroup is allowed to survive. In the case of multiple subgroups remaining operational, one subgroup may be preferred over the others as the choice to be the surviving subgroup, for example, because the preferred subgroup contains more nodes. Alternatively, several operational subgroups may be equally desirable candidates to survive. Because the subgroups are unable to communicate with one another, they cannot directly agree on which subgroup is the preferred choice to survive to represent the cluster.
Briefly, the present invention provides a method and system wherein following a partitioning of a cluster, each operational subgroup makes an attempt (via an elected leader node therein) to secure possession of the quorum resource that determines cluster representation, wherein the attempt is biased by a relative weight of the subgroup. The weight may be relative to the original cluster weight, or submitted as a bid that is relative to other possible subgroup weights. This ensures that every operational subgroup makes an attempt to represent the cluster, while at the same time enabling a subgroup that is better capable of representing the cluster to do so over lesser subgroups.
In one implementation, the biasing weight is determined solely by node count. Each subgroup's attempt to secure possession of the quorum resource is then delayed based on the number of nodes in the subgroup relative to the original cluster number, i.e., the more nodes in a subgroup, the shorter that subgroup's elected leader node delays before attempting to secure possession of (arbitrate for) the quorum resource. In this manner, the subgroup with the largest number of nodes will (ordinarily) survive to represent the cluster, since in general, the more nodes in a cluster, the “better” the cluster. Also, the delay time of a “guaranteed” best subgroup (e.g., one containing a majority of the cluster nodes) is preferably zero to expedite its representation of the cluster.
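One way the node-count-based delay might be computed is sketched below. The text does not specify a formula, so the linear scaling and the `max_delay_s` parameter are assumptions; the sketch only preserves the two stated properties, namely that larger subgroups wait less and that a majority subgroup does not wait at all.

```python
def arbitration_delay(subgroup_nodes: int, original_nodes: int,
                      max_delay_s: float = 10.0) -> float:
    """Seconds a subgroup's leader waits before arbitrating (hypothetical).

    Assumed linear rule: delay shrinks as the subgroup's share of the
    original cluster grows; a strict majority gets zero delay.
    """
    if not 0 < subgroup_nodes <= original_nodes:
        raise ValueError("subgroup_nodes must be in 1..original_nodes")
    if subgroup_nodes > original_nodes // 2:
        return 0.0  # "guaranteed" best subgroup arbitrates immediately
    return max_delay_s * (1.0 - subgroup_nodes / original_nodes)


# Example: an 8-node cluster partitioned into subgroups of 4, 3 and 1 nodes.
delays = {n: arbitration_delay(n, 8) for n in (4, 3, 1)}
```

With these assumed values the 4-node subgroup arbitrates first, so it would ordinarily win possession of the quorum resource before the smaller subgroups make their attempts.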
In other implementations, the relative weight of the subgroup may be determined by other criterion or criteria, which may include the number of nodes, and/or the subgroup's resources, such as the subgroup's non-volatile storage space, processing power, random access memory, and so forth. Each of the criteria may be weighted differently. Also, the attempt may be biased in another way, such as by having each subgroup submit a bid based on its relative weight to an entity with which the subgroups can communicate (generally the quorum resource itself) that selects an arbitration winner based on the bid.
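A multi-criteria weight and a bid-based selection might look like the following sketch. The criteria names, the numeric weights in `WEIGHTS`, and the tie-breaking rule are all assumptions for illustration; the text states only that criteria may be weighted differently and that a winner is selected based on the bids.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubgroupResources:
    nodes: int
    storage_gb: float
    cpu_cores: int
    ram_gb: float


# Hypothetical per-criterion weights; each criterion may count differently.
WEIGHTS = {"nodes": 4.0, "storage_gb": 0.01, "cpu_cores": 0.5, "ram_gb": 0.1}


def subgroup_weight(res: SubgroupResources,
                    weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of a subgroup's resources, used as its bid."""
    return sum(w * getattr(res, name) for name, w in weights.items())


def select_winner(bids: dict[str, float]) -> str:
    """Pick the subgroup with the highest bid.

    Ties are broken by the lexicographically smallest subgroup id,
    an arbitrary rule assumed here to keep the choice deterministic.
    """
    if not bids:
        raise ValueError("no bids submitted")
    return max(sorted(bids), key=bids.__getitem__)


a = SubgroupResources(nodes=2, storage_gb=500.0, cpu_cores=8, ram_gb=32.0)
b = SubgroupResources(nodes=3, storage_gb=100.0, cpu_cores=4, ram_gb=16.0)
winner = select_winner({"A": subgroup_weight(a), "B": subgroup_weight(b)})
```

Under these assumed weights, subgroup A outbids B despite having fewer nodes, because its storage, processing, and memory contributions outweigh B's extra node.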