1. Field of the Invention
The present invention relates to computer clusters and quorum determination methods for cluster partition recovery. More particularly, the invention concerns a quorum determination technique that takes into account server node application state information in addition to conventional cluster majority considerations, and which provides an interface whereby quorum determination rules can be programmed by cluster applications.
2. Description of the Prior Art
By way of background, managed data processing clusters are commonly used to implement the server tier in a client-server architecture. Instead of a single server providing application services to clients, application service functions are shared by an interconnected network of server nodes (server cluster) operating cooperatively under the control of cluster management software. Responsibilities of cluster management software commonly include the coordination of cluster group membership changes, fault monitoring and detection, and providing the server node application layers with distributed synchronization points so that the servers can implement a cluster application tier that provides a clustered service. Clustered services are advantageous because plural server nodes can share application workloads and thus improve data processing performance. Even if the server nodes run individual applications and do not share application workloads, the loss of a server node will not ordinarily bring down its applications because the cluster management software can transfer the lost server's functions to another server node. Exemplary applications that can be run by a server cluster include network file systems, distributed databases, web servers, email servers, and many others.
Notwithstanding the enumerated advantages of server clusters, such networks are prone to a phenomenon known as “partitioning” wherein there is a failure of a cluster server node or a communication link between server nodes that disrupts cluster operations. As its name implies, partitioning means that the cluster server nodes have lost the ability to interoperate as a single group and instead divide into two or more separately functioning subgroups. This creates problems because each subgroup acts without regard to the other and data corruption can result if the subgroups attempt to run the same applications or control the same devices (such as data storage systems). In order to properly recover from a partition event, it is usually necessary to allow only one of the functioning subgroups to continue server operations, while all other subgroups are deactivated from service until the problem that caused the partitioning is resolved.
The conventional technique used to recover functionality in a partitioned cluster is to perform a quorum management operation that attempts to identify the largest remaining subgroup. In a typical quorum management scheme, each cluster server node is assigned a number of votes. Following partitioning, all of the operational server nodes within each subgroup respectively pool their votes. The subgroup that has the most votes is permitted to form a new cluster and assume all server duties. In the event of a tie, a quorum resource, such as a shared data storage device whose access is not impacted by the fault that induced the partition, and which can be seen by all subgroups, can be used as a “tie breaker.” The first operational subgroup to acquire a lock on the quorum resource is given an extra vote, and thereby determined to have a quorum.
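By way of illustration only, the conventional vote-pooling scheme described above might be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from any particular cluster manager: the names `pooled_votes`, `determine_quorum`, and `try_lock_quorum_resource` are hypothetical, and the order in which tied subgroups attempt the lock stands in for the real-time race to acquire the quorum resource.

```python
from typing import Callable, Dict, List, Optional


def pooled_votes(subgroup: List[str], votes: Dict[str, int]) -> int:
    """Sum the votes of the operational server nodes in one subgroup."""
    return sum(votes[node] for node in subgroup)


def determine_quorum(
    subgroups: List[List[str]],
    votes: Dict[str, int],
    try_lock_quorum_resource: Callable[[int], bool],
) -> Optional[int]:
    """Return the index of the subgroup that wins the quorum, or None.

    try_lock_quorum_resource is a hypothetical callback: it receives a
    subgroup index and returns True if that subgroup acquired the lock
    on the shared quorum resource (the tie breaker).
    """
    totals = [pooled_votes(group, votes) for group in subgroups]
    best = max(totals)
    leaders = [i for i, total in enumerate(totals) if total == best]

    # A single subgroup with the most votes wins outright.
    if len(leaders) == 1:
        return leaders[0]

    # Tie: the first tied subgroup to lock the quorum resource gains the
    # extra vote. Iteration order models the race (an assumption).
    for index in leaders:
        if try_lock_quorum_resource(index):
            return index
    return None  # No subgroup could reach the quorum resource.
```

For example, with one vote per node, a three-node subgroup prevails over a two-node subgroup without consulting the quorum resource, whereas two two-node subgroups are separated only by whichever one acquires the lock. Note that nothing in this sketch consults client connections, application priority, or external dependencies.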
A disadvantage of current quorum management techniques is that they do not take into account the operational state of each subgroup relative to its application tier, such as the number of connected clients, the applications being served, the ability to satisfy external resource dependencies, subgroup processing capability, memory availability, I/O (Input/Output) resource availability, etc. The failure to consider such information can have adverse consequences. For example, there will be unacceptable disruption of end-to-end application service availability if cluster recovery results in a majority (or even all) of the application clients ending up on the wrong side of the partition (i.e., connected to a subgroup that does not have a quorum and unable to communicate with the subgroup that does have the quorum). Serious consequences can also result if the partitioned subgroups service applications with differing availability requirements (e.g., low importance applications versus a high priority business critical application), and a quorum is denied to the subgroup running the high priority application simply because the high priority application runs on a server node in a minority subgroup. The manageability of a recovered cluster will likewise be compromised if the original cluster relied on an external service such as a directory or administration server (e.g., for managing user and authentication information) and a quorum is won by a subgroup that does not have access to this external resource. Ignoring information such as the aggregate subgroup processing capability, memory availability, and I/O resource availability can also result in less than optimal partition recovery.
It is to improvements in cluster quorum determination techniques that the present invention is directed. In particular, what is needed is a quorum determining methodology that takes into account factors beyond the traditional majority approach when recovering a partitioned cluster. More specifically, it would be desirable to move away from a cluster-centric approach, wherein quorum determination solutions are dictated solely by cluster management concerns, to solutions that take into account the needs of cluster applications and their clients.