A common type of network that is deployed in organizations is the client-server network. In a client-server network, there may be a number of client computing devices, or clients, which are typically used by end users of an organization, and a number of server computing devices, or servers, which are computing devices that are shared among the clients, and thus the users. Types of servers can include application servers, file servers, intranet servers, e-mail servers, electronic commerce servers, print servers, proxy servers, and web servers, among other kinds of servers.
To leverage the servers within a network, the servers may work together as a cluster. Clustering generally refers to multiple servers that are linked together in order to handle variable workloads or to provide continued operation in the event one fails. Each server may be a multiprocessor system itself. A cluster of servers can provide fault tolerance, load balancing, or both. Fault tolerance means that if one server fails, one or more additional servers are still available. Load balancing distributes the workload over multiple servers.
In a given cluster of servers, usually one of the servers is assigned or appointed the leader of the cluster. The leader of the cluster may be statically determined a priori by a network administrator, or, as is more common, may be dynamically determined among the servers themselves during startup. In the latter scenario, each of the servers may upon startup determine whether there is a leader of the cluster, and if there is no leader, try to become the leader of the cluster. Once one of the servers has established leadership, the other servers of the cluster stop attempting to acquire leadership.
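The dynamic startup behavior described above can be sketched as follows. This is only an illustration: the `ClusterState` class, the `try_become_leader` method, and the `start_cluster` helper are hypothetical names, and an in-process lock stands in for whatever mechanism a real cluster would use to agree on leadership, since real servers do not share memory.

```python
import threading


class ClusterState:
    """Hypothetical in-process stand-in for the cluster's shared view of leadership."""

    def __init__(self):
        self._guard = threading.Lock()
        self.leader = None

    def try_become_leader(self, server_id):
        """Claim leadership only if no leader exists; report whether this server leads."""
        with self._guard:
            if self.leader is None:
                self.leader = server_id
            return self.leader == server_id


def start_cluster(server_ids):
    """Start every server concurrently and return (leader, per-server outcome)."""
    state = ClusterState()
    outcome = {}

    def startup(server_id):
        # Each server checks for a leader and, finding none, tries to become it.
        outcome[server_id] = state.try_become_leader(server_id)

    threads = [threading.Thread(target=startup, args=(sid,)) for sid in server_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state.leader, outcome


leader, outcome = start_cluster(range(5))
print("leader:", leader)
print("servers that believe they lead:", [s for s, won in outcome.items() if won])
```

Whichever server reaches the check first becomes the leader, and every later attempt sees an existing leader and stops.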
After startup, however, leadership of a cluster of servers may need to be redetermined. The current leader of the cluster may fail in such a way that it can no longer be the leader. For example, such a server may crash, or its communication link with the other servers may fail. Often a cluster of servers may fail where the servers are undesirably divided, or partitioned, into two or more groups, or partitions, that are unable to communicate with one another. For example, a switch or other type of networking device connecting all the servers of a cluster together may fail in such a way that the servers are effectively divided, or separated, into two or more such groups.
When a cluster of servers becomes divided into two or more groups that are unable to communicate with one another, leadership of the cluster is usually redetermined. In effect, one of the groups of servers becomes the acting cluster, whereas the servers of the other groups no longer participate in the cluster. The group of servers that becomes the effective, or acting, cluster has one of its servers become the leader of the cluster. Stated another way, the server that becomes the new leader of the cluster effectively causes the cluster to be redefined as those servers that are part of the group of servers that includes the new leader.
Different protocols exist to determine which server becomes the new leader of a cluster when the cluster becomes divided into two or more separate groups. In one common approach, each server sends network messages to the other servers to determine the size of the group, or partition, of which the server is now a part. The servers of the group that includes a majority of the servers of the cluster then send network messages to one another to appoint a new leader of the cluster.
For example, a cluster of ten servers may become divided into one group of four servers and another group of six servers. By communicating with one another, the servers each determine that they are part of either the former group or the latter group. Because the servers know that there were originally ten servers within the cluster, the servers that conclude that they are part of the group of six servers send network messages to one another to appoint a new leader of the cluster. The four servers that are not part of the new acting cluster generally do not perform any further activity or functionality until the fault that resulted in the division of the original cluster is corrected.
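The ten-server example can be sketched from the point of view of a single server. The names `CLUSTER_SIZE`, `partition_size`, and `in_majority_group` are hypothetical, and the set of reachable peers is assumed to have already been gathered by exchanging network messages.

```python
CLUSTER_SIZE = 10  # original membership, assumed known to every server


def partition_size(reachable_peers):
    """Size of this server's group: the peers that answered, plus the server itself."""
    return len(reachable_peers) + 1


def in_majority_group(reachable_peers, cluster_size=CLUSTER_SIZE):
    """True if this server's group holds a strict majority of the original cluster."""
    return 2 * partition_size(reachable_peers) > cluster_size


# Servers 0-5 can still reach one another; servers 6-9 can still reach one another.
server_0_peers = {1, 2, 3, 4, 5}
server_6_peers = {7, 8, 9}

print(in_majority_group(server_0_peers))  # True: 6 of 10, so this group appoints a leader
print(in_majority_group(server_6_peers))  # False: 4 of 10, so this group stands down
```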
However, so-called majority-based network-messaging protocols are not effective in many situations. A cluster of servers may become divided into groups that have the same number of servers. In the previous example, for instance, the cluster of ten servers may instead become divided into two groups of five servers. In such an instance, majority-based protocols have no way to determine which group of servers should become the dominant group within the cluster, and thus from which group a leader should be appointed for the cluster. Majority-based protocols are also ineffective for clusters of two servers, since such clusters can inherently be divided only into two groups of a single server apiece.
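The failure cases can be checked with a small sketch. The function `majority_group` is a hypothetical helper, and it assumes that every server of the original cluster is still running in exactly one group, so that the cluster size equals the sum of the group sizes.

```python
def majority_group(group_sizes):
    """Return the index of the group holding a strict majority, or None if no group does."""
    cluster_size = sum(group_sizes)
    for index, size in enumerate(group_sizes):
        if 2 * size > cluster_size:
            return index
    return None


print(majority_group([6, 4]))  # 0: the six-server group wins
print(majority_group([5, 5]))  # None: an even split has no majority
print(majority_group([1, 1]))  # None: a divided two-server cluster never has one
```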
Majority-based protocols may further be undesirable when the number of servers is not the most important factor in sustaining a divided cluster. For example, a cluster of ten servers may have been responsible for the processing needs of one hundred clients. Where the cluster becomes divided into one group of six servers and another group of four servers, the cluster division may also have resulted in ninety of the clients being connected only to the group of four servers and ten of the clients being connected only to the group of six servers. Assuming that all the clients are of equal importance, it would be undesirable to redefine the cluster as the group of six servers, since this larger group of servers only is able to serve ten clients, whereas the smaller group of four servers is able to serve ninety clients.
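The difference between the two criteria can be made concrete with the numbers from this example. The dictionary layout and the helper names `pick_by_servers` and `pick_by_clients` are illustrative assumptions, not part of any particular protocol.

```python
def pick_by_servers(groups):
    """Choose the group with the most servers, as a majority-based protocol would."""
    return max(groups, key=lambda group: group["servers"])


def pick_by_clients(groups):
    """Choose the group that can still serve the most clients."""
    return max(groups, key=lambda group: group["clients"])


groups = [
    {"name": "group of six", "servers": 6, "clients": 10},
    {"name": "group of four", "servers": 4, "clients": 90},
]

print(pick_by_servers(groups)["name"])  # group of six
print(pick_by_clients(groups)["name"])  # group of four
```

Selecting by server count keeps only ten of the one hundred clients in service, whereas selecting by reachable clients keeps ninety.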
Furthermore, prior art non-majority-based, non-network-messaging protocols have their own drawbacks. Such protocols may be storage-based, in that they appoint leaders of clusters by having the servers of a given cluster write to disk sectors of a storage, like a hard disk drive or a storage-area network (SAN). The Small Computer System Interface (SCSI) 2 specification provides for such a storage-based protocol, but it does not ensure persistent locking. Persistent locking means that once a lock corresponding to cluster leadership has been acquired by a given server or node, that server is guaranteed to retain the lock unless and until cluster leadership needs to be redetermined. For instance, within the storage-based protocol of the SCSI 2 specification, power cycling of the storage system can cause a loss of lock acquisition by one of the servers within the cluster, even if a new leader for the cluster does not have to be redetermined. By comparison, the SCSI 3 specification provides a storage-based protocol that ensures persistent locking. However, this protocol requires consistent implementation by storage vendors, which does not occur with regularity, and thus is not a mature technology. As such, the protocol can cause problems when heterogeneous SAN-based storages are used that have storage devices from different vendors.
Other prior art storage-based protocols are based on Leslie Lamport's “A Fast Mutual Exclusion Algorithm,” as published in the February 1987 issue of the ACM Transactions on Computer Systems. Storage protocols that directly use Lamport's algorithm cannot be employed within the context of storage-area networks (SAN's), which limits their usefulness. The reason is that Lamport's mutual exclusion algorithm requires an upper bound on input/output (I/O) reads and writes, that is, an upper bound on the length of time a given read or write will take, whereas SAN's do not provide for such an upper bound.
A limited solution is to use the length of time it takes for a SCSI timeout as the upper bound. A timeout is an intentional ending to an incomplete task. For instance, if a requesting node issues a read or a write request to a SCSI hard disk drive, and if confirmation of that request is not received from the SCSI hard disk drive within a given period of time, or “timeout,” then the node assumes that the SCSI hard disk drive did not receive or could not complete the given request. By timing out after this given period of time, the requesting node thus does not wait indefinitely for the confirmation of the request from the SCSI hard disk drive. However, SCSI timeouts are usually on the order of thirty seconds, and can vary by hard disk drive vendor, which means that such protocols can take an undesirably long time to select the leader of a cluster.
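The timeout behavior itself can be illustrated with a generic sketch. It does not use any SCSI interface: `issue_request` is a hypothetical helper, the I/O call is simulated by an ordinary Python function, and the timeouts are fractions of a second rather than the thirty seconds mentioned above.

```python
import concurrent.futures
import time


def issue_request(io_call, timeout_seconds):
    """Run io_call in a worker thread and stop waiting after timeout_seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(io_call)
    try:
        return ("ok", future.result(timeout=timeout_seconds))
    except concurrent.futures.TimeoutError:
        # No confirmation arrived in time; assume the request was not completed.
        return ("timeout", None)
    finally:
        pool.shutdown(wait=False)


def fast_read():
    return "sector data"


def stalled_read():
    time.sleep(0.3)  # simulated device that answers too late
    return "late data"


print(issue_request(fast_read, timeout_seconds=1.0))
print(issue_request(stalled_read, timeout_seconds=0.05))
```

Any algorithm that treats the timeout as its I/O upper bound must be prepared to wait that long at each bounded step, which is why a thirty-second timeout makes leader selection slow.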
Other prior art storage-based protocols have adapted Lamport's algorithm for SAN's. One such protocol adapts Lamport's algorithm to use as many sectors of a storage as there are servers, or nodes, in the cluster. This solution does not scale well in terms of storage space used, however, since an inordinately large number of disk sectors, and thus an inordinately large amount of storage space, may be required. Another adaptation uses two sectors as the original Lamport algorithm does, and increases various predetermined delays in the algorithm in which nodes wait for other nodes to acquire the lock on cluster leadership. Such protocols treat these increased delays as disk leases, in which a given node is said to be the current leaseholder of a sector of a disk, and is the only node allowed to write to that sector, while it maintains the disk lease for that disk. However, such adaptations of Lamport's algorithm suffer from the problem of one node overwriting what has been written by another node at the penultimate moment prior to acquiring the lock on the leadership of the cluster, which can result in two nodes each believing that it is the cluster leader. Using larger delays of the order needed by disk leases also requires tuning for every different storage type and SAN configuration.
Furthermore, protocols based on Lamport's mutual exclusion algorithm do not guarantee that a cluster leader will be selected should most of the servers within the cluster fail or crash. Protocols based on Lamport's algorithm also do not provide sustained locking semantics. Sustained locking semantics are semantics, or methodologies or approaches, that a lock-holding server, as the leader of a cluster, is to periodically perform to maintain acquisition of the lock, and thus to sustain its leadership of the cluster. Sustained locking semantics are needed due to the potential of overwriting disk sectors when multiple servers, or nodes, can asynchronously access the sectors of the disks in the same shared storage. Such protocols thus do not force the leader of a cluster to assert and maintain its leadership of the cluster, which is undesirable.
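The idea of a leader periodically acting to keep its lock can be illustrated with a simplified lease. This is a sketch under strong assumptions and is not a protocol described in this document: `LeaseSector` is a hypothetical in-memory object standing in for a shared disk sector, time is passed in explicitly, and the asynchronous overwrites that make real shared storage difficult are ignored.

```python
class LeaseSector:
    """Hypothetical in-memory stand-in for a disk sector holding the leadership lock."""

    def __init__(self):
        self.owner = None
        self.expires_at = 0.0

    def try_acquire(self, node, now, lease_seconds):
        """Take the lock if it is free or if the previous holder let it lapse."""
        if self.owner is None or now >= self.expires_at:
            self.owner = node
            self.expires_at = now + lease_seconds
            return True
        return False

    def renew(self, node, now, lease_seconds):
        """Extend the lock; only the current holder may do so, and only before expiry."""
        if self.owner == node and now < self.expires_at:
            self.expires_at = now + lease_seconds
            return True
        return False


sector = LeaseSector()
print(sector.try_acquire("A", now=0, lease_seconds=10))   # True: A becomes leader
print(sector.try_acquire("B", now=5, lease_seconds=10))   # False: A still holds the lock
print(sector.renew("A", now=8, lease_seconds=10))         # True: A asserts its leadership
print(sector.try_acquire("B", now=12, lease_seconds=10))  # False: renewed until time 18
print(sector.try_acquire("B", now=20, lease_seconds=10))  # True: A stopped renewing
print(sector.renew("A", now=21, lease_seconds=10))        # False: A is no longer leader
```

A leader that stops renewing loses the lock once the lease lapses, so leadership has to be actively maintained rather than acquired once and assumed thereafter.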
For these and other reasons, therefore, there is a need for the present invention.