The present invention comprises a software-based communications architecture and associated software methods for establishing and maintaining a common membership among multiple, cooperating computers. A membership refers to a set of computers, called hosts herein, which are members of a cooperating group, called a cluster. (A computer refers to a typical computing system consisting of one or more CPUs, memory, disk storage, and network connections.) Each host maintains a list, called a membership list, that enumerates this host's view of the set of hosts within a cluster, also called the cluster set, and it modifies the list as it detects that other hosts have joined or left the group. A host can leave the group voluntarily, or it leaves involuntarily after it suffers a failure and ceases to communicate with other hosts. The membership lists of various cluster hosts may temporarily differ as notifications of cluster set changes propagate among the hosts.
All hosts need to quickly update their membership lists as cluster set changes occur so that the membership lists quickly converge to the same contents in all cluster hosts. This property, called coherency, of the cluster membership enables the cluster to closely coordinate its actions, for example, to partition and distribute a shared workload among cluster members and to redistribute the work as necessary after a cluster set change occurs. An example of a shared workload is a database table that is partitioned across a cluster set based on a lookup key for the purposes of parallel searching or sorting. At a minimum, it is necessary that the hosts be able to quickly detect any single host's failure and update their membership lists to re-establish their coherency and thereby repartition the shared workload. Quick recovery minimizes disruption in processing the workload. It is also desirable that the hosts be able to detect and recover from multiple, simultaneous host failures. A cluster membership that quickly detects and recovers from host failures so as to maintain useful processing of its workload is said to be highly available.
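By way of illustration only, the following Python sketch shows one way a shared workload could be partitioned by lookup key across the hosts in a membership list and repartitioned after a cluster set change. The function names `owner` and `partition`, the host names, and the hash-modulo placement rule are assumptions made for this example and are not taken from the description above.

```python
import hashlib


def owner(key, membership):
    """Return the host assumed to be responsible for `key`.

    Hypothetical placement rule: hash the key and take it modulo the
    number of hosts. The list is sorted first so that hosts holding the
    same membership in a different order still agree on the owner.
    """
    hosts = sorted(membership)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return hosts[int.from_bytes(digest[:8], "big") % len(hosts)]


def partition(keys, membership):
    """Assign every key to exactly one host in the membership list."""
    assignment = {host: [] for host in membership}
    for key in keys:
        assignment[owner(key, membership)].append(key)
    return assignment


if __name__ == "__main__":
    keys = ["row-%d" % i for i in range(12)]
    # Coherent membership lists yield the same partition on every host.
    print(partition(keys, ["host1", "host2", "host3"]))
    # After host2 leaves, the keys are redistributed among the survivors.
    print(partition(keys, ["host1", "host3"]))
```

The sketch is meant to show why coherency matters: two hosts compute the same owner for a key only when their membership lists have the same contents, so a stale list on one host could cause a key to be handled twice or not at all until the lists converge.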
Highly available clusters employ a variety of techniques to increase their tolerance to failures. For example, reliable communications protocols and redundant networking hardware (e.g., adapters, switches, and routers) reduce the likelihood of communications failures, and redundant storage components (e.g., RAID drives) reduce the probability of storage failures. However, host failures may still occur and must be quickly detected so that recovery can be initiated and useful processing can resume. A well-known technique for detecting host failures is to use periodic message exchanges, called heartbeat connections, between hosts to determine the presence and health of other hosts. Cluster hosts periodically (e.g., once per second) send and receive heartbeat messages using a computer network that is shared among the hosts. For example, FIG. 1 shows two hosts exchanging heartbeat messages over a heartbeat connection (also called a heartbeat link herein). Minimizing the period between heartbeat messages allows hosts to detect failures more quickly and to keep the cluster membership more coherent. However, heartbeat connections add networking and computational overheads, and shorter heartbeat periods have larger overheads.
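As a minimal sketch of heartbeat-based failure detection, the following Python fragment tracks when each peer was last heard from and reports peers whose heartbeats have stopped. The class name `HeartbeatMonitor`, the `missed_limit` parameter (how many consecutive periods may pass without a heartbeat before a peer is considered failed), and the use of caller-supplied timestamps are assumptions made for this example; the description above specifies only that heartbeat messages are exchanged periodically.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat received from each peer host."""

    def __init__(self, peers, period=1.0, missed_limit=3, start=0.0):
        # A peer is treated as failed once no heartbeat has arrived for
        # more than `period * missed_limit` seconds (an assumed rule).
        self.timeout = period * missed_limit
        self.last_seen = {peer: start for peer in peers}

    def record_heartbeat(self, peer, now):
        """Note that a heartbeat message from `peer` arrived at time `now`."""
        self.last_seen[peer] = now

    def failed_peers(self, now):
        """Return the peers whose heartbeats are overdue at time `now`."""
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(["B", "C"], period=1.0, missed_limit=3)
    for t in (1.0, 2.0, 3.0, 4.0):
        monitor.record_heartbeat("B", t)   # B keeps sending heartbeats
    monitor.record_heartbeat("C", 1.0)     # C falls silent after t = 1.0
    print(monitor.failed_peers(now=4.5))   # C is overdue, B is not
```

Under these assumptions the worst-case detection delay is roughly `period * missed_limit`, which illustrates the trade-off noted above: a shorter period detects failures sooner but sends more messages per unit time.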
An important challenge in constructing a highly available cluster membership is to enable the membership list to efficiently grow and support large numbers (i.e., hundreds) of hosts while maximizing the membership's coherency after cluster set changes. A cluster membership that has this property is said to be scalable. To make a cluster membership scalable, it is highly desirable that the overhead associated with heartbeat connections grows less quickly than the number of hosts. In addition, bottlenecks to scaling, such as the need for a fully shared, fixed bandwidth networking medium, should be avoided. As an example, a cluster membership in which each host has a heartbeat connection with all other members would not scale efficiently because the number of periodic heartbeat messages exchanged by the hosts would grow quadratically (that is, proportional to the square of the number of hosts).
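The quadratic growth noted above can be made concrete with a short calculation. The symbols $N$ (number of hosts), $L$ (number of heartbeat links), and $M$ (heartbeat messages sent per heartbeat period) are introduced here for illustration, and the calculation assumes that each host sends one heartbeat message per period to every other host.

```latex
\begin{align*}
  L(N) &= \binom{N}{2} = \frac{N(N-1)}{2}, \\
  M(N) &= N(N-1) = 2\,L(N).
\end{align*}
% Worked values under the stated assumption:
%   N = 4:    L = 6,     M = 12
%   N = 100:  L = 4950,  M = 9900
```

Under this assumption, growing a cluster from 4 to 100 hosts multiplies the host count by 25 but the number of heartbeat links by 825.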
A typical method for establishing a cluster membership in small clusters (typically, two to four hosts) is for each host in the cluster to have a heartbeat connection with all other hosts. For example, FIG. 2 shows a cluster set with four hosts and six heartbeat links. The hosts can detect whenever a change in the cluster set occurs. As noted above, the overhead required for each host to maintain a membership list in this manner grows nonlinearly with the size of the cluster set. To reduce this overhead, heartbeat messages may be broadcast or multicast to all other hosts. In this case, the message traffic still grows linearly with the number of hosts. However, this approach requires that the computer network provide efficient hardware support for multicast, and the use of a shared network switch to support multicast can become a bottleneck since the switch's bandwidth is fixed. To avoid the use of multicast, all heartbeat connections can connect to a single “master” host within the cluster, analogous to attaching the spokes of a wheel to its axle. This arrangement unfortunately makes the master host a single point of failure and a bottleneck for the entire cluster. Hence, these communications architectures have inherent limitations that keep them from scaling to handle large cluster set sizes.
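The difference between the all-to-all and master arrangements can be sketched as follows. This Python fragment is illustrative only: the function names `full_mesh`, `star`, and `links_after_failure` and the host labels are assumptions made for the example, and heartbeat links are modeled simply as unordered pairs of hosts.

```python
from itertools import combinations


def full_mesh(hosts):
    """Every host has a heartbeat link with every other host."""
    return {frozenset(pair) for pair in combinations(hosts, 2)}


def star(hosts, master):
    """Every non-master host has a single heartbeat link to the master."""
    return {frozenset((master, host)) for host in hosts if host != master}


def links_after_failure(links, failed):
    """Return the heartbeat links that remain once `failed` stops responding."""
    return {link for link in links if failed not in link}


if __name__ == "__main__":
    hosts = ["A", "B", "C", "D"]
    mesh = full_mesh(hosts)
    hub = star(hosts, master="A")
    print(len(mesh), len(hub))                   # link counts for four hosts
    print(len(links_after_failure(mesh, "A")))   # survivors stay linked
    print(len(links_after_failure(hub, "A")))    # no links remain
```

With four hosts the all-to-all arrangement has six links, matching the count given for FIG. 2, while the master arrangement has three. If the master fails, however, the surviving hosts in the master arrangement are left with no heartbeat links among themselves, which is the single point of failure described above.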
To avoid these difficulties, large cluster memberships (e.g., with hundreds of hosts) are often created by loosely aggregating many individual hosts or small clusters (called sub-clusters) whose memberships may be maintained in a manner such as that described above. To minimize overhead, these large “peer-to-peer” memberships do not maintain periodic heartbeat connections between sub-clusters. As a result, the global cluster membership can become partitioned and/or the membership lists may diverge for long periods of time when cluster set changes occur. This method for maintaining a cluster membership is not sufficiently coherent to closely coordinate the actions of all cluster members in partitioning a shared workload. It is this limitation of the prior attempts for large cluster memberships that the invention seeks to remedy.
It is important to distinguish the invention described herein from the prior attempts in the design of physical networks for communications between computers. In the prior attempts, such as in the design of parallel supercomputers, physical point-to-point computer interconnection networks have been proposed or implemented which contain nearest neighbor and multi-hop interconnections. For example, the interconnection network in FIG. 3 shows nearest neighbor and multi-hop physical communication links between hosts. These computer networks have been devised to allow the efficient construction of networking hardware to support large numbers of interconnected computers. However, these networks strictly represent a physical communications medium and do not specify a software architecture or message communication pattern that implements an intended behavior. In contrast, the invention describes a software architecture for constructing heartbeat connections between computers for the purpose of detecting and recovering from host failures in order to construct a scalable, highly available cluster membership. The heartbeat connections communicate over an unspecified underlying physical network, for example, a network such as that shown in FIG. 3, a shared Ethernet switch, a crossbar network, or a mesh backplane.