The problems associated with providing membership services in a distributed computer system have generated a considerable amount of interest in both academic and industrial fronts. The Parallel Database (PDB) system available from Sun Microsystems, Inc. of Palo Alto, Calif., being a distributed system, has used the cluster membership monitor to provide mechanisms to keep track of the member nodes and to coordinate the reconfiguration of the cluster applications and services when the cluster membership changes. Herein, we define the general problem of membership in a cluster of computers where the nodes of the cluster may not be fully connected and we propose a solution to it.
The general problem of membership can be encapsulated by the design goals for the membership algorithm that are outlined below. We will further describe the problems that we are trying to address after we state these goals.
1. A uniform and robust membership algorithm regardless of the system architecture that is able to tolerate consecutive failures of nodes, links, storage devices or the communication medium. Stated in other words, no single point of failure should result in cluster unavailability. PA0 2. Data integrity is never jeopardized even in the presence of multiple and simultaneous faults. This is accomplished by:
(a) Having only one cluster with majority quorum operational at any given time. PA1 (b) The cluster with majority quorum should never reach inconsistent agreement. PA1 (c) Removal of isolated and faulty nodes from the cluster in a bounded time. PA1 (d) Timely fencing of non-member nodes from the shared resources. PA1 Failure of a Node: When a node fails, the remaining nodes will initiate a cluster reconfiguration resulting in a membership that will not include the failed node. PA1 Joining of a Node: A node can join a cluster after the node is restarted and after other members of the cluster accepted it as a new member, following a reconfiguration. PA1 Voluntary Leave: A node can leave the cluster voluntarily, and the remaining members of the cluster will reconfigure into the next generation of the cluster. PA1 Communication Failures: The cluster membership monitor handles communication failures that isolate one or more nodes from those nodes with a majority quorum. Note that the detection of the communication failure, i.e. detecting that the communication graph is not fully connected, is the responsibility of the communication monitor which is not part of the membership monitor. It is assumed that the communication monitor will notify the membership monitor of communication failures and that the membership monitor will handle this via a reconfiguration. PA1 Node Failures: A node is considered to have failed when it stops sending its periodic heart-beat messages (SCI or CMM) to other members of the cluster. Furthermore, nodes are considered to behave in a non-malicious fashion, a node that is considered failed by the system will not try to intentionally send conflicting information to other members of the cluster. It is possible for nodes to fail intermittently, as in the case of a temporary deadlock, or to be viewed as failed by only part of the remaining system, as in the case of a failed adaptor or switch. The cluster membership monitor should be able to handle all these cases and should remove failed nodes from the system within a bounded time. PA1 Communication Failures: The private communication medium may fail due to a failure of a switch, a failure of an adaptor card, a failure of the cable, o failure of various software layers. These failures are masked by the cluster communication monitor (CCM or CIS) so that the cluster membership monitor does not have to deal with the specific failure. In addition, the cluster membership monitor will either send its messages through all available links of the medium. Hence, failure of any individual link does not affect the correct operation of the CMM and the only communication failure affecting CMM is the total loss of communication with a member node. This is equivalent to a node failure as there are no physical paths to send a heart-beat message over the private communication medium. It is important to note that in a switched architecture, such as in the 2.0 release of Energizer, the failure of all switches is logically equivalent to the simultaneous failure of n-1 nodes where n is the number of nodes in the system. PA1 Device Failures: Devices that affect the operation of the cluster membership monitor are the quorum devices. Traditionally these have been disk controllers on the Sparc Storage Arrays (SSA's), however, in some distributed computer systems, a disk can also be used as a quorum device. Note that the failure of the quorum device is equivalent to the failure of a node and that the CMM in some conventional systems will not use a quorum device unless it is running on a two node cluster. PA1 Loss of one of its nodes. PA1 Loss of the private communication medium. Note that this case is logically equivalent to the loss of one node. PA1 Loss of the quorum device. PA1 Loss of one of its nodes and the communication medium. Note that this case is logically equivalent to the loss of one node.
The hardware architecture of some conventional distributed computer systems poses specific problems for the membership algorithm. For example, consider the configuration shown in FIG. 1. In this figure each of nodes 100A-D is supposed to be connected to two switches 101-102; however, there are two link failures that effectively disallow nodes 100A and 100D from communicating with each other. Some conventional membership algorithms are not capable of dealing with such a failure and will not reach an agreement on a surviving majority quorum. Those algorithms assume that the nodes are fully connected and do not deal with the problem of a partitioned network. What is needed is a generalized algorithm that deals with the issue of a partitioned network as well as networks that are not partitioned.
Further complications arise when we need to make decisions about split-brain, or possible split-brain situations. For example, consider the configuration shown in FIG. 2. In this configuration if the communication between nodes {200A, 200B} and {200C, 200D} is lost so that there are two sub-clusters with equal number of nodes, then the current quorum algorithm may lead to the possible shut-down of the entire cluster. Other situations during which the current algorithm is not capable of dealing with include when there are two nodes in the system and they do not share an external device.
The above examples illustrate a new set of problems for the membership and quorum algorithms that were not possible under the more simplistic architecture of some conventional distributed computer systems where a fully connected network was assumed. Our approach to solving these new problems is to integrate the membership and quorum algorithms more closely and to provide a flexible algorithm that would maximize the cluster availability and performance as viewed by the user.
A further impact of the configuration of external devices is the issue of failure fencing. In a clustered system the shared resources (often disks) are fenced against intervention from nodes that are not part of the cluster. In some distributed computer systems, the issue of fencing was simple due to the fact that only two nodes existed in a cluster and they were connected to all the shared resources. The node that remained in the cluster would reserve all the shared resources and would disallow the non-member node from accessing these resources until that node became part of the cluster. Such a simple operation is not possible for an architecture in which all disks are not connected to all nodes. Given that the SPARC Storage Arrays (SSA's) are only dual ported, there needs to be a new way that would effectively fence a non-member node out of the shared resources.
The Cluster Membership Monitor, CMM, which is responsible for the membership, quorum and failure fencing algorithms, handles state transitions which lead to changes in the membership. These transitions are listed below.
It is also important to note that the CMM does not guarantee the health of the overall system or that the applications are present on any given node. The only guarantees made by the CMM is that the system's hardware is up and running and that the operating system is present and functioning.
We would like to explicitly define what failures are considered in the design of the system. There are three failures that we consider; node failures, communication failures, and device failures. Note that the failures of the client nodes, terminal concentrators, and the administration workstation are not considered to be failures within "our" system.
Some distributed computer systems are specified to have no single point of failures. Therefore, the system must tolerate the failure of a single node as well as consecutive failures of n-1 nodes of the system. Given the above discussion on communication failures, this specification implies that we cannot tolerate the total loss of the communication medium in such a system. While it may not be possible, or desirable, to tolerate a total loss of the private communication medium, it should be possible to tolerate more than a single failure at any given time. First, let us define what a cluster is and how various failures affect it.
A cluster is defined as having N nodes, a private communication medium, and a quorum mechanism, where the total failure of the private communication medium is equivalent to the failure of N-1 nodes and the failure of the quorum mechanism is equivalent to the failure of one node.
Now we can make the following fault-tolerance goal for the cluster membership monitor;
A cluster with N nodes, where N.gtoreq.3, a private communication medium, and a quorum mechanism, should be able to provide services and access to the data, however partial, in the case of .left brkt-top.N/2.right brkt-top.-1 node failures. For a two node cluster the cluster can tolerate only one of the following failures:
Note that the total loss of the communication medium in a system with more than 2 nodes is indeed a double failure (as both switches would have to be non-operational) and the system is not required to tolerate such a failure.