Distributed computing systems are an increasingly important part of research, governmental, and enterprise computing systems. Among the advantages of such computing systems are their ability to handle a variety of different computing scenarios including large computational problems, high volume data processing situations, and high availability situations. For applications that require the computer system to be highly available, e.g., the ability to maintain the system while still providing services to system users, a cluster of computer systems is a useful implementation of the distributed computing model. In the most general sense, a cluster is a distributed computer system that works together as a single entity to cooperatively provide processing power and mass storage resources. With a cluster, the processing load of the computer system is typically spread over more than one computer, thereby eliminating single points of failure. Consequently, programs executing on the cluster can continue to function despite a problem with one computer in the cluster. In another example, one or more computers of the cluster can be ready for use in the event that another computer in the cluster fails. While each computer in a cluster typically executes an independent instance of an operating system, additional clustering software is executed on each computer in the cluster to facilitate communication and desired cluster behavior.
FIG. 1 illustrates a simplified example of a cluster 100. The members of the cluster include Server A 110 and Server B 120. As members of cluster 100, servers 110 and 120 are often referred to as “hosts” or “nodes.” Thus, a node in a computer cluster is typically an individual computer system having some or all of the common features of such a system, as is well known in the art. FIG. 8 (described below) illustrates some of the features common to cluster nodes. Another common feature of a cluster is the ability of the nodes to exchange data. In the example of FIG. 1, servers 110 and 120 can exchange data over network 150, typically a local area network (LAN), e.g., an enterprise-wide intranet, or a wide area network (WAN) such as the Internet. Additionally, network 150 provides a communication path for various client computer systems 140 to communicate with servers 110 and 120. In addition to network 150, servers 110 and 120 can communicate with each other over private network 130. As shown, private network 130 is only accessible by cluster nodes, i.e., Server A 110 and Server B 120. To support the high availability of cluster 100, private network 130 typically includes redundancy such as two network paths instead of one. Private network 130 is used by the nodes for cluster service message passing including, for example, the exchange of so-called “heartbeat” signals indicating that each node is currently available to the cluster and functioning properly.
Other elements of cluster 100 include storage area network (SAN) 160, SAN switch 170, and storage array 180. As shown in FIG. 1, both Server A 110 and Server B 120 utilize multiple communications paths to SAN switch 170. SAN switch 170 and storage array 180 are two examples of shared resources. The most common shared resource in a cluster is some form of shared data resource, such as one or more disk drives. Using a shared data resource gives different nodes in the cluster access to the same data, a feature that is critical for most cluster applications. Although a disk device (and various related devices such as storage array 180) is perhaps the most common example of both a shared resource and a shared data resource, a variety of other types of devices will be well known to those having ordinary skill in the art. Moreover, although servers 110 and 120 are shown connected to storage array 180 through SAN switch 170 and SAN 160, this need not be the case. Shared resources can be directly connected to some or all of the nodes in a cluster, and a cluster need not include a SAN. Alternatively, servers 110 and 120 can be connected to multiple SANs. Additionally, SAN switch 170 can be replaced with a SAN router or a SAN hub.
One well-known problem among computer system clusters occurs when one or more of the nodes of the cluster erroneously believes that other node(s) are either not functioning properly or have left the cluster. This “split-brain” condition results in the effective partitioning of the cluster into two or more subclusters. Causes of the split-brain condition include failure of the communication channels between nodes, e.g., failure of private network 130, and the processing load on one node causing an excessive delay in the normal sequence of communication among nodes, e.g., one node fails to transmit its heartbeat signal for an excessive period of time. For example, suppose cluster 100 is configured for failover operation, with an application program such as a customer order entry system operating on Server A 110, and with Server B 120 present in the cluster to take over for Server A should it fail. A complete failure of private network 130 would then lead Server B to conclude that Server A has failed. Server B then begins operation even though Server A has not in fact failed. Thus, the potential exists that the two servers might attempt to write data to the same portion of storage array 180, thereby causing data corruption. The solution is to ensure that one of the nodes cannot access the shared resource, i.e., to “fence off” the node from the resource.
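The failover scenario above can be illustrated with a minimal sketch. It is a toy simulation only, not part of any described system: the `Node` class, the `simulate` function, the one-second heartbeat interval, and the three-second timeout are all hypothetical choices made for illustration. Once the private link fails, each node stops hearing the other and, after the timeout, each presumes the other has failed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node that tracks when it last heard each peer."""
    name: str
    timeout: float  # seconds of silence before a peer is presumed failed (assumed value)
    last_seen: dict = field(default_factory=dict)

    def receive_heartbeat(self, peer: str, now: float) -> None:
        self.last_seen[peer] = now

    def presumed_failed(self, peer: str, now: float) -> bool:
        # A peer never heard from is treated as silent forever.
        return now - self.last_seen.get(peer, float("-inf")) > self.timeout


def simulate(link_fails_at: float, end: float,
             interval: float = 1.0, timeout: float = 3.0):
    """Exchange heartbeats every `interval` seconds until the private link fails.

    Returns (A presumes B failed, B presumes A failed) evaluated at time `end`.
    Both nodes keep running throughout; only the link between them is lost.
    """
    a, b = Node("A", timeout), Node("B", timeout)
    t = 0.0
    while t <= end:
        if t < link_fails_at:  # heartbeats are delivered only while the link is up
            a.receive_heartbeat("B", t)
            b.receive_heartbeat("A", t)
        t += interval
    return a.presumed_failed("B", end), b.presumed_failed("A", end)


if __name__ == "__main__":
    # Link fails at t=5: both healthy nodes conclude the other is down (split-brain).
    print(simulate(link_fails_at=5.0, end=10.0))
    # Link never fails within the run: neither node suspects the other.
    print(simulate(link_fails_at=100.0, end=10.0))
```

In the first call both nodes remain healthy, yet each would begin acting as the sole active node, which is the situation in which fencing is needed.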
Cluster partitioning can take a variety of other forms and have a variety of detrimental effects. For example, a node might attempt to reenter a cluster after the node has already been successfully excluded from the cluster. Thus, the reentering node might encounter a cluster environment set up to exclude the node and interpret that instead as a partition event. Additionally, cluster partitioning can be problematic even though there is no shared resource among the cluster nodes. For example, if one node of a cluster is supposed to be the node interacting with a client device and another node detects a cluster partition, the client device could ultimately communicate with the wrong node, thereby leading to some manner of error.
One previous fencing mechanism involves terminating operation of one of the nodes before the takeover occurs. This typically requires platform-specific hardware for each of the nodes. Moreover, it is difficult (and potentially expensive) to scale this solution as the number of nodes in the cluster increases. Also, such a system can be difficult to administer. Another solution is to make use of primitive reservation and release functionality available with certain shared resources. For example, shared disk drives supporting version 2 of the Small Computer System Interface (SCSI-2) allow devices accessing the disk drives to reserve a disk drive using the SCSI-2 “reserve” command and subsequently release the disk drive for use by another device via the “release” command. Unfortunately, SCSI-2 reserve and release settings are cleared when there is a bus reset to the disk drive, and thus there is no guarantee that the reservation will not be cleared when it is most needed. Additionally, SCSI-2 reserve and release commands do not work with dynamic multipath devices or with clusters having more than two nodes.
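The weakness of reserve and release under a bus reset can be shown with a simplified model. The `ToyDisk` class below is hypothetical and does not issue real SCSI commands; it only mimics the behavior described above, in which a reservation blocks other initiators until it is released or until a bus reset clears it.

```python
class ToyDisk:
    """Hypothetical in-memory stand-in for a disk with SCSI-2 style reservations."""

    def __init__(self):
        self.holder = None  # initiator currently holding the reservation, if any

    def reserve(self, initiator: str) -> bool:
        # Succeeds if the disk is unreserved or already reserved by this initiator.
        if self.holder in (None, initiator):
            self.holder = initiator
            return True
        return False

    def release(self, initiator: str) -> None:
        # Only the holder can release its own reservation.
        if self.holder == initiator:
            self.holder = None

    def bus_reset(self) -> None:
        # Modeled per the text above: a bus reset clears the reservation.
        self.holder = None

    def write_allowed(self, initiator: str) -> bool:
        # Writes are refused only while a different initiator holds the reservation.
        return self.holder in (None, initiator)


if __name__ == "__main__":
    disk = ToyDisk()
    disk.reserve("server_b")                # Server B fences Server A off
    print(disk.write_allowed("server_a"))   # Server A is blocked
    disk.bus_reset()                        # reservation is silently cleared
    print(disk.write_allowed("server_a"))   # Server A can write again
```

After the reset, nothing in this model prevents the fenced node from writing, which is why a reservation of this kind cannot be relied upon at the moment it matters most.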
Accordingly, it is desirable to have a scalable, flexible, and robust I/O fencing scheme for handling cluster split-brain conditions in order to prevent data corruption on a shared data resource used by the cluster.