Distributed computing systems are an increasingly important part of research, governmental, and enterprise computing. Among the advantages of such computing systems are their ability to handle a variety of different computing scenarios including large computational problems, high volume data processing situations, and high availability situations. For applications that require the computer system to be highly available, e.g., where the system must be maintained while still providing services to system users, a cluster of computer systems is a useful implementation of the distributed computing model. In the most general sense, a cluster is a distributed computer system that works together as a single entity to cooperatively provide processing power and mass storage resources. With a cluster, the processing load of the computer system is typically spread over more than one computer, thereby eliminating single points of failure. Consequently, programs executing on the cluster can continue to function despite a problem with one computer in the cluster. In another example, one or more computers of the cluster can be ready for use in the event that another computer in the cluster fails. While each computer in a cluster typically executes an independent instance of an operating system, additional clustering software is executed on each computer in the cluster to facilitate communication and desired cluster behavior.
FIG. 1 illustrates a simplified example of a cluster 100. The members of the cluster include Server A 110 and Server B 120. As members of cluster 100, servers 110 and 120 are often referred to as “hosts” or “nodes.” Thus, a node in a computer cluster is typically an individual computer system having some or all of the common features of such a system, as is well known in the art. FIG. 6 (described later in this application) illustrates some of the features common to cluster nodes. Another common feature of a cluster is the ability of the nodes to exchange data. In the example of FIG. 1, servers 110 and 120 can exchange data over network 150, typically a local area network (LAN), e.g., an enterprise-wide intranet, or a wide area network (WAN) such as the Internet. Additionally, network 150 provides a communication path for various client computer systems 140 to communicate with servers 110 and 120. In addition to network 150, servers 110 and 120 can communicate with each other over private network 130. As shown, private network 130 is only accessible by cluster nodes, i.e., Server A 110 and Server B 120. To support the high availability of cluster 100, private network 130 typically includes redundancy such as two network paths instead of one. Private network 130 is used by the nodes for cluster service message passing including, for example, the exchange of so-called “heartbeat” signals indicating that each node is currently available to the cluster and functioning properly.
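The heartbeat exchange described above can be illustrated with a minimal sketch. The class name, the node names, and the five-second timeout below are hypothetical and chosen only for illustration; they are not taken from any particular clustering product. The sketch records when each peer was last heard from and reports peers that have been silent for longer than the timeout. Note that silence alone does not distinguish a failed node from a failed or slow communication path.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time seen from each peer node (illustrative only)."""

    timeout_s: float  # hypothetical threshold: longer silence marks a peer as suspect
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        # Store the time at which a heartbeat from `node` arrived.
        self.last_seen[node] = now

    def suspects(self, now: float) -> list:
        # Peers whose most recent heartbeat is older than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=5.0)
monitor.record("server_a", now=0.0)
monitor.record("server_b", now=0.0)
monitor.record("server_b", now=4.0)   # server_b keeps sending heartbeats
suspect_nodes = monitor.suspects(now=6.0)
print(suspect_nodes)  # ['server_a']
```

Timestamps are passed in explicitly rather than read from a clock so that the behavior is deterministic and easy to check.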
Other elements of cluster 100 include storage area network (SAN) 160, SAN switch 165, and storage devices such as tape library 170 (typically including one or more tape drives), a group of disk drives 180 (i.e., “just a bunch of disks” or “JBOD”), and intelligent storage array 190. These devices are examples of the types of storage used in cluster 100. Other storage schemes include the use of shared direct-attached storage (DAS) over shared SCSI buses. SAN 160 can be implemented using a variety of different technologies including Fibre Channel arbitrated loop (FCAL), Fibre Channel switched fabric, IP networks (e.g., iSCSI), InfiniBand, etc.
SAN switch 165 and storage devices 170, 180, and 190 are examples of shared resources. The most common shared resource in a cluster is some form of shared data resource, such as one or more disk drives. Using a shared data resource gives different nodes in the cluster access to the same data, a feature that is critical for most cluster applications. Although a disk device is perhaps the most common example of both a shared resource and a shared data resource, a variety of other types of devices will be well known to those having ordinary skill in the art. Moreover, although servers 110 and 120 are shown connected to the storage devices through SAN switch 165 and SAN 160, this need not be the case. Shared resources can be directly connected to some or all of the nodes in a cluster, and a cluster need not include a SAN. Alternatively, servers 110 and 120 can be connected to multiple SANs. Additionally, SAN switch 165 can be replaced with a SAN router or a SAN hub.
One known problem among computer system clusters occurs when one or more of the nodes of the cluster erroneously believe that other node(s) are either not functioning properly or have left the cluster. This “split-brain” condition results in the effective partitioning of the cluster into two or more subclusters. Causes of the split-brain condition include failure of the communication channels between nodes, e.g., failure of private network 130, and the processing load on one node causing an excessive delay in the normal sequence of communication among nodes, e.g., one node fails to transmit its heartbeat signal for an excessive period of time. For example, if cluster 100 is configured for failover operation with an application program operating on Server A 110 and Server B 120 existing in the cluster to take over for Server A should it fail, then complete failure of private network 130 would lead Server B to conclude that Server A has failed. Server B then begins operation even though Server A has not in fact failed. Thus, the potential exists that the two servers might attempt to write data to the same portion of one of the storage devices, thereby causing data corruption. The solution is to ensure that one of the nodes cannot access the shared resource, i.e., to “fence off” the node from the resource.
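The failover scenario above can be modeled with a short sketch. Everything here is a hypothetical toy model: the `SharedDisk` class, its `fence_off` method, and the choice to have the surviving node fence off its peer are assumptions made for illustration, not a description of any particular fencing mechanism. The sketch shows only the basic idea that once a node is fenced off, the shared resource rejects its writes, so the two servers cannot both modify the same block.

```python
class FencedError(Exception):
    """Raised when a fenced-off node attempts to write."""


class SharedDisk:
    """Toy shared storage device that honors a fence list (hypothetical model)."""

    def __init__(self, nodes):
        self.allowed = set(nodes)  # nodes currently permitted to write
        self.blocks = {}           # block number -> (writer, data)

    def fence_off(self, node):
        # Revoke the node's access to the shared resource.
        self.allowed.discard(node)

    def write(self, node, block, data):
        if node not in self.allowed:
            raise FencedError(f"{node} is fenced off from the shared resource")
        self.blocks[block] = (node, data)


disk = SharedDisk(["server_a", "server_b"])
disk.write("server_a", 7, "a-v1")

# The private network fails. server_b wrongly concludes that server_a is down
# and, in this sketch, fences server_a off before taking over the application.
disk.fence_off("server_a")
disk.write("server_b", 7, "b-v1")

# server_a is still running and tries to write to the same block.
try:
    disk.write("server_a", 7, "a-v2")
    outcome = "conflicting write accepted"
except FencedError:
    outcome = "conflicting write rejected"

print(outcome)  # conflicting write rejected
```

Without the `fence_off` call, the final write from `server_a` would succeed and silently overwrite the data written by `server_b`, which is the corruption risk described above.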
Cluster partitioning can take a variety of other forms and have a variety of detrimental effects. For example, a node might attempt to reenter a cluster after the node has already been successfully excluded from the cluster. Thus, the reentering node might encounter a cluster environment set up to exclude the node and interpret that instead as a partition event. Additionally, cluster partitioning can be problematic even though there is no shared resource among the cluster nodes, so-called “shared nothing” clusters. For example, if one node of a cluster is supposed to be the node interacting with a client device and another node detects a cluster partition, the client device could ultimately communicate with the wrong node, thereby leading to some manner of error.
Many existing solutions to the split-brain problem focus on a single technique or mechanism for determining which nodes should remain in a cluster and how to protect shared data subsequent to a cluster partition event. One example of such a solution can be found in the pending U.S. patent application Ser. No. 10/105,771, entitled “System and Method for Preventing Data Corruption in Computer System Clusters,” naming Bob Schatz and Oleg Kiselev as inventors, and filed on Mar. 25, 2002 (“the '771 application”) which is hereby incorporated by reference herein in its entirety.
While techniques such as those described in the '771 application adequately address split-brain problems, they may suffer some other deficiency that makes them less desirable. For example, fencing techniques that make use of SCSI-3 persistent reservation commands (such as those described in the '771 application) can require the use of specialized hardware such as SCSI-3 compliant devices. This requirement may impose certain cost or flexibility restrictions that make the particular technique less desirable. Moreover, some cluster implementations may benefit from the use of multiple different fence mechanisms, rather than a single fence mechanism.
Accordingly, it is desirable to have a generalized I/O fencing framework for providing and using one or more scalable, flexible, and robust I/O fencing schemes for handling cluster partition conditions.