Distributed computing systems are an increasingly important part of research, governmental, and enterprise computing systems. Among the advantages of such computing systems are their ability to handle a variety of different computing scenarios including large computational problems, high volume data processing situations, and high availability situations. For applications that require the computer system to be highly available, e.g., applications that require the system to be maintained while it continues to provide services to system users, a cluster of computer systems is a useful implementation of the distributed computing model. In the most general sense, a cluster is a distributed computer system that works together as a single entity to cooperatively provide processing power and mass storage resources. With a cluster, the processing load of the computer system is typically spread over more than one computer, thereby eliminating single points of failure. Consequently, programs executing on the cluster can continue to function despite a problem with one computer in the cluster. In another example, one or more computers of the cluster can be ready for use in the event that another computer in the cluster fails. While each computer in a cluster typically executes an independent instance of an operating system, additional clustering software is executed on each computer in the cluster to facilitate communication and desired cluster behavior.
FIG. 1 illustrates a simplified example of a cluster 100. The members of the cluster include Server A 110 and Server B 120. As members of cluster 100, servers 110 and 120 are often referred to as “hosts” or “nodes.” Thus, a node in a computer cluster is typically an individual computer system having some or all of the features common to such systems, as is well known in the art. FIG. 11 (described later in this application) illustrates some of the features common to cluster nodes. Another common feature of a cluster is the ability of the nodes to exchange data. In the example of FIG. 1, servers 110 and 120 can exchange data over network 150, typically a local area network (LAN), e.g., an enterprise-wide intranet, or a wide area network (WAN) such as the Internet. Additionally, network 150 provides a communication path for various client computer systems 140 to communicate with servers 110 and 120. In addition to network 150, servers 110 and 120 can communicate with each other over private network 130. As shown, private network 130 is only accessible by cluster nodes, i.e., Server A 110 and Server B 120. To support the high availability of cluster 100, private network 130 typically includes redundancy such as two network paths instead of one. Private network 130 is used by the nodes for cluster service message passing including, for example, the exchange of so-called “heartbeat” signals indicating that each node is currently available to the cluster and functioning properly.
Other elements of cluster 100 include storage area network (SAN) 160, SAN switch 165, and storage devices such as tape library 170 (typically including one or more tape drives), a group of disk drives 180 (i.e., “just a bunch of disks” or “JBOD”), and intelligent storage array 190. These devices are examples of the type of storage used in cluster 100. Other storage schemes include the use of shared direct-attached storage (DAS) over shared SCSI buses. SAN 160 can be implemented using a variety of different technologies including Fibre Channel arbitrated loop (FCAL), Fibre Channel switched fabric, IP networks (e.g., iSCSI), InfiniBand, etc.
SAN switch 165 and storage devices 170, 180, and 190 are examples of shared resources. The most common shared resource in a cluster is some form of shared data resource, such as one or more disk drives. Using a shared data resource gives different nodes in the cluster access to the same data, a feature that is critical for most cluster applications. Although a disk device is perhaps the most common example of both a shared resource and a shared data resource, a variety of other types of devices will be well known to those having ordinary skill in the art. Moreover, although servers 110 and 120 are shown connected to the storage devices through SAN switch 165 and SAN 160, this need not be the case. Shared resources can be directly connected to some or all of the nodes in a cluster, and a cluster need not include a SAN. Alternatively, servers 110 and 120 can be connected to multiple SANs. Additionally, SAN switch 165 can be replaced with a SAN router or a SAN hub.
One known problem among computer system clusters occurs when one or more of the nodes of the cluster erroneously believe that other node(s) are either not functioning properly or have left the cluster. This “split-brain” condition results in the effective partitioning of the cluster into two or more subclusters. Causes of the split-brain condition include failure of the communication channels between nodes, e.g., failure of private network 130, and the processing load on one node causing an excessive delay in the normal sequence of communication among nodes, e.g., one node fails to transmit its heartbeat signal for an excessive period of time. For example, if cluster 100 is configured for failover operation, with an application program operating on Server A 110 and with Server B 120 existing in the cluster to take over for Server A should it fail, then complete failure of private network 130 would lead Server B to conclude that Server A has failed. Server B then begins operation even though Server A has not in fact failed. Thus, the potential exists that the two servers might attempt to write data to the same portion of one of the storage devices, thereby causing data corruption. The solution is to ensure that one of the nodes cannot access the shared resource, i.e., to “fence off” the node from the resource.
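The split-brain scenario described above can be illustrated with a minimal sketch of heartbeat-timeout failure detection. The names `Node`, `simulate`, and the five-second `HEARTBEAT_TIMEOUT` are hypothetical and chosen only for illustration; actual clustering software uses its own detection logic and timing. The sketch shows that when the private network fails, each node stops hearing the other and both independently conclude that the peer has failed.

```python
from dataclasses import dataclass, field

HEARTBEAT_TIMEOUT = 5.0  # seconds; hypothetical value for illustration


@dataclass
class Node:
    name: str
    # Maps peer name -> time at which the last heartbeat was received.
    last_heard: dict = field(default_factory=dict)

    def receive_heartbeat(self, peer: str, now: float) -> None:
        self.last_heard[peer] = now

    def suspected_failed(self, now: float) -> list:
        # A peer is suspected once its heartbeat is older than the timeout.
        return sorted(
            peer for peer, t in self.last_heard.items()
            if now - t > HEARTBEAT_TIMEOUT
        )


def simulate(link_fails_at: float, end: float):
    """Exchange one heartbeat per second until the private link fails."""
    a, b = Node("A"), Node("B")
    t = 0.0
    while t <= end:
        if t < link_fails_at:  # heartbeats are delivered only while the link is up
            a.receive_heartbeat("B", t)
            b.receive_heartbeat("A", t)
        t += 1.0
    return a.suspected_failed(end), b.suspected_failed(end)


if __name__ == "__main__":
    # Both nodes are healthy, yet each suspects the other: a split-brain condition.
    print(simulate(link_fails_at=3.0, end=10.0))
```

Because neither node can distinguish a failed peer from a failed link using heartbeats alone, some additional arbitration and fencing step is needed before either node writes to shared storage.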
Cluster partitioning can take a variety of other forms and have a variety of detrimental effects. For example, a node might attempt to reenter a cluster after the node has already been successfully excluded from the cluster. Thus, the reentering node might encounter a cluster environment set up to exclude the node and interpret that instead as a partition event. Additionally, cluster partitioning can be problematic even though there is no shared resource among the cluster nodes. For example, if one node of a cluster is supposed to be the node interacting with a client device and another node detects a cluster partition, the client device could ultimately communicate with the wrong node, thereby leading to some manner of error.
Many prior art fencing mechanisms and cluster partition recovery schemes rely on cluster nodes and the cluster software operating on those nodes to have direct access to and/or control of certain shared resources. For example, such mechanisms or schemes might utilize specific coordinator or quorum disk drives, control of which is used to determine which node or nodes survive a cluster partition event and which should be fenced off. These schemes present a number of disadvantages when certain storage virtualization techniques, e.g., out-of-band virtualization techniques, are employed.
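As a generic illustration of the coordinator-disk approach mentioned above, the following sketch models a race for control of a set of coordinator disks. It is an in-memory simulation only: the `CoordinatorDisk` class, the `arbitrate` function, and the strict-majority rule are assumptions made for illustration and do not reproduce any particular prior art mechanism. The point is that each subcluster needs direct access to the coordinator disks in order to take part in the race.

```python
class CoordinatorDisk:
    """In-memory stand-in for a coordinator disk; the first claimant keeps it."""

    def __init__(self):
        self.owner = None

    def try_claim(self, subcluster: str) -> bool:
        if self.owner is None:
            self.owner = subcluster
        return self.owner == subcluster


def arbitrate(disks, claim_order):
    """Apply claim attempts in arrival order and return the surviving subcluster.

    claim_order is a sequence of (subcluster, disk_index) pairs. The subcluster
    that controls a strict majority of the disks survives; None means no winner.
    """
    for subcluster, index in claim_order:
        disks[index].try_claim(subcluster)

    counts = {}
    for disk in disks:
        if disk.owner is not None:
            counts[disk.owner] = counts.get(disk.owner, 0) + 1

    for subcluster, held in counts.items():
        if held > len(disks) // 2:
            return subcluster
    return None


if __name__ == "__main__":
    disks = [CoordinatorDisk() for _ in range(3)]
    # Subcluster A reaches disks 0 and 2 first; B reaches disk 1 first.
    order = [("A", 0), ("B", 0), ("B", 1), ("A", 1), ("A", 2), ("B", 2)]
    print(arbitrate(disks, order))  # the loser would then be fenced off
```

Using an odd number of disks with a majority rule is one common way to guarantee that at most one subcluster can win.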
For example, in-band and out-of-band storage virtualization, as opposed to host-based storage virtualization or storage-based virtualization, provide users with virtualization between the hosts and the storage. Using a storage appliance such as a specialized switch, router, server, or other storage device, in-band and out-of-band storage virtualization allow for the same level of control and centralization across the storage architecture. An in-band virtualization appliance is physically located between the host and the storage. The appliance takes the disk requests from the host and fulfills the host's request from the storage attached to the other side of the appliance. This functionality is essentially transparent to the host because the appliance presents itself as a disk. Out-of-band appliances logically present themselves as if they are located in the data path between the host and storage, but they actually reside outside of the data path. Thus, in an out-of-band implementation the data flow is separated from the control flow. This is accomplished, for example, with the installation of a “thin” virtualization driver on the host in the I/O data path. The out-of-band appliance provides the virtualization driver with the storage mappings. The virtualization driver presents virtual storage devices to the applications and file systems on the host and sends the blocks of data directly to the correct destinations on disks. In contrast, the in-band appliance requires no host-side changes. It acts as a surrogate for a virtual storage device and performs mapping and I/O direction in a device or computer system located outside of the host.
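The separation of control flow from data flow in an out-of-band arrangement can be sketched as follows. All names here (`OutOfBandAppliance`, `ThinVirtualizationDriver`, the extent tuple layout, and the volume and device names) are hypothetical, and the storage devices are modeled as in-memory dictionaries. The appliance only hands out mappings; the host-side driver translates virtual block addresses and performs the I/O directly against the devices.

```python
class OutOfBandAppliance:
    """Control path only: supplies storage mappings and never handles data blocks."""

    def __init__(self, mappings):
        # volume name -> list of (virtual_start, length, device, physical_start)
        self._mappings = mappings

    def get_mapping(self, volume):
        return list(self._mappings[volume])


class ThinVirtualizationDriver:
    """Host-side driver in the I/O path; presents one virtual volume."""

    def __init__(self, appliance, volume, devices):
        self.extents = appliance.get_mapping(volume)  # control flow
        self.devices = devices  # device name -> {physical block: data}

    def _resolve(self, vblock):
        # Translate a virtual block number into (device, physical block).
        for vstart, length, device, pstart in self.extents:
            if vstart <= vblock < vstart + length:
                return device, pstart + (vblock - vstart)
        raise ValueError(f"virtual block {vblock} is not mapped")

    def write(self, vblock, data):
        device, pblock = self._resolve(vblock)
        self.devices[device][pblock] = data  # data flow goes straight to the disk

    def read(self, vblock):
        device, pblock = self._resolve(vblock)
        return self.devices[device][pblock]


if __name__ == "__main__":
    devices = {"disk0": {}, "disk1": {}}
    appliance = OutOfBandAppliance(
        {"vol1": [(0, 100, "disk0", 500), (100, 100, "disk1", 0)]}
    )
    driver = ThinVirtualizationDriver(appliance, "vol1", devices)
    driver.write(150, b"payload")
    print(devices["disk1"][50])
```

In this arrangement the host sees only the virtual volume, which is one reason fencing schemes that expect direct control of specific physical disks may not carry over unchanged.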
Accordingly, it is desirable to have scalable, flexible, and robust I/O fencing schemes for handling cluster partition conditions in certain storage virtualization environments to prevent data corruption on a shared data resource used by the cluster.