A cluster, or plex, is a collection of loosely coupled computing nodes, each implemented by a standalone server running its own processes, with the cluster providing a single client view of network services and/or applications, such as databases, web services, and file services. These processes communicate with one another to form what looks like a single system that cooperatively provides applications, system resources, and data to users. Clusters may be used to provide scalability and/or highly available computing systems, i.e., systems that continue to run even when a failure occurs that would normally make a server system unavailable. Clusters are based upon well-defined membership that can be dynamically reconfigured to add a node, or to exclude a node in response to a failure, so as to provide high availability of the overall system. One or more nodes in a cluster may communicate with one or more storage devices via direct connection, over a public network, or via a private interconnect, for example. In general, a storage device, such as a magnetic disk or tape, an optical disk, or a solid state device, provides persistent storage of large amounts of data. A shared storage device may be accessed by one or more nodes that are in communication with the storage device.
Because cluster nodes share data and resources, dynamic reconfiguration as a result of a communication failure should not allow a cluster to separate into sub-clusters or partitions that are active at the same time. Otherwise, a condition known as split-brain may occur where each partition “believes” that it is the only partition, and multiple partitions may attempt to modify the shared data, resulting in loss of data integrity. A similar condition, referred to as amnesia, may occur when the cluster restarts after a shutdown with cluster configuration data that is older than the data in use at the time of the shutdown. This may result from starting the cluster on a node that was not in the last functioning cluster partition.
Split-brain and amnesia may be avoided by using a quorum strategy where each node is assigned one vote and each quorum device is assigned one less vote than the total number of voting nodes connected to that quorum device. Quorum devices may be implemented by a dual hosted or multi-hosted shared disk, by an appropriate Network Attached Storage (NAS) device, or by a quorum server process running on a quorum server machine, for example. In the event of a loss of communication between or among cluster nodes resulting in partitioning of the cluster, only the partition with the majority vote count, or quorum, is allowed to continue to access the quorum device. To protect the integrity of data stored on the shared storage device, nodes that are not currently active cluster members should not be allowed to modify that data. This feature may be referred to as fencing. A fencing subsystem may block all access to the shared storage device (both reads and writes), or may only block writes, as the primary concern is typically data integrity rather than data security.
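The vote arithmetic described above can be illustrated with a short sketch. The function names below (quorum_device_votes, total_votes, has_quorum) are hypothetical and chosen only for illustration; the sketch assumes that quorum means a strict majority of all configured votes, and it is not drawn from any particular cluster implementation.

```python
def quorum_device_votes(connected_voting_nodes: int) -> int:
    """Votes for one quorum device: one less than the voting nodes connected to it."""
    return max(connected_voting_nodes - 1, 0)


def total_votes(node_count: int, device_connections: list) -> int:
    """Total configured votes: one per node plus the votes of each quorum device.

    device_connections holds, for each quorum device, the number of voting
    nodes connected to that device.
    """
    return node_count + sum(quorum_device_votes(n) for n in device_connections)


def has_quorum(partition_votes: int, configured_votes: int) -> bool:
    """Assumption: quorum requires a strict majority of all configured votes."""
    return 2 * partition_votes > configured_votes


if __name__ == "__main__":
    # Two-node cluster with one quorum device connected to both nodes.
    configured = total_votes(node_count=2, device_connections=[2])  # 2 + 1 = 3
    device = quorum_device_votes(2)
    # The node that owns the quorum device counts its own vote plus the device vote.
    print("owner partition has quorum:", has_quorum(1 + device, configured))
    # The other node counts only its own vote.
    print("other partition has quorum:", has_quorum(1, configured))
```

In this two-node example the configured total is three votes, so a single node together with the quorum device holds two votes and forms a majority, while a single node on its own does not.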
Fencing limits node access to multihost devices by preventing write access to the disks. When a node departs the cluster (by failing or becoming partitioned, for example), fencing ensures that the node can no longer modify data on the disks. Only current member nodes have write access to the disks so that data integrity is ensured. Device services provide failover capability for services that use multihost devices. When a cluster member that currently serves as the primary (owner) of the device group fails or becomes unreachable, a new primary is chosen. The new primary enables access to the device group to continue with only minor interruption. During this process, the old primary must forfeit write access to the devices before the new primary can be started. However, when a member node departs the cluster and becomes unreachable, the cluster cannot inform that node to release the devices for which it was the owner. As such, the surviving members need a strategy to take control of global devices previously controlled by a departed node to provide continued access to the surviving members.
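The ordering constraint described above, in which the departed primary loses write access before a new primary is started, might be sketched as follows. The DeviceGroup class, the fail_over function, and the rule of promoting the first surviving candidate are all assumptions made for illustration; the text does not specify how the new primary is chosen.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceGroup:
    """Toy model of a device group: one primary plus the set of nodes allowed to write."""
    primary: str
    writers: set = field(default_factory=set)


def fail_over(group: DeviceGroup, departed: str, survivors: list) -> str:
    """Fence the departed node, then promote a survivor if the primary was lost.

    survivors is an ordered list of candidate primaries (hypothetical policy:
    the first candidate wins).
    """
    # Step 1: fence. The departed node cannot be asked to release the devices,
    # so the surviving members revoke its write access themselves.
    group.writers.discard(departed)

    # Step 2: only after fencing, start a new primary if the old one departed.
    if group.primary == departed:
        if not survivors:
            raise RuntimeError("no surviving member can take over the device group")
        group.primary = survivors[0]
    return group.primary


if __name__ == "__main__":
    group = DeviceGroup(primary="node1", writers={"node1", "node2", "node3"})
    new_primary = fail_over(group, departed="node1", survivors=["node2", "node3"])
    print("new primary:", new_primary)
    print("writers:", sorted(group.writers))
```

The point of the sketch is the order of the two steps: write access is revoked by the survivors without any cooperation from the departed node, and promotion happens only afterward.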
Various fencing strategies are known. As previously described, fencing strategies may be used to prevent a fenced node from modifying or writing data to a shared device, in combination with a quorum strategy to determine which partition survives and to transfer ownership of the quorum device(s) to the surviving partition. Although design considerations can generally avoid the situation where more than one partition has the same number of quorum votes, this situation can be addressed by a device acquisition “race” to become the owner node for each quorum device. Shared storage devices that are SCSI-2 compliant use a disk reservation system that either grants access to all nodes attached to the disk (when no reservation is in place), or restricts access to a single node that holds the reservation. The disk reservation is enforced by the hardware or firmware of the disk controller and is generally communicated to the controller by the operating system using ioctls. Because only a single node can hold the reservation, the SCSI-2 standard generally only works well in clusters with two nodes. When a cluster member detects that the other node is no longer communicating, it initiates a fencing procedure, which triggers a reservation on all the disks that are shared to prevent the other node from accessing the shared disks. When the fenced node attempts to write to one of the shared disks, it detects the reservation conflict and panics, or shuts down, with a “reservation conflict” message. If applied to a cluster with more than two nodes, all but one node (the node with the reservation) would panic and shut down.
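The single-holder behavior of the SCSI-2 reservation system can be illustrated with a small in-memory model. This is only a sketch: Scsi2Disk, ReservationConflict, and nodes_in_conflict are hypothetical names, no actual SCSI commands or ioctls are issued, and a reservation conflict is represented as a Python exception rather than a node panic.

```python
from typing import Optional


class ReservationConflict(Exception):
    """Raised when a node accesses a disk that is reserved by a different node."""


class Scsi2Disk:
    """Toy model: either no reservation (all nodes may write) or exactly one holder."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None
        self.blocks: dict = {}

    def reserve(self, node: str) -> None:
        if self.holder is not None and self.holder != node:
            raise ReservationConflict(f"{node}: disk is reserved by {self.holder}")
        self.holder = node

    def release(self, node: str) -> None:
        # Only the holder can release its own reservation.
        if self.holder == node:
            self.holder = None

    def write(self, node: str, block: int, data: bytes) -> None:
        if self.holder is not None and self.holder != node:
            raise ReservationConflict(f"{node}: disk is reserved by {self.holder}")
        self.blocks[block] = data


def nodes_in_conflict(disk: Scsi2Disk, nodes: list) -> list:
    """Return the nodes whose write attempt hits a reservation conflict."""
    conflicted = []
    for node in nodes:
        try:
            disk.write(node, 0, b"heartbeat")
        except ReservationConflict:
            conflicted.append(node)
    return conflicted


if __name__ == "__main__":
    disk = Scsi2Disk()
    nodes = ["node1", "node2", "node3"]
    print("before fencing:", nodes_in_conflict(disk, nodes))
    disk.reserve("node1")  # node1 fences the disk
    print("after fencing:", nodes_in_conflict(disk, nodes))
```

With three nodes attached, a reservation by one node places both of the others in conflict, which mirrors the limitation noted above for clusters with more than two nodes.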
The SCSI-3 standard was developed to overcome various shortcomings of the SCSI-2 reservation/release approach. In particular, SCSI-3 adds feature enhancements that facilitate use of SCSI-3 compliant storage devices as quorum devices. Similar to the SCSI-2 standard, the fencing features afforded by SCSI-3 are invoked by the operating system using ioctls and implemented by the device controller hardware and/or firmware. Unlike the SCSI-2 reservation/release system, Persistent Group Reservations, or PGRs, allow a host node to make a disk reservation that is persistent across power failures and bus resets. In addition, as their name suggests, PGRs allow a group of nodes in a running cluster to have concurrent access to the shared storage device while preventing access by nodes not in the cluster. While this implementation is suitable for cluster applications utilizing fully compliant SCSI-3 devices, PGR is not implemented on some earlier SCSI-3 devices, or on any SCSI-2 devices. Accordingly, algorithms utilizing PGR features, including the previously discussed quorum algorithms, may be inoperable or unreliable with such devices.
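The group access behavior of PGRs can be contrasted with the single-holder model using a similarly simplified sketch. The PgrDisk class and its register, preempt, and write methods are hypothetical names for illustration; actual SCSI-3 persistent reservations involve reservation types and command details that are not modeled here, nor is persistence across power failures and bus resets.

```python
class PgrDisk:
    """Toy model of group reservation: every registered key may write concurrently."""

    def __init__(self) -> None:
        self.registered_keys: set = set()
        self.blocks: dict = {}

    def register(self, key: str) -> None:
        """A cluster member registers its key with the device."""
        self.registered_keys.add(key)

    def preempt(self, caller_key: str, victim_key: str) -> None:
        """A registered member removes the key of a departed node, fencing it."""
        if caller_key not in self.registered_keys:
            raise PermissionError(f"{caller_key} is not registered")
        self.registered_keys.discard(victim_key)

    def write(self, key: str, block: int, data: bytes) -> None:
        if key not in self.registered_keys:
            raise PermissionError(f"{key} is not registered")
        self.blocks[block] = data


if __name__ == "__main__":
    disk = PgrDisk()
    for key in ("node1", "node2", "node3"):
        disk.register(key)

    disk.write("node2", 0, b"data")   # all three members can write concurrently
    disk.preempt("node1", "node3")    # node3 departs and is fenced by node1

    try:
        disk.write("node3", 1, b"stale")
    except PermissionError as err:
        print("fenced:", err)
    disk.write("node2", 1, b"data")   # remaining members are unaffected
```

Unlike the single-holder model, fencing one node here leaves every other registered member with uninterrupted access.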
One strategy for implementing fencing and quorum features for non-PGR compliant devices emulates persistent group reservation in software by providing a number of primitives to emulate the group reservation functions otherwise implemented by the device hardware and/or firmware, such as described in commonly owned U.S. Pat. No. 6,658,587, the disclosure of which is incorporated by reference in its entirety. While this strategy is acceptable for many applications as it provides for fencing and quorum operations in clusters having more than two nodes, it continues to rely on a reservation/release type strategy that is limited to dual ported storage devices or those that are SCSI-2 compliant. Because SCSI-2 and SCSI-3 reservation related operations are used primarily for cluster applications, which represent a small portion of the storage device market, storage device manufacturers generally do not dedicate significant resources to consistently testing, supporting, and enhancing these features. In addition, some storage devices do not support SCSI-2 or SCSI-3 reservation related operations. In particular, Serial Advanced Technology Attachment (SATA) disks and Solid State Drives (SSDs) typically do not include SCSI-2 or SCSI-3 reservation related operations.