1. Field of the Invention
This invention relates to computer systems and, more particularly, to management and data integrity maintenance of clustered computer systems.
2. Description of the Related Art
Many business organizations and governmental entities rely upon complex mission-critical applications that require substantial computing power and also often access large amounts (e.g., terabytes or more) of data. The amount of computing power required may exceed the capacity of single computer servers, even in environments where multiprocessor servers are employed. Similarly, the amount of data being handled by the applications may exceed the limits of local storage capabilities of single computer servers. As a result, in many such environments, computer servers may be organized as one or more clusters of multiple cooperating nodes, where each node of a given cluster includes a computer server, and where the nodes of a cluster cooperate with each other to perform the data accesses and computations desired. In addition to providing aggregate increased computing and data processing power, clusters may also be configured to support other features desired in enterprise computing environments, such as increased application availability (e.g., using failover mechanisms) and centralized management.
A number of different types of cluster architectures may be implemented, based for example on application needs such as the expected type of computing and storage access patterns, the desired levels of availability, the nature of coupling (e.g., tight vs. loose coupling) or synchronization between different nodes, or on other factors. In general, components of clustering software are run on each node of a cluster to help coordinate the operations of the nodes and to manage cluster configuration. In one cluster organization, called “shared data clustering”, the nodes of a cluster (each of which may correspond to a separate physical server and/or to an instance of a distributed or clustered application) share access to data storage devices, such as disk arrays or collections of independent disks. For example, the shared data storage devices may be accessible to each node of a cluster over a Storage Area Network (SAN) implemented using a combination of Fibre Channel and other storage interconnects such as various forms of SCSI. Shared data clustering may allow each node to access large amounts of data, and may allow data access and processing to continue in the event of one or more node failures—for example, surviving nodes may take up the functions originally performed by the failed nodes, and may access the data previously being accessed by the failed nodes. In general, the servers comprising a shared data cluster, as well as the collection of shared data storage devices, may be heterogeneous. For example, nodes may differ from one another in number of processors, processing power of each processor, processor architectures or vendors, operating systems, etc., and shared storage devices may differ from each other in storage capacity, vendor, protocols supported, reliability and performance levels, etc.
During normal operation, clustering software at a given node of a shared data cluster may be configured to communicate with clustering software at other nodes over a cluster interconnect. A component of the clustering software, which may be termed a cluster monitor in some environments, may be responsible for maintaining cluster membership information and propagating the information to each node, so that each node retains a consistent view of the set of nodes constituting the cluster. In the event of a failure of communication between the nodes of a cluster, for example due to a hardware failure in the interconnect and/or due to a software error, a cluster may become partitioned; that is, instead of all the nodes being able to communicate with each other, each node may be able to communicate only with a particular subset (or none) of the other nodes. Thus, nodes may form isolated pools, where a given node can communicate with other nodes in its own pool, but not with a node in any other pool. Under some circumstances, more than one cluster partition or node may assume that it is the only surviving valid partition or node, and may attempt to access shared data as though no other nodes or partitions remained operational. Such a situation, in which multiple nodes or sets of nodes form independent clusters, may be termed a “split-brain” condition.
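The formation of isolated pools described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that connectivity is symmetric and that a pool is the set of nodes reachable from one another over the links that are still working; the function names `partition_pools` and `split_brain_possible` and the node labels are hypothetical and do not come from any particular clustering product.

```python
from collections import defaultdict


def partition_pools(nodes, working_links):
    """Group nodes into pools (connected components of the working links).

    nodes: iterable of node names.
    working_links: iterable of (a, b) pairs that can still communicate.
    Assumption: links are symmetric and reachability is transitive.
    """
    neighbors = defaultdict(set)
    for a, b in working_links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    pools = []
    seen = set()
    for start in nodes:
        if start in seen:
            continue
        # Depth-first walk collecting every node reachable from `start`.
        pool = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in pool:
                continue
            pool.add(node)
            stack.extend(neighbors[node] - pool)
        seen |= pool
        pools.append(pool)
    return pools


def split_brain_possible(pools):
    """More than one pool means each may believe it is the sole survivor."""
    return len(pools) > 1
```

For example, with four nodes A through D where only the A-B and C-D links survive, `partition_pools` yields two pools, and each pool could then attempt uncoordinated access to shared storage unless some arbitration mechanism intervenes.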
In a split-brain environment, more than one partition or node may attempt to access and update shared storage in an uncoordinated manner, thereby potentially causing data corruption. A number of approaches to avoiding data corruption caused by split-brain have been attempted, such as quorum-disk based protocols and/or reservation-based protocols (e.g., using SCSI-3 persistent reservation mechanisms, based on a particular version of the Small Computer System Interface (SCSI) protocol). Reservation-based protocols may require each shared storage device to support an identical version of a storage protocol, which may not be a practical requirement in many heterogeneous storage environments. In quorum-disk protocols, a particular disk in the storage network may be designated as the quorum disk. Each node in the cluster may be given one vote in determining a quorum, while the quorum disk (or the node containing the quorum disk) may be given a higher number of votes (for example, as high as (N−1) votes if the cluster is expected to include a maximum of N nodes). The partition that has the maximum number of votes is considered to form a quorum and is allowed to continue to support the applications originally running on the cluster, while remaining partitions may be considered invalid (e.g., the nodes of the remaining partitions may be restarted in some cases). Quorum-disk based protocols may be too strict for small clusters, and are often vulnerable to transient connectivity failures. The storage devices themselves may also end up partitioned between the pools. In cases of host-based mirroring and other data redundancy solutions, this can result in data corruption, for example if different mirrors of a mirrored data storage volume end up in different pools and their data are modified in mutually incompatible ways.
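The vote arithmetic of a quorum-disk protocol can be sketched as follows. This is an illustrative sketch only, under stated assumptions: each node carries one vote, the quorum disk carries (N−1) votes, at most one pool holds the quorum disk at a time, and a tie yields no quorum. The function name `elect_quorum` and its parameters are hypothetical and are not taken from any specific product.

```python
def elect_quorum(pools, disk_pool_index, max_nodes):
    """Return (index of the winning pool or None, list of votes per pool).

    pools: list of sets of node names, one set per partition.
    disk_pool_index: index of the pool holding the quorum disk, or None
        if no pool can reach it (assumption: at most one pool holds it).
    max_nodes: N, the maximum number of nodes expected in the cluster.
    """
    # One vote per node in each pool.
    votes = [len(pool) for pool in pools]

    # The quorum disk contributes (N - 1) votes to the pool that holds it.
    if disk_pool_index is not None:
        votes[disk_pool_index] += max_nodes - 1

    best = max(votes)
    winners = [i for i, v in enumerate(votes) if v == best]

    # Assumption: a tie means no partition forms a quorum.
    winner = winners[0] if len(winners) == 1 else None
    return winner, votes
```

With N = 4 and pools {A} and {B, C, D}, a single node holding the quorum disk collects 1 + 3 = 4 votes and outvotes the three-node pool, which has 3. Under this weighting the pool holding the disk always wins, which also shows how a transient loss of access to the quorum disk can change the outcome.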