Distributed computing systems are an increasingly important part of research, governmental, and enterprise computing systems. Among the advantages of such computing systems are their ability to handle a variety of different computing scenarios including large computational problems, high volume data processing situations, and high availability situations. For applications that require the computer system to be highly available, e.g., the ability to maintain the system while still providing services to system users, a cluster of computer systems is a useful implementation of the distributed computing model. In the most general sense, a cluster is a distributed computer system that works together as a single entity to cooperatively provide processing power and mass storage resources. With a cluster, the processing load of the computer system is typically spread over more than one computer, thereby eliminating single points of failure. Consequently, programs executing on the cluster can continue to function despite a problem with one computer in the cluster. In another example, one or more computers of the cluster can be ready for use in the event that another computer in the cluster fails. While each computer in a cluster typically executes an independent instance of an operating system, additional clustering software is executed on each computer in the cluster to facilitate communication and desired cluster behavior.
FIG. 1 illustrates a simplified example of a cluster 100. The members of the cluster include Server A 140 and Server B 145. As members of cluster 100, servers 140 and 145 are often referred to as “hosts” or “nodes.” Thus, a node in a computer cluster is typically an individual computer system having some or all of the illustrated software and hardware components, as is well known in the art. FIG. 5 (described below) illustrates some of the features common to cluster nodes. Another common feature of a cluster is the ability of the nodes to exchange data. In the example of FIG. 1, servers 140 and 145 can exchange data over network 120, typically a local area network (LAN), e.g., an enterprise-wide intranet, or a wide area network (WAN) such as the Internet. Additionally, network 120 provides a communication path for various client computer systems 110 to communicate with servers 140 and 145. In addition to network 120, servers 140 and 145 can communicate with each other over private network 130. As shown, private network 130 is only accessible by cluster nodes, i.e., Server A 140 and Server B 145. To support the high availability of cluster 100, private network 130 typically includes redundancy such as two network paths instead of one. Private network 130 is used by the nodes for cluster service message passing including, for example, the exchange of so-called “heartbeat” signals indicating that each node is currently available to the cluster and functioning properly. Similar functions can be implemented using a public network.
Other elements of cluster 100 include storage area network (SAN) 150, SAN switch 160, and storage devices such as tape drive 170, storage array 180, and optical drive 190. These devices are examples of the type of storage used in cluster 100. Other storage schemes include the use of shared direct-attached storage (DAS) over shared SCSI buses. As shown in FIG. 1, both servers 140 and 145 are coupled to SAN 150. SAN 150 is conventionally a high-speed network that allows the establishment of direct connections between storage devices 170, 180, and 190 and servers 140 and 145. Thus, SAN 150 is shared between the servers and allows for the sharing of storage devices between the servers to provide greater availability and reliability of storage. SAN 150 can be implemented using a variety of different technologies including fibre channel arbitrated loop (FCAL), fibre channel switched fabric, IP networks (e.g., iSCSI), InfiniBand, etc.
SAN switch 160, tape drive 170, storage array 180, and optical drive 190 are all examples of shared resources. The most common shared resource in a cluster is some form of shared data resource, such as one or more disk drives. Using a shared data resource gives different nodes in the cluster access to the same data, a feature that is critical for most cluster applications. Although a disk device (and various related devices such as storage array 180) is perhaps the most common example of both a shared resource and a shared data resource, a variety of other types of devices will be well known to those having ordinary skill in the art. Moreover, although servers 140 and 145 are shown connected to storage array 180 through SAN switch 160 and SAN 150, this need not be the case. Shared resources can be directly connected to some or all of the nodes in a cluster, and a cluster need not include a SAN. Alternatively, servers 140 and 145 can be connected to multiple SANs. Additionally, SAN switch 160 can be replaced with a SAN router or a SAN hub.
One well known problem among computer system clusters and other distributed computing systems is the coordination of input/output (I/O) operations on the shared resources. Since multiple nodes have access to the same data resources, care must be taken to ensure that data is not corrupted, e.g., because of uncoordinated write operations to the same logical or physical portions of a storage device or read operations that do not present data reflecting the most recent updates.
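By way of illustration only, the following minimal Python sketch shows how an uncoordinated read-modify-write sequence by two nodes can lose an update, and how serializing access avoids the loss. The names SharedBlock, uncoordinated_increments, and coordinated_increments are hypothetical and chosen for this example; the shared storage device is modeled as a single in-memory integer rather than an actual disk.

```python
class SharedBlock:
    """Hypothetical stand-in for one block of a shared storage device."""

    def __init__(self, value=0):
        self.value = value


def uncoordinated_increments(block):
    """Two nodes each add 1, but both read before either one writes."""
    read_by_a = block.value       # node A reads the block
    read_by_b = block.value       # node B reads the same, now stale, value
    block.value = read_by_a + 1   # node A writes its result
    block.value = read_by_b + 1   # node B overwrites node A's update
    return block.value


def coordinated_increments(block):
    """Each node completes its read-modify-write before the next begins."""
    for _node in ("A", "B"):
        current = block.value     # read while holding exclusive access
        block.value = current + 1
    return block.value


lost = uncoordinated_increments(SharedBlock(0))   # one update is lost
kept = coordinated_increments(SharedBlock(0))     # both updates survive
```

In this sketch the uncoordinated sequence leaves the block holding 1 although two increments were issued, whereas the coordinated sequence leaves it holding 2.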
A variety of software mechanisms, as illustrated in FIG. 1, are employed to both enable clustering functionality and prevent data corruption. A cluster volume manager virtualizes shared storage so as to present a consistent view of shared storage, typically in a logical format such as one or more volumes, to all nodes of the cluster. Additionally, a cluster volume manager allows an administrator to configure and reconfigure shared storage. In some implementations, this reconfiguration can be accomplished without interrupting applications' access to the storage. A cluster monitor regularly checks the status or “health” of each node in the cluster to quickly and reliably determine when a node stops functioning (or stops functioning properly) and inform the remaining nodes so that they can take appropriate action. In some embodiments, a cluster messaging service, which can be a part of the cluster monitor and/or a separate software or hardware system, exists to quickly and reliably communicate cluster-critical information among the nodes in a secure manner. Finally, a cluster locking mechanism provides distributed locks that are used by instances of a cluster application to achieve proper coordination. In some embodiments, this is achieved through the use of a formalized distributed lock manager. In still other embodiments, the lock management is implemented in an ad hoc fashion using the messaging services to communicate and coordinate the state. These software tools operate in conjunction with applications, database management systems, file systems, operating systems, etc., to provide distributed clustering functionality.
In one approach to I/O coordination, one node is elected as master of all the shared storage and the remaining nodes are slaves. The master node can typically change disk configurations and maintains control over disk areas used for transaction logs. The master node also reads volume management metadata from all of the disks and maintains the mapping between each logical block of the volume and one or more physical blocks of the disks. Slave nodes must obtain copies of this volume management metadata in order to have knowledge of the volume organization. Moreover, if there are changes to the volume configuration, those changes must be communicated to all of the slaves using, for example, a messaging protocol and/or a system of shared and exclusive locks on the volume management metadata.
In clustering systems and other distributed computing environments where changes to volume configuration occur frequently, the added system resource overhead needed to make all nodes aware of the changes can be burdensome. Accordingly, it is desirable to have a more scalable and flexible scheme for performing I/O operations on shared resources in a clustering environment.
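As one possible illustration of the health checking performed by a cluster monitor, the following Python sketch records the time of each node's most recent heartbeat and reports any node whose heartbeat is older than a timeout. The class name HeartbeatMonitor, its methods, and the three-second timeout are assumptions made for this example and do not describe any particular clustering product; timestamps are passed in explicitly so that the behavior is deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical monitor that flags nodes whose heartbeats have lapsed."""

    def __init__(self, timeout):
        self.timeout = timeout    # seconds of silence tolerated per node
        self.last_seen = {}       # node name -> time of latest heartbeat

    def record(self, node, now):
        """Note that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Return, in sorted order, nodes silent for longer than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("Server A", now=0.0)
monitor.record("Server B", now=0.0)
monitor.record("Server A", now=2.0)         # Server B sends nothing further
suspects = monitor.failed_nodes(now=4.0)    # only Server B has lapsed
```

A practical monitor would additionally inform the remaining nodes of the failure, typically over redundant network paths; that step is omitted here.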
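The master/slave arrangement described above can be sketched, under simplifying assumptions, as follows. The classes MasterNode and SlaveNode, the in-memory dictionary used as the volume map, and the direct method call standing in for a messaging protocol are all hypothetical and serve only to show that each configuration change is delivered to every slave.

```python
class SlaveNode:
    """Hypothetical slave holding a copy of the volume management metadata."""

    def __init__(self):
        self.volume_map = {}
        self.version = 0

    def receive(self, version, volume_map):
        # Keep a private copy so later changes on the master do not leak in.
        self.volume_map = {lb: list(pbs) for lb, pbs in volume_map.items()}
        self.version = version

    def resolve(self, logical_block):
        """Translate a logical block into its physical block locations."""
        return self.volume_map[logical_block]


class MasterNode:
    """Hypothetical master that owns the logical-to-physical mapping."""

    def __init__(self, slaves):
        self.slaves = slaves
        self.volume_map = {}      # logical block -> [(disk, physical block)]
        self.version = 0
        self.messages_sent = 0

    def remap(self, logical_block, physical_blocks):
        """Change the configuration and push the new metadata to all slaves."""
        self.volume_map[logical_block] = list(physical_blocks)
        self.version += 1
        for slave in self.slaves:
            slave.receive(self.version, self.volume_map)
            self.messages_sent += 1


slaves = [SlaveNode() for _ in range(3)]
master = MasterNode(slaves)
master.remap(0, [("disk1", 100), ("disk2", 100)])   # mirrored block
master.remap(1, [("disk1", 101)])
```

In this sketch, two configuration changes in a cluster with three slaves result in six update messages, so the messaging cost grows with the product of the number of changes and the number of slaves.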