1. Field of the Invention
The present invention relates generally to distributed computer systems, and more particularly to a system and method that enables the emulation of Persistent Group Reservations, or PGRs, on non-PGR compliant shared disks to enable the disk's utilization in a system which implements a PGR-reliant algorithm. One such algorithm enables a non-PGR compliant shared disk to be used as a quorum disk supporting highly available clustering software.
2. Related Art
As computer networks are increasingly used to link computer systems together, distributed operating systems have been developed to control interactions between computer systems across a computer network. Some distributed operating systems allow client computer systems to access resources on server computer systems. For example, a client computer system may be able to access information contained in a database on a server computer system. When the server fails, it is desirable for the distributed operating system to automatically recover from this failure. Distributed computer systems with distributed operating systems possessing an ability to recover from such server failures are referred to as “highly available” systems. High availability is provided by a number of commercially available products including Sun™ Cluster from Sun™ Microsystems, Palo Alto, Calif.
Distributed computing systems, such as clusters, may include two or more nodes, which may be employed to perform a computing task. Generally speaking, a node is a group of circuitry designed to perform one or more computing tasks. A node may include one or more processors, a memory and interface circuitry. Generally speaking, a cluster is a group of two or more nodes that have the capability of exchanging data between nodes. A particular computing task may be performed upon one node, while other nodes perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among the nodes to decrease the time required to perform the computing task as a whole. Generally speaking, a processor is a device configured to perform an operation upon one or more operands to produce a result. The operations may be performed in response to instructions executed by the processor.
Nodes within a cluster may have one or more storage devices coupled to the nodes. Generally speaking, a storage device is a persistent device capable of storing large amounts of data. For example, a storage device may be a magnetic storage device such as a disk device, or optical storage device such as a compact disc device. Although a disk device is only one example of a storage device, the term “disk” may be used interchangeably with “storage device” throughout this specification. Nodes physically connected to a storage device may access the storage device directly. A storage device may be physically connected to one or more nodes of a cluster, but the storage device need not necessarily be physically connected to all the nodes of a cluster. The nodes that are not physically connected to a storage device may not access that storage device directly. In some clusters, a node not physically connected to a storage device may indirectly access the storage device via a data communication link connecting the nodes.
One of the aims of a highly available (HA) system is to minimize the impact of individual components' failures on system availability. An example of such a failure is a communications loss between some of the nodes of a distributed system. Referring now to FIG. 1, an exemplar cluster is illustrated. In this example, the cluster, 1, comprises four nodes, 102, 104, 106 and 108. The four nodes of the system share a disk, 110. In the exemplar herein presented, nodes 102 through 108 have access to disk 110 by means of paths 120 through 126, respectively. Accordingly, this exemplar disk can be said to be “4-ported”. As previously discussed, access to disk 110 may be by means of physical connection, data communication link or other disk access methodologies well-known to those having ordinary skill in the art.
The nodes in the exemplar system are connected by means of data communication links 112, 114, 116 and 118. In the event that data communications links 112 and 114 fail, node 106 will no longer be capable of communication with the remaining nodes in the system. It will be appreciated from study of the figure, however, that node 106 retains its communications with shared disk 110 by means of path 124. This gives rise to a condition known as “split brain”.
Split brain refers to a cluster breaking up into multiple sub-clusters, or to the formation of multiple sub-clusters without knowledge of one another. This problem occurs due to communication failures between the nodes in the cluster, and often results in data corruption. One methodology to ensure that a distributed system continues to operate with the greatest number of available resources, while excluding the potential for data corruption occasioned by split brain, is through the use of a quorum algorithm with a majority vote count. Majority vote count is achieved when a quorum algorithm detects a vote count greater than half the total number of votes. In a system with n nodes attached to the quorum device, each node is assigned one vote, and the system's quorum device is assigned n−1 votes, as will be later explained.
To explain how a majority vote count quorum algorithm operates, consider the four-node cluster illustrated in FIG. 1, and assume no votes are assigned to a quorum device. Assume a communications failure occurs between node 106 and the other nodes in the cluster. Since each node has one vote, and nodes 102, 104 and 108 are operating properly and are in communication with one another, a simple quorum algorithm would count one vote for each of these nodes, against one vote for node 106. Since 3 > 1, the subcluster comprising nodes 102, 104 and 108 attains majority vote count and this simplified quorum algorithm excludes node 106 from accessing shared disk 110.
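The vote count in this example can be illustrated with a minimal sketch. The function name `has_majority` and the dictionary layout are assumptions made for illustration; only the node numbers, the one-vote-per-node rule and the "greater than half the total" test come from the text.

```python
def has_majority(subcluster_votes: int, total_votes: int) -> bool:
    """Return True when the subcluster holds strictly more than half of all votes."""
    return 2 * subcluster_votes > total_votes


# Four-node cluster of FIG. 1: one vote per node, no quorum device votes.
node_votes = {102: 1, 104: 1, 106: 1, 108: 1}
total_votes = sum(node_votes.values())

connected_side = [102, 104, 108]  # still in communication with one another
isolated_side = [106]             # lost its links to the other nodes

connected_ok = has_majority(sum(node_votes[n] for n in connected_side), total_votes)
isolated_ok = has_majority(sum(node_votes[n] for n in isolated_side), total_votes)
```

Under these assumptions the three-node subcluster passes the test and node 106 does not, matching the outcome described above.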
The simplified example previously discussed becomes somewhat more complicated when equal numbers of nodes are separated from one another. Again considering the example shown in FIG. 1, consider the loss of communications links 114 and 118. In this case, nodes 102 and 108 are in communication with one another, as are nodes 104 and 106, but no communications exist between these pairs. In this example, communications are still intact between each of the nodes and shared disk 110. It will be appreciated, however, that 2 is not greater than 2, and therefore neither subcluster attains majority vote count and this relatively simple quorum algorithm fails.
A quorum device, or QD, is a hardware device shared by two or more nodes within the cluster that contributes votes used to establish a quorum for the cluster to run. The cluster can operate only when a quorum of votes, i.e. a majority of votes as previously explained, is available. Quorum devices are commonly, but not necessarily, shared disks. Most majority vote count quorum algorithms assign the quorum device a number of votes which is one less than the number of connected quorum device ports. In the previously discussed example of a 4-node cluster (n=4) in which each node is ported to the quorum device, that quorum device would be given n−1, or 3, votes, although other methods of assigning a number of votes to the quorum device may be used.
The pair of nodes within the cluster that, through the quorum algorithm, first take ownership of the disk cause the algorithm to exclude the other pair. In this example, the two nodes which first take ownership of disk 110 following the fractioning of the cluster, for instance a subcluster comprising nodes 102 and 108, cause the algorithm to exclude the other subcluster comprising nodes 104 and 106 from accessing the shared disk until the system can be restored. This is true since the vote count for the first two nodes accessing the disk plus the three votes for the quorum disk itself is greater than the vote count for the two nodes which later attempt to access the shared disk, or 2+3 > 2. A quorum device that allows one or more nodes to take ownership of the device and blocks out other nodes, as previously discussed, is sometimes referred to as a mutex, or mutual exclusion device.
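The tie-breaking effect of the quorum device's votes can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the quorum device is given n−1 votes as in the example, and a subcluster is taken to have quorum when its votes exceed half of all votes (node votes plus quorum device votes).

```python
def quorum_device_votes(ported_nodes: int) -> int:
    """Votes given to the quorum device: one less than the number of ported nodes."""
    return ported_nodes - 1


def has_quorum(member_nodes: int, owns_quorum_device: bool, node_count: int) -> bool:
    """Majority test over node votes plus quorum device votes (one vote per node)."""
    qd_votes = quorum_device_votes(node_count)
    total_votes = node_count + qd_votes
    votes = member_nodes + (qd_votes if owns_quorum_device else 0)
    return 2 * votes > total_votes


# Four-node cluster split 2/2: the pair that first takes ownership of the disk wins.
owning_pair_ok = has_quorum(member_nodes=2, owns_quorum_device=True, node_count=4)
other_pair_ok = has_quorum(member_nodes=2, owns_quorum_device=False, node_count=4)
```

With four nodes the quorum device holds 3 of 7 votes, so the owning pair counts 5 votes and the other pair counts 2. The same function applies to a two-node cluster, where the device's single vote decides which node may continue.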
Where a cluster comprises only two nodes, as shown in FIG. 2, a quorum device, such as shared disk 110, is absolutely necessary. This is true because in the event of the failure of communications link 118, absent such a quorum device, neither node can ever achieve a majority, and hence is incapable of forming a valid cluster. Accordingly, if a cluster were implemented with only two nodes and no quorum device, it will be appreciated that the failure of either node will cause the system to fail.
SCSI, the Small Computer System Interface, is a set of evolving ANSI standard electronic interfaces that allow personal computers to communicate with peripheral hardware such as disk drives, tape drives, CD-ROM drives, printers, and scanners faster and more flexibly than previous interfaces. There are several versions of SCSI, and the older SCSI-2 standards are being replaced by the newer, more fully featured SCSI-3 standards.
The SCSI-3 standard adds two significant enhancements to the SCSI-2 standard that allow SCSI-3 disks to be used as convenient quorum devices. These features are referred to as the Persistent Group Reservation features, or PGRs, of SCSI-3. First, SCSI-3 allows a host node to make a disk reservation that is persistent across power failures and bus resets. Second, group reservations are permitted, allowing all nodes in a running cluster to have concurrent access to the disk while disallowing access to nodes not in the cluster. This persistence property allows SCSI-3 devices to be used as mutex, or mutual exclusion, devices, while the group reservation property allows the disk to be managed by volume managers. Accordingly, the quorum disk can be used for storing customer data. SCSI-3 PGRs are implemented in the device firmware.
The PGR quorum disk implementation provides five primitives to effect the quorum algorithm. They are:
1. Storing a node's reservation key on the device;
2. Reading all keys on the device;
3. Placing a group reservation for all registered nodes;
4. Reading the group reservation; and
5. Preempting the reservation key of another node.
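The five primitives listed above can be pictured with a toy in-memory model. The class `ToyPgrDevice`, its method names and the `may_access` check are hypothetical and are not the SCSI-3 command set; the sketch assumes only what the text states, namely that keys are 64-bit values and that a group reservation admits registered nodes while excluding others.

```python
class ToyPgrDevice:
    """Toy model of a device offering the five primitives (not real SCSI-3 commands)."""

    def __init__(self):
        self._keys = {}         # node id -> 64-bit reservation key
        self._reserved = False  # True once a group reservation is placed

    def register_key(self, node_id, key):
        """Primitive 1: store a node's reservation key on the device."""
        if not 0 <= key < 2**64:
            raise ValueError("reservation key must fit in 64 bits")
        self._keys[node_id] = key

    def read_keys(self):
        """Primitive 2: read all keys on the device."""
        return sorted(self._keys.values())

    def reserve_group(self, node_id):
        """Primitive 3: place a group reservation for all registered nodes."""
        if node_id not in self._keys:
            raise PermissionError("only a registered node may reserve")
        self._reserved = True

    def read_reservation(self):
        """Primitive 4: read the group reservation."""
        return self._reserved

    def preempt(self, node_id, victim_key):
        """Primitive 5: remove another node's reservation key."""
        if node_id not in self._keys:
            raise PermissionError("only a registered node may preempt")
        self._keys = {n: k for n, k in self._keys.items() if k != victim_key}

    def may_access(self, node_id):
        """Assumed access rule: once reserved, only registered nodes get in."""
        return not self._reserved or node_id in self._keys


device = ToyPgrDevice()
for node in (102, 104, 106, 108):
    device.register_key(node, 0xA000 + node)
device.reserve_group(102)
device.preempt(102, 0xA000 + 104)  # subcluster {102, 108} ejects 104 and 106
device.preempt(102, 0xA000 + 106)
```

After the two preemptions, only the keys of nodes 102 and 108 remain, so in this model nodes 104 and 106 are denied access until they register again.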
PGRs utilize a 64-bit reservation key. At least one quorum algorithm has been implemented utilizing persistent group reservation, or PGR. PGR enables preempting and other operations that are required to ensure that only one cluster has access to a shared disk device in the case of split brain. While this implementation is perfectly acceptable for clusters utilizing later SCSI-3 devices, PGR is not implemented on some earlier SCSI-3 devices, or on any SCSI-2 devices. Accordingly, algorithms utilizing PGR features, including the previously discussed quorum algorithms, are currently inoperable with these older device types.
The implementation of any algorithm relying on PGR features, again including quorum algorithms, is readily attainable for systems implementing full-featured SCSI-3 quorum devices, or later versions of those devices. However, implementing such algorithms would require owners of systems utilizing earlier drive types to upgrade all their shared storage devices to devices implementing the newer standard. This of course presents significant cost and service interruption issues for users of clustered systems. The current alternative is to forgo the high availability features of clustering which, in many cases, were the deciding features for users to implement clustered systems.
What is needed then is a methodology which at once enables users of non-PGR devices to implement algorithms, including quorum algorithms, that rely on PGR features, for instance SCSI-3 PGR features. What would be even more useful would be a methodology that would not require new algorithms, or require significant re-programming of the software implementing algorithms which rely on PGR features.
The present invention enables the emulation of PGRs on non-PGR compliant shared disks to enable the users of non-PGR devices to implement algorithms, including quorum algorithms, based on PGR features. This in turn enables the implementation of algorithms, including quorum algorithms, based on PGRs substantially without major re-writing of the software which implements those algorithms. Where PGRs are implemented in the device firmware, the present invention emulates these PGRs by writing emulation data that emulates those PGRs on a portion of the device itself. In the case where the device is a magnetically recordable device, for instance a hard disk, this emulation data is written to a portion of the recordable media itself. It will be appreciated by those having skill in the art that while the discussion of the features and advantages of the invention taught herein centers on various magnetically recordable and readable devices, these features and advantages are applicable to a wide range of data storage and memory devices. By way of illustration but not limitation, such devices include: semiconductor memory devices such as Flash memory, RAM, ROM, EEPROM and the like; magnetic storage devices including magnetic core memory devices, magnetic tape, floppy disks, hard disks, ZIP™ drives and the like; optical storage devices including CD-ROM, DVD and the like; and mechanical storage devices including Hollerith cards, punched paper tape and the like. The present invention specifically contemplates all such implementations.
To effect this emulation, each host node stores certain host-specific information on its portion of the disk. Additionally, certain group reservation information is also stored on a separate portion of the disk. The present invention accomplishes PGR emulation, or PGRE, by storing this host- and reservation-specific information on a reserved portion of the disk and using this data to emulate the steps of certain PGR primitives.
It will be recalled that the PGRs implementing a quorum disk provide five primitives to effect the quorum algorithm. These include storing a node's reservation key on the device, reading all keys on the device, preempting the reservation key of another node, placing a group reservation for all registered nodes, and reading the group reservation information.
PGREs emulating the storing and reading of reservation keys, as well as the placing and reading of group reservations, are effected by reading and/or writing the required information from and/or to the disk itself. The emulation of the PGR primitive whereby one subcluster preempts the placement, by another subcluster, of the other subcluster's reservation key on the device is less straightforward.
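The simpler emulated primitives, which only read and write data in the reserved portion of the disk, might be sketched as follows. The layout is entirely hypothetical: a `bytearray` stands in for the reserved area, each node owns one 8-byte slot for its key, a separate 8-byte slot holds the group reservation, and a key value of 0 is assumed to mean "empty". The specification does not define the actual on-disk format.

```python
import struct

SLOT_SIZE = 8                               # one 64-bit value per slot (assumed layout)
MAX_NODES = 4                               # assumed number of node slots
RESERVATION_OFFSET = SLOT_SIZE * MAX_NODES  # group reservation follows the key slots
RESERVED_AREA_SIZE = RESERVATION_OFFSET + SLOT_SIZE
NO_KEY = 0                                  # assumption: 0 marks an empty slot


class EmulatedPgrArea:
    """Reserved disk area simulated in memory; each node writes only its own slot."""

    def __init__(self):
        self.disk = bytearray(RESERVED_AREA_SIZE)

    def _read_u64(self, offset):
        return struct.unpack_from(">Q", self.disk, offset)[0]

    def _write_u64(self, offset, value):
        struct.pack_into(">Q", self.disk, offset, value)

    def store_key(self, slot, key):
        """Emulate storing a node's reservation key in that node's slot."""
        self._write_u64(slot * SLOT_SIZE, key)

    def read_keys(self):
        """Emulate reading all keys by scanning every node slot."""
        keys = (self._read_u64(i * SLOT_SIZE) for i in range(MAX_NODES))
        return [k for k in keys if k != NO_KEY]

    def place_reservation(self, owner_key):
        """Emulate placing the group reservation in its separate slot."""
        self._write_u64(RESERVATION_OFFSET, owner_key)

    def read_reservation(self):
        """Emulate reading the group reservation (NO_KEY if none is placed)."""
        return self._read_u64(RESERVATION_OFFSET)


area = EmulatedPgrArea()
area.store_key(0, 0xA102)       # slot 0: hypothetical key for node 102
area.store_key(3, 0xA108)       # slot 3: hypothetical key for node 108
area.place_reservation(0xA102)
```

Keeping one slot per node means no two nodes ever write the same bytes for these primitives, which is why they need no coordination beyond ordinary reads and writes in this sketch.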
The PGR preempt primitive executes a set of steps as a single atomic action, the mutual exclusion necessary for this primitive being done internally by the device. To emulate this primitive, the present invention uses a mutual exclusion algorithm. One embodiment utilizes a novel mutual exclusion algorithm suggested by Lamport's algorithm, where the disk serves in place of the “shared memory” taught by Lamport. The variables needed by the novel mutual exclusion algorithm taught herein are also stored in the reserved portion of the disk previously discussed.
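The idea of running a Lamport-style mutual exclusion algorithm over shared storage can be illustrated with a small simulation. This sketch is not the novel algorithm of the invention, which is not detailed in this section; it uses Lamport's well-known bakery algorithm as a stand-in. Python lists stand in for per-node records in the reserved disk area, threads stand in for nodes, and the sketch assumes that each record is read and written atomically and that no node fails while holding the lock.

```python
import threading
import time


class DiskBakeryLock:
    """Lamport's bakery algorithm; the lists simulate per-node on-disk records."""

    def __init__(self, node_count):
        self.n = node_count
        self.choosing = [False] * node_count  # record: node i is picking a number
        self.number = [0] * node_count        # record: node i's ticket, 0 = not waiting

    def acquire(self, i):
        # Each node writes only its own records and reads everyone else's.
        self.choosing[i] = True
        self.number[i] = 1 + max(self.number)
        self.choosing[i] = False
        for j in range(self.n):
            if j == i:
                continue
            while self.choosing[j]:
                time.sleep(0)  # yield while node j is still picking a ticket
            while self.number[j] != 0 and (self.number[j], j) < (self.number[i], i):
                time.sleep(0)  # node j holds an earlier ticket, so it goes first

    def release(self, i):
        self.number[i] = 0


def run_demo(node_count=3, rounds=20):
    """Each simulated node performs a non-atomic read-modify-write under the lock."""
    lock = DiskBakeryLock(node_count)
    shared = {"value": 0}

    def node(i):
        for _ in range(rounds):
            lock.acquire(i)
            current = shared["value"]
            time.sleep(0)  # widen the race window the lock must protect
            shared["value"] = current + 1
            lock.release(i)

    threads = [threading.Thread(target=node, args=(i,)) for i in range(node_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["value"]
```

In this simulation the guarded read-modify-write plays the role of the multi-step preempt sequence: because only one simulated node is inside it at a time, no increment is lost. A real disk-based variant would also have to cope with node crashes and I/O delays, which this sketch ignores.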
It should be noted that, while the previously presented background discussion focused on some of the problems attendant upon nodes within a distributed system, the principles of the present invention are not limited in applicability to such nodes or workstations. The principles enumerated herein are capable of implementation to solve a wide variety of computational problems, and the present invention specifically contemplates all such implementations.
These and other advantages of the present invention will become apparent upon reading the following detailed descriptions and studying the various figures of the Drawing.