Storage area networks and server clustering technology allow multiple host computers to connect to the same array of storage devices, typically disks. However, such an arrangement can lead to serious problems when the disk is improperly accessed. For example, simultaneous write and read accesses by different hosts may corrupt a disk's data, potentially leading to very serious consequences.
One solution to this problem of protecting a shared storage device (or devices) is to give exclusive access to the device to one computer at a time. For example, in U.S. patent application Ser. No. 08/963,050, entitled “Method and System for Quorum Resource Arbitration in a Server Cluster,” assigned to the same assignee as the present invention, cluster nodes arbitrate for exclusive ownership of a quorum resource, which ensures that only one unique incarnation of a cluster can exist at any given time, since only one node can exclusively possess the quorum resource. As another example, in U.S. patent application Ser. No. 09/277,450, entitled “Method and System for Consistent Cluster Operational Data in a Server Cluster Using a Quorum of Replicas,” assigned to the same assignee as the present invention, the quorum resource is not limited to a single device, but rather comprises multiple replica members. A cluster may be formed and continue to operate as long as one server node possesses a quorum (majority) of the replica members.
In both of the above examples, the node that initially obtains ownership of the quorum resource forms and represents the cluster, and access to the quorum resource (e.g., reads and writes to the disk or disks) is through the owning node. This protects against data corruption.
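The majority rule for replica members mentioned above can be illustrated with a minimal sketch. The function name `has_quorum` and its parameters are hypothetical and chosen only for illustration; they do not come from the referenced applications.

```python
def has_quorum(owned_replicas: int, total_replicas: int) -> bool:
    """Return True if a node owns a strict majority of the replica members.

    Hypothetical helper: a strict majority means more than half, so two
    nodes can never both hold a quorum of the same replica set.
    """
    return owned_replicas > total_replicas // 2


# Example: with three replica members, two are enough; with four, three are needed.
print(has_quorum(2, 3))
print(has_quorum(2, 4))
```

Because the comparison is strict, an even split (for example, two of four replicas) does not count as a quorum, which is what prevents two cluster incarnations from existing at once.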
However, in clustering and distributed system technology, a problem sometimes arises when nodes lose their ability to communicate with other nodes, e.g., due to the crash of a node, or some other type of failure such as a network communication failure. As a result, the nodes that do not own the resource are configured to challenge for resource ownership in case the owning node has failed. To this end, an appropriate arbitration process on each node enables another node to challenge for ownership of each owned resource by temporarily breaking the owning node's exclusive reservation (e.g., by SCSI bus reset or bus device reset commands), delaying, and then requesting an exclusive reservation. During the delay, the owning node is given an opportunity to defend and persist its exclusive reservation, whereby if the node is operating correctly, it re-establishes its exclusive reservation. If the owning node is not able to re-establish its reservation during the delay, the challenging node's request for exclusive access following the delay succeeds, whereby the challenging node becomes the new owner.
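The challenge-and-defend sequence described above can be sketched as a small, deterministic simulation. This is an illustrative model only: the `Disk` class, the `challenge` function, and the tick-based delay are assumptions made for the example, and no real SCSI commands are issued.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Disk:
    """Hypothetical shared disk holding at most one exclusive reservation."""
    holder: Optional[str] = None

    def reserve(self, node: str) -> bool:
        # Succeeds only if the disk is unreserved or already held by this node.
        if self.holder in (None, node):
            self.holder = node
            return True
        return False

    def bus_reset(self) -> None:
        # Models a reset that drops whatever reservation is in place.
        self.holder = None


def challenge(disk: Disk, owner: str, challenger: str,
              owner_alive: bool, delay_ticks: int = 3,
              defend_every: int = 1) -> Optional[str]:
    """Run one challenge round and return the node holding the disk afterwards."""
    disk.bus_reset()  # the challenger breaks the owner's reservation
    for tick in range(1, delay_ticks + 1):
        # During the delay, a healthy owner periodically re-reserves the disk.
        if owner_alive and tick % defend_every == 0:
            disk.reserve(owner)
    disk.reserve(challenger)  # fails if the owner defended in time
    return disk.holder


print(challenge(Disk(holder="A"), "A", "B", owner_alive=True))
print(challenge(Disk(holder="A"), "A", "B", owner_alive=False))
```

In this sketch the owner keeps the disk only if its defense interval (`defend_every`) is no longer than the challenger's delay (`delay_ticks`); between the reset and the next successful `reserve` call, `holder` is `None`, which corresponds to the unreserved state of the resource.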
While the above-described mechanisms are excellent for sets of nodes that implement the arbitration rules, the breaking of the reservation leaves the resource in an unreserved state until the challenger or owner can obtain an exclusive reservation. During that time, the resource is vulnerable to being improperly accessed. Further, a third party computing device may independently break (e.g., for various unrelated purposes) the owning node's exclusive reservation. For example, in a SCSI-2 configuration, a SCSI bus reset command may be used to break the reservation. If a third party computing device initiates a SCSI bus reset or SCSI bus device reset, then the owning node's exclusive reservation is temporarily lost, and access to the disk can be improperly obtained, making the disk vulnerable to simultaneous access, data corruption and so forth.