Clustering is the use of multiple computers, multiple storage devices, and redundant interconnections to form what appears to users as a single, highly available system. A cluster is a shared storage environment in which a collection of these components forms the highly available system.
When a particular component of the cluster fails (e.g. ceases to operate), the functions of that component are assumed by other components within the cluster in a process called “failover”. Some clusters identify component failure by maintaining regular “heartbeat” signals between cluster components. Thus, when a particular component fails to provide a heartbeat signal, the cluster may execute a recovery operation, readjusting the cluster to a configuration that does not include the failed component.
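The heartbeat mechanism described above can be illustrated with a minimal sketch. The class name `HeartbeatMonitor`, the timeout value, and the component names are hypothetical and chosen only for illustration; time is passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each cluster component (illustrative only)."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence tolerated (assumed value)
        self.last_seen = {}         # component name -> time of last heartbeat

    def heartbeat(self, component, now):
        """Record a heartbeat signal received from a component at time `now`."""
        self.last_seen[component] = now

    def failed(self, now):
        """Return the components whose heartbeat is overdue at time `now`."""
        return sorted(c for c, t in self.last_seen.items()
                      if now - t > self.timeout)

    def recover(self, now):
        """Readjust membership to exclude overdue components; return survivors."""
        for component in self.failed(now):
            del self.last_seen[component]
        return sorted(self.last_seen)


monitor = HeartbeatMonitor(timeout=3.0)
monitor.heartbeat("node-a", now=0.0)
monitor.heartbeat("node-b", now=0.0)
monitor.heartbeat("node-a", now=2.0)    # node-b stays silent from here on
suspected = monitor.failed(now=4.0)
members_after_recovery = monitor.recover(now=4.0)
```

Note that the monitor cannot distinguish a component that has stopped from one that is merely delayed; both simply miss the timeout.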
In addition to simply ceasing operation, a cluster component may merely fail to perform some task within a bounded interval. For example, a component may be stopped in a debugging state, failing while in that debugging state to provide a heartbeat signal to the cluster. As another example, a high priority process competing for CPU time can cause an unexpected scheduling delay in a lower priority process, such that the lower priority process appears non-communicative to other components in the cluster. Under these conditions, the cluster may determine that the non-communicative component has failed and, in response, execute a recovery operation.
However, when exiting the debugging state or regaining CPU processing time in the above examples, the component may again communicate with the cluster. Thus, the non-communicative component still may be able to communicate with a storage device in the cluster. For example, a sequence of events may include the step of testing a clock prior to a process performing a particular action. A delay may occur between the test of the clock and the performance of the action. When the delayed process (or non-communicative component) performs the action, this action may be destructive in a way that would not have been possible had the action been performed immediately after the test of the clock. Under these conditions, it is possible to corrupt the storage device.
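The clock-test sequence above may be easier to follow as a short simulation. This is a sketch under stated assumptions: the `FakeClock` class, the `DEADLINE` value, the host names, and the dictionary standing in for the storage device are all hypothetical.

```python
class FakeClock:
    """Manually advanced clock so the sequence of events is reproducible."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
disk = {}           # block number -> data; stands in for the shared storage device
DEADLINE = 10.0     # hypothetical time until which host A believes it may write

# Host A tests the clock: the deadline has not passed, so the write looks safe.
clock.advance(9.0)
check_passed = clock.now < DEADLINE

# Host A is then delayed (debugger stop or scheduling delay) for 5 seconds.
clock.advance(5.0)

# Meanwhile the cluster has declared A failed, and host B has written new data.
disk[0] = "B-new"

# Host A resumes and acts on its stale test result without re-testing the clock.
if check_passed:
    disk[0] = "A-stale"

write_was_late = clock.now >= DEADLINE
```

In this run the test of the clock succeeds, yet the action is performed after the deadline and overwrites the newer data written by host B.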
Specifically, if a particular computer or storage device (i.e. an initiating host) desires to access (e.g. to write to) a particular storage device (i.e. a target device), then the initiating host establishes an interval with (e.g. obtains permission from) the cluster to perform that write operation. Establishing an interval ensures that other hosts in the cluster do not cause corruption by inappropriately interfering with the write operation between the initiating host and the target device. For example, in an asymmetric configuration the initiating host may obtain a lease from a controlling host in the cluster. A lease is an interval corresponding to an amount of time for which the initiating host may access the target device. The cluster maintains awareness that, for the duration of the lease, the initiating host may be accessing the target device. As a result, this approach allows the initiating host to initiate accesses to the target device for the duration of the lease. Similarly, in a symmetric cluster, a quorum interval is often used to define a period of time during which an initiating host may access a target device.
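A lease of the kind described above might be granted along the following lines. This is only a sketch: the `LeaseManager` class, its `acquire` method, the lease duration, and the host and device names are assumptions made for illustration and do not describe any particular cluster product.

```python
class LeaseManager:
    """Controlling host that grants time-bounded leases on target devices."""

    def __init__(self, duration):
        self.duration = duration    # lease length in seconds (assumed value)
        self.leases = {}            # target device -> (holder, expiry time)

    def acquire(self, host, target, now):
        """Grant or renew a lease; return its expiry time, or None if refused."""
        current = self.leases.get(target)
        if current is not None:
            holder, expiry = current
            # Another host still holds an unexpired lease on this target.
            if holder != host and now < expiry:
                return None
        expiry = now + self.duration
        self.leases[target] = (host, expiry)
        return expiry


manager = LeaseManager(duration=10.0)
granted_a = manager.acquire("host-a", "disk-0", now=0.0)    # granted until 10.0
refused_b = manager.acquire("host-b", "disk-0", now=5.0)    # refused: A holds lease
granted_b = manager.acquire("host-b", "disk-0", now=10.0)   # A's lease has expired
```

The lease protects the target device only if the initiating host stops issuing accesses once the lease expires, which is the assumption that a delayed host can violate.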
Another approach, known as a Dead Man Timer, typically involves special hardware. This hardware counts down an interval from an initial value. Periodic communication, e.g. by the initiating host, resets the countdown to the initial value. If the Dead Man Timer counts down to zero, the Dead Man Timer hardware stops operation of the initiating host in a drastic fashion.
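The countdown-and-reset behavior of a Dead Man Timer can be modeled in software as follows. The real mechanism is typically hardware; the `DeadManTimer` class, the tick granularity, and the `on_expire` callback here are hypothetical stand-ins.

```python
class DeadManTimer:
    """Software model of a countdown timer that halts a host on expiry."""

    def __init__(self, initial, on_expire):
        self.initial = initial          # starting count (assumed units: ticks)
        self.remaining = initial
        self.on_expire = on_expire      # stands in for stopping the host
        self.fired = False

    def reset(self):
        """Periodic communication from the host restarts the countdown."""
        if not self.fired:
            self.remaining = self.initial

    def tick(self, count=1):
        """Count down; trigger the expiry action when zero is reached."""
        for _ in range(count):
            if self.fired:
                return
            self.remaining -= 1
            if self.remaining <= 0:
                self.fired = True
                self.on_expire()
                return


halted = []
timer = DeadManTimer(initial=3, on_expire=lambda: halted.append("host-a"))
timer.tick(2)       # 1 tick remaining
timer.reset()       # host checks in; back to 3
timer.tick(2)       # 1 tick remaining, host still running
still_running = not timer.fired
timer.tick(1)       # host missed its check-in; timer reaches zero
```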
Input/Output (I/O) fencing is the term for protecting (i.e. “fencing”) a target disk from potentially corrupting accesses (i.e. “I/O”). For a multiphase I/O operation (e.g. a straight multiphase operation) on, for example, Small Computer System Interface (SCSI) target devices, a write operation from an initiating host has four phases: a write request (Phase 1), a ready to write response (Phase 2), sending the data (Phase 3), and a completion response (Phase 4). A SCSI target device additionally supports a device reset request that provides a (passive) time-based barrier to I/O operations on a target disk. An asserted device reset request, among other things, causes the target device to discard any operations that fall between the receipt of a Phase 1 request and the sending of a Phase 4 response. Operations discarded by the target disk result in an identifiable failure, which is returned in response to the data send in Phase 3. In other words, the use of a SCSI device reset allows the target disk to terminate the current operation under these conditions. Note that a period of time prior to the sending of the Phase 4 response may exist such that an I/O in progress may complete prior to the processing of a device reset received in this period of time. As a result, this device reset may not cause the target device to discard the operation, and so a Phase 4 response is sent.
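The four phases and the effect of a device reset can be sketched as a small state model. The `TargetDevice` class, its method names, and the status strings `"READY"`, `"COMPLETE"`, and `"FAILED"` are hypothetical; they are not actual SCSI commands or status codes.

```python
class TargetDevice:
    """Toy model of a four-phase write with a device reset (not real SCSI)."""

    def __init__(self):
        self.blocks = {}     # block number -> stored data
        self.pending = {}    # initiating host -> block awaiting its data

    def write_request(self, initiator, block):
        """Phase 1 (request) answered by Phase 2 (ready to write)."""
        self.pending[initiator] = block
        return "READY"

    def device_reset(self):
        """Discard every operation that has started but not yet completed."""
        self.pending.clear()

    def send_data(self, initiator, data):
        """Phase 3 (data) answered by Phase 4 (completion) or a failure."""
        if initiator not in self.pending:
            return "FAILED"             # operation was discarded by a reset
        block = self.pending.pop(initiator)
        self.blocks[block] = data
        return "COMPLETE"


device = TargetDevice()

# Uninterrupted write: all four phases succeed.
device.write_request("host-a", block=0)
normal_status = device.send_data("host-a", "v1")

# Reset asserted between Phase 1 and Phase 3: the data send fails.
device.write_request("host-a", block=0)
device.device_reset()
reset_status = device.send_data("host-a", "v2")

# Reset arriving after the data has been accepted does not undo the write.
device.write_request("host-a", block=1)
late_status = device.send_data("host-a", "v3")
device.device_reset()
```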
Unfortunately, when an initiating host appears to the cluster to have failed, but is actually still able to communicate with the target device, it is possible to corrupt the target disk when using a SCSI device reset.
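One way such corruption could arise is sketched below, on the assumption that a device reset acts only on operations already in progress and places no restriction on operations started afterwards. The `TargetDisk` class and all names in it are hypothetical.

```python
class TargetDisk:
    """Toy disk on which a reset discards only the writes currently in flight."""

    def __init__(self):
        self.blocks = {}
        self.in_flight = set()      # hosts with a write started but not finished

    def begin_write(self, host):
        self.in_flight.add(host)

    def device_reset(self):
        self.in_flight.clear()

    def finish_write(self, host, block, data):
        if host not in self.in_flight:
            return False            # write was discarded by an earlier reset
        self.in_flight.remove(host)
        self.blocks[block] = data
        return True


disk = TargetDisk()

# Host A starts a write, then appears to the cluster to have failed.
disk.begin_write("host-a")
disk.device_reset()
discarded = not disk.finish_write("host-a", 0, "A-old")

# Host B takes over and writes current data.
disk.begin_write("host-b")
disk.finish_write("host-b", 0, "B-new")

# Host A was only delayed; it starts a fresh write after the reset.
disk.begin_write("host-a")
late_write_accepted = disk.finish_write("host-a", 0, "A-stale")
```

In this model the reset discards the write that was in progress but does nothing to stop the later write, so the stale data replaces the data written by host B.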
SCSI-3 Persistent Group Reservation (PGR) is a standard technique of I/O fencing supported by some devices that is used to minimize corruption of shared storage devices. In SCSI-3 PGR, a persistent reservation is placed on a shared storage device. This reservation grants access to a specified set of initiating hosts while at the same time denying access to other initiating hosts. Thus, SCSI-3 PGR is a mechanism embedded in a target disk that provides a complete I/O fence. However, SCSI-3 PGR is not uniformly implemented in storage devices, rendering a SCSI-3 PGR solution insufficient. Additionally, many implementations of SCSI-3 PGR are not correct or complete, rendering some existing storage device implementations unusable for SCSI-3 PGR-based I/O fencing.
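The access rule of a persistent reservation can be illustrated with a simplified model. This sketch does not use real SCSI-3 commands; the `ReservedDisk` class and its `register`, `remove`, and `write` methods are hypothetical names that capture only the idea of a device-side list of permitted initiating hosts.

```python
class ReservedDisk:
    """Toy disk that accepts writes only from registered initiating hosts."""

    def __init__(self):
        self.registered = set()     # hosts currently granted access
        self.blocks = {}

    def register(self, host):
        """Add a host to the set permitted by the reservation."""
        self.registered.add(host)

    def remove(self, host):
        """Remove a host, for example after the cluster decides it has failed."""
        self.registered.discard(host)

    def write(self, host, block, data):
        """Apply the write only if the host is registered; report the outcome."""
        if host not in self.registered:
            return False
        self.blocks[block] = data
        return True


disk = ReservedDisk()
disk.register("host-a")
disk.register("host-b")
first_write = disk.write("host-a", 0, "A-data")

# The cluster decides host-a has failed and removes it from the reservation.
disk.remove("host-a")
takeover_write = disk.write("host-b", 0, "B-new")

# A delayed host-a resumes, but the disk itself now refuses its write.
stale_write = disk.write("host-a", 0, "A-stale")
```

Because the check is made by the disk at the time of each access, the fence holds regardless of how long the removed host was delayed.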
Therefore, what is needed are methods and systems for providing flexible and reliable I/O fencing in a shared storage environment and, correspondingly, for reliably preventing data corruption in shared storage devices.