1. Field of the Invention
This invention relates to systems and methods for increasing system availability in Peer-to-Peer-Remote-Copy (“PPRC”) environments.
2. Background of the Invention
In data replication environments such as Peer-to-Peer-Remote-Copy (“PPRC”) environments, data is mirrored from a primary storage device to a secondary storage device to maintain two consistent copies of the data. The primary and secondary storage devices may be located at different sites, perhaps hundreds or even thousands of miles away from one another. In the event the primary storage device fails, I/O may be redirected to the secondary storage device, thereby enabling continuous operations. When the primary storage device is repaired, I/O may be redirected back to the former primary storage device. The process of redirecting I/O from the primary storage device to the secondary storage device when a failure or other event occurs may be referred to as a swap or HyperSwap.
HyperSwap is a function provided by IBM's z/OS operating system that provides continuous availability in the event of disk failures by maintaining synchronous copies of primary disk volumes on one or more secondary storage controllers. When a disk failure is detected at a primary site, a host system running the z/OS operating system identifies HyperSwap managed volumes. Instead of rejecting I/O requests, the host system uses the HyperSwap function to switch (or swap) information in internal control blocks so that I/O requests are driven against synchronous copies at the secondary site. Since the secondary volumes are identical copies of the primary volumes prior to the failure, the I/O requests will succeed with minimal impact (i.e., a slight delay in I/O response time) on the issuing applications. This functionality masks disk failures from applications and ideally avoids application or system outages. An event which initiates a HyperSwap may be referred to as a “swap trigger.”
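The control-block swap described above can be illustrated with a minimal sketch. The `ControlBlock` class, its fields, and the `hyperswap` and `issue_io` functions are hypothetical names chosen for illustration only; they do not correspond to actual z/OS data structures or interfaces.

```python
from dataclasses import dataclass


@dataclass
class ControlBlock:
    """Hypothetical stand-in for a host's internal per-volume control block."""
    volume: str
    primary_device: str
    secondary_device: str
    active_device: str = ""

    def __post_init__(self) -> None:
        # I/O is initially driven against the primary device.
        if not self.active_device:
            self.active_device = self.primary_device


def hyperswap(blocks: list) -> None:
    """Repoint every managed volume at its synchronous secondary copy."""
    for cb in blocks:
        cb.active_device = cb.secondary_device


def issue_io(cb: ControlBlock, request: str) -> str:
    """Route a request to whichever device the control block currently names."""
    return f"{request} -> {cb.active_device}"


blocks = [ControlBlock("VOL001", "P1", "S1")]
before = issue_io(blocks[0], "read")   # routed to the primary device
hyperswap(blocks)                      # a swap trigger has been detected
after = issue_io(blocks[0], "read")    # routed to the secondary device
```

In this sketch the issuing application calls `issue_io` the same way before and after the swap, which mirrors how the swap is intended to be masked from applications.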
In HyperSwap environments, communication links between primary and secondary volumes may fail, thereby making it impossible to mirror data between the volumes. Such an event may be referred to as a “suspend trigger” since it may cause mirroring to be suspended between the primary and secondary volumes. When a suspend trigger is detected at a primary or secondary storage controller, the storage controller may notify a host system that mirroring has been suspended. The primary storage controller may in turn delay I/O requests to affected volumes of the primary storage controller. This delay provides the host system the opportunity to suspend all mirroring to the secondary site in order to ensure a consistent copy of data exists at the secondary site, before resuming I/O to the primary site. Because mirroring is suspended, identical copies of the data no longer exist at the primary and secondary sites, and the host system will therefore disable HyperSwap.
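The suspend sequence described above may be sketched as follows. The function name `handle_suspend_trigger`, the dictionary representation of mirrored pairs, and the log messages are all assumptions made for illustration; the sketch only records the order of steps and does not model a real storage controller or host interface.

```python
def handle_suspend_trigger(pairs: dict, affected: set):
    """Walk through the suspend steps for a hypothetical set of mirrored pairs.

    pairs maps a pair name to its state ("mirroring" or "suspended").
    Returns (new_pairs, hyperswap_enabled, log).
    """
    log = []
    # The storage controller reports the suspension and holds affected I/O.
    log.append("controller: notify host of suspend")
    log.append(f"controller: hold I/O to {sorted(affected)}")
    # The host suspends ALL mirroring so the secondary copy stays consistent.
    new_pairs = {name: "suspended" for name in pairs}
    log.append("host: suspend all mirroring")
    # The copies may now diverge, so a swap is no longer safe.
    hyperswap_enabled = False
    log.append("host: disable HyperSwap")
    log.append("host: resume I/O to primary")
    return new_pairs, hyperswap_enabled, log


new_pairs, hyperswap_enabled, log = handle_suspend_trigger(
    {"A": "mirroring", "B": "mirroring"}, {"A"}
)
```

Note that pair "B" is suspended even though only "A" was affected, reflecting the goal of keeping a single consistent copy at the secondary site.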
In certain cases, one or more of the swap trigger and suspend trigger may be caused by a “rolling disaster,” where one piece of equipment is affected prior to another. Such a “rolling disaster” may be caused by a fire, flood, earthquake, power failure, or the like. In such cases, a swap trigger and suspend trigger may occur at nearly the same point in time. Current HyperSwap processing depends upon the order in which the events are detected at a host system. If the swap trigger is detected first, a HyperSwap will occur. In such cases, systems that are not impacted by the rolling disaster may survive. However, if the suspend trigger is detected first, the HyperSwap feature will be disabled and no HyperSwap will occur. In such a case, all systems will likely fail, particularly if volumes affected by the disaster are critical. In a rolling disaster, the order in which the triggers are detected at a host system cannot be predicted, making it impossible to predict whether systems that are unaffected by the disaster will HyperSwap and survive the disaster, or have HyperSwap disabled and fail.
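The order dependence described above can be made concrete with a small model. The `Host` class and the `process` function are hypothetical and greatly simplified; in particular, treating a swap trigger that arrives while HyperSwap is disabled as a host failure is an assumption that reflects the worst case described above, where the affected volumes are critical.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical host model; names are illustrative, not z/OS interfaces."""
    hyperswap_enabled: bool = True
    active_site: str = "primary"
    failed: bool = False

    def on_suspend_trigger(self) -> None:
        # Mirroring is suspended, so HyperSwap is disabled.
        self.hyperswap_enabled = False

    def on_swap_trigger(self) -> None:
        if self.active_site == "secondary":
            return  # already swapped; nothing more to do
        if self.hyperswap_enabled:
            self.active_site = "secondary"
        else:
            # Assumed worst case: no swap is possible and the host fails.
            self.failed = True


def process(events: list) -> Host:
    """Apply triggers to a fresh host in the order they are detected."""
    host = Host()
    for event in events:
        if event == "swap":
            host.on_swap_trigger()
        elif event == "suspend":
            host.on_suspend_trigger()
        else:
            raise ValueError(f"unknown event: {event}")
    return host


swap_first = process(["swap", "suspend"])     # swaps to secondary, survives
suspend_first = process(["suspend", "swap"])  # HyperSwap disabled, host fails
```

The same two triggers produce opposite outcomes depending only on detection order, which is the unpredictability at issue in a rolling disaster.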
In view of the foregoing, what are needed are systems and methods to increase the likelihood that systems will survive a rolling disaster or other similar event in PPRC environments regardless of the order in which a swap trigger and suspend trigger are detected. Ideally, such systems and methods will preserve normal behavior as much as possible for events other than rolling disasters, such as cases where a swap trigger or a suspend trigger occurs without the other, or where a swap trigger and suspend trigger are temporally separated from one another.