Continuous availability (CA) is the attribute of a system or cluster of systems to provide high availability (i.e., mask unplanned outages from an end-user perspective) and continuous operations (i.e., mask planned maintenance from an end-user perspective). Attempts to achieve these attributes in hardware have relied on redundancy mechanisms such as multiple servers, multiple coupling facilities (CFs), multiple sysplex timers, and multiple channel paths spread across multiple switches. Attempts to achieve these attributes in software have relied on software redundancy, such as redundant z/OS images (z/OS being IBM's operating system for the mainframe environment, which operates on zSeries processors) and multiple software subsystems per z/OS image.
Existing CA systems generally comprise disk subsystems that are a single point of failure. For example, where there is only one copy of disk-resident data and the disk subsystem becomes nonfunctional for any reason, the system and/or the applications executing therein typically experience an outage even when the system's other components are redundant or fault tolerant. Some CA systems, including those comprising synchronous disk mirroring subsystems, such as those supporting Peer-to-Peer Remote Copy (PPRC) functions, reduce the opportunity for outages by having two copies of the data and the cluster spread across two geographical locations.
There are several types of outages that a CA system may experience. A first type of outage is a disk subsystem failure. If a PPRC-enabled system experiences a primary disk subsystem failure (i.e., the primary disk subsystem is inaccessible, causing an impact on service), required repairs can be performed on the primary disk subsystem while a disruptive failover to the secondary disk subsystem is performed. Restoration of service typically requires less than one hour, which compares favorably to non-PPRC systems that typically require several hours before service can be restored. In addition, non-PPRC systems may experience logical contamination, such as permanent Input/Output (I/O) errors, which would also be present on the secondary PPRC volume and would require a data recovery action prior to the data being accessible. For example, IBM DB2 will create a Logical Page List (LPL) entry for each table space that receives a permanent I/O error for which recovery is required. Referring again to a PPRC-enabled system, once the primary disk subsystem is repaired, the original PPRC configuration is restored by performing a disruptive switch or by using existing PPRC/dynamic address switching (P/DAS) functions.
A second type of outage that may be experienced is a site failure wherein the failed site includes disk subsystems necessary for continued operations. When a PPRC-enabled system experiences a site failure because, for example, z/OS images within a site become nonfunctional or the primary PPRC disk subsystem(s) are inaccessible, the operator on the PPRC-enabled system can initiate a disruptive failover to the surviving site and restore service within one hour. When the failed site is restored, the original PPRC configuration is restored by performing a disruptive switch or by using existing PPRC/dynamic address switching (P/DAS) functions.
A third type of outage that may be experienced is caused by disk subsystem maintenance. When a PPRC-enabled system requires disk subsystem maintenance, there are at least two methods for proceeding. The operator may perform a disruptive planned disk switch to the secondary disk subsystem, restoring service typically in less than one hour. The majority of PPRC systems use this technique to minimize the time during which their disaster recovery (D/R) readiness is disabled. Alternatively, the system may use existing PPRC P/DAS functions to transparently switch the secondary disk subsystem into use.
Existing PPRC and z/OS P/DAS mechanisms process each PPRC volume pair switch sequentially as a result of z/OS Input/Output Services Component serialization logic, thus requiring approximately twenty to thirty seconds to switch each PPRC pair. A freeze function is issued to prevent I/O for the duration of the P/DAS processing due to primary disks being spread across two sites, resulting in the potential for a lack of disaster recovery (D/R) readiness lasting for a significant period of time. For example, assuming that a PPRC enterprise wanted to perform maintenance on one disk subsystem that contained 1024 PPRC volumes and P/DAS were used to perform a transparent switch, the elapsed P/DAS processing time would be approximately 5.7 to 8.5 hours (1024 volume pairs × 20 to 30 seconds of processing time per volume pair). Additionally, there are requirements, as described in the IBM publication DFSMS/MVS V1 Advanced Copy Services (SC35-0355), that must be met for P/DAS to work, thereby making it very unlikely that a production PPRC disk subsystem pair can be switched using P/DAS without manual intervention. Because many enterprises are unable to tolerate having their D/R readiness disabled for several hours, they often elect to perform a disruptive planned disk switch instead of using the P/DAS function. Once the disk subsystem maintenance is completed, the operator will restore the original PPRC configuration by performing a disruptive switch or by using the existing P/DAS function.
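The elapsed-time estimate above follows directly from the sequential processing of volume pairs. The short Python sketch below reproduces that arithmetic only; the function name `sequential_switch_hours` is hypothetical and introduced here for illustration, and the sketch assumes a fixed per-pair switch time with no overlap between pairs, as the description of the serialization logic suggests.

```python
def sequential_switch_hours(volume_pairs: int, seconds_per_pair: float) -> float:
    """Elapsed hours when volume pairs are switched strictly one at a time.

    Assumes a constant per-pair switch time and no parallelism.
    """
    total_seconds = volume_pairs * seconds_per_pair
    return total_seconds / 3600.0


# Figures taken from the example in the text: 1024 pairs, 20 to 30 seconds each.
low_hours = sequential_switch_hours(1024, 20)
high_hours = sequential_switch_hours(1024, 30)

print(f"{low_hours:.1f} to {high_hours:.1f} hours")  # prints "5.7 to 8.5 hours"
```

Because the elapsed time grows linearly with the number of volume pairs, doubling the number of PPRC volumes on the disk subsystem would double the period during which D/R readiness is unavailable under this model.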
The present invention provides a continuous availability solution (in the event of a primary disk subsystem failure and planned maintenance) for transparent disaster recovery for both uni-geographically and multi-geographically located disk subsystems.