1. Field of the Invention
The present invention relates to a method, system, and program for managing failures in mirrored systems.
2. Description of the Related Art
Disaster recovery systems typically address two types of failures: a sudden catastrophic failure at a single point in time, or data loss over a period of time. In the second type, a gradual disaster, updates to volumes may be lost. To assist in recovery of data updates, a copy of data may be provided at a remote location. Such dual or shadow copies are typically made as the application system is writing new data to a primary storage device. Different copy technologies may be used for maintaining remote copies of data at a secondary site, such as International Business Machines Corporation's (“IBM”) Extended Remote Copy (XRC), Coupled XRC (CXRC), Global Copy, and Global Mirror Copy. These different copy technologies are described in the IBM publications “The IBM TotalStorage DS6000 Series: Copy Services in Open Environments”, IBM document no. SG24-6783-00 (September 2005) and “IBM TotalStorage Enterprise Storage Server: Implementing ESS Copy Services with IBM eServer zSeries”, IBM document no. SG24-5680-04 (July 2004).
In data mirroring systems, data is maintained in volume pairs. A volume pair comprises a volume in a primary storage device and a corresponding volume in a secondary storage device that includes an identical copy of the data maintained in the primary volume. Primary and secondary storage controllers may be used to control access to the primary and secondary storage devices. In certain backup systems, a sysplex timer is used to provide a uniform time across systems so that updates written by different applications to different primary storage devices use a consistent time-of-day (TOD) value as a time stamp. Application systems time stamp data sets when writing such data sets to volumes in the primary storage. The integrity of data updates depends on ensuring that updates are done at the secondary volumes in the volume pair in the same order as they were done on the primary volume. The time stamp provided by the application program determines the logical sequence of data updates.
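The ordering requirement above can be illustrated with a minimal sketch. It assumes time stamps are plain integers and models the secondary volumes as a dictionary; the names `Update`, `apply_in_logical_order`, `VOL_A`, and `VOL_B` are hypothetical and are not taken from any of the copy technologies named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    tod: int      # time-of-day stamp set by the application (assumed integer ticks)
    volume: str   # hypothetical volume identifier
    data: str


def apply_in_logical_order(updates, secondary):
    """Apply updates to the secondary volumes in time-stamp order."""
    for u in sorted(updates, key=lambda u: u.tod):
        secondary.setdefault(u.volume, []).append(u.data)
    return secondary


# Updates may arrive at the secondary site out of order.
updates = [
    Update(tod=30, volume="VOL_B", data="commit"),
    Update(tod=10, volume="VOL_A", data="begin"),
    Update(tod=20, volume="VOL_A", data="row"),
]
secondary = apply_in_logical_order(updates, {})
```

Sorting by the common time stamp is what lets updates from different primary storage devices be replayed in a single logical sequence.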
In many application programs, such as database systems, certain writes cannot occur unless a previous write occurred; otherwise the data integrity would be jeopardized. Such a data write whose integrity is dependent on the occurrence of previous data writes is known as a dependent write. Volumes in the primary and secondary storages are consistent when all writes have been transferred in their logical order, i.e., each write is transferred before any write that depends on it. A consistency group is a collection of updates to the primary volumes such that dependent writes are secured in a consistent manner. A consistency group has a consistency time, and all data writes in the consistency group have a time stamp equal to or earlier than that consistency time. The consistency time is the latest time to which the system guarantees that updates to the secondary volumes are consistent. Consistency groups maintain data consistency across volumes and storage devices. Thus, when data is recovered from the secondary volumes, the recovered data will be consistent.
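As a rough sketch of the relationship between a consistency time and a consistency group, the following assumes integer time stamps and takes the consistency time as a given input; how a real system derives that time is not described here. The names `Write` and `form_consistency_group` are hypothetical.

```python
from typing import NamedTuple


class Write(NamedTuple):
    tod: int       # time stamp (assumed integer ticks)
    volume: str    # hypothetical volume identifier
    payload: str


def form_consistency_group(pending, consistency_time):
    """Split pending writes into (group, remainder) by time stamp.

    The group holds every write stamped at or before the consistency
    time, in logical order; later writes wait for a future group.
    """
    group = sorted(
        (w for w in pending if w.tod <= consistency_time),
        key=lambda w: w.tod,
    )
    remainder = [w for w in pending if w.tod > consistency_time]
    return group, remainder


pending = [
    Write(5, "LOG", "intent: debit account"),
    Write(7, "DATA", "debit account"),  # dependent on the write at tod=5
    Write(12, "LOG", "intent: credit account"),
]
group, remainder = form_consistency_group(pending, consistency_time=10)
```

Because the dependent write at time 7 and the write it depends on at time 5 both fall at or before the consistency time, they land in the same group and in the correct order.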
Consistency groups are formed within a session. All volume pairs assigned to a session will have their updates maintained in the same consistency group. Thus, the sessions are used to determine the volumes that will be grouped together in a consistency group. Consistency groups are formed within a journal device or volume. From the journal, updates gathered to form a consistency group are applied to the secondary volume. If the system fails while updates from the journal are being applied to a secondary volume, then during recovery operations the updates that did not complete writing to the secondary volume can be recovered from the journal and applied to the secondary volume.
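The journal-based recovery described above might look like the following minimal sketch. It assumes an in-memory journal that tracks how many entries have been written to the secondary, and it simulates the failure with a hypothetical `fail_after` parameter; the class name `Journal` and its methods are illustrative only.

```python
class Journal:
    """In-memory stand-in for a journal volume (illustrative only)."""

    def __init__(self):
        self.entries = []  # (volume, track, data) tuples in logical order
        self.applied = 0   # number of entries already written to the secondary

    def harden(self, group):
        """Store a complete consistency group before applying it."""
        self.entries = list(group)
        self.applied = 0

    def apply(self, secondary, fail_after=None):
        """Apply unapplied entries; fail_after simulates a system failure."""
        for i in range(self.applied, len(self.entries)):
            if fail_after is not None and i >= fail_after:
                raise RuntimeError("simulated failure during apply")
            volume, track, data = self.entries[i]
            secondary[(volume, track)] = data
            self.applied = i + 1


journal = Journal()
journal.harden([("VOL_A", 0, "a1"), ("VOL_A", 1, "a2"), ("VOL_B", 0, "b1")])

secondary = {}
try:
    journal.apply(secondary, fail_after=1)  # failure after the first entry
except RuntimeError:
    pass
partial = dict(secondary)  # state of the secondary at the time of failure

journal.apply(secondary)  # recovery: resume from the journal
```

Because the whole group is held in the journal before any entry is applied, the second `apply` call can complete the interrupted group without re-reading the primary volumes.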
Certain applications, such as database applications, may write user data to one set of primary volumes in a session and write exception information to another set of primary volumes in another or the same session. If a failure occurs such that the application cannot continue to write to the primary volumes including the user data, the application may still be able to write exception information on the failure to different primary volumes having the exception information, and this failure exception information may be propagated to the secondary volumes mirroring the exception information. In such a case, the secondary volumes have error-free user data; however, the exception information for the user data in the secondary volumes indicates that a failure occurred. During failure recovery operations, the administrator must perform extensive recovery operations at the secondary site to correct this data discrepancy in the mirrored copy, because the secondary copy of the exception information indicates a failure or error that is not reflected in the mirrored user data.
For these reasons, there is a need in the art for improved techniques for handling failures in a mirrored environment.