1. Field of the Invention
The present invention is related to improved data synchronization.
2. Description of the Related Art
In typical disaster recovery solutions, data is housed at a primary site as well as at one or more secondary sites. These secondary sites maintain a synchronized copy of the data such that no data is lost in the case of a disaster at the primary site. If a disaster occurs, processing is either “failed-over” to one of the secondary sites or the data is copied from the secondary site back to the primary site. In order for disaster recovery to be effective, the secondary sites are typically far away from the primary site so that both sites are not affected by the same disaster.
Disaster recovery systems typically address two types of failures: a sudden catastrophic failure at a single point in time, or data loss over a period of time. In the second, gradual type of disaster, updates to volumes may be lost. For either type of failure, a copy of data may be available at a remote location. Such dual or shadow copies are typically made as the application system is writing new data to a primary storage device at a primary site. A storage device is a physical unit that provides a mechanism to store data on a given medium, such that the data can be subsequently retrieved. International Business Machines Corporation (IBM), the assignee of the subject patent application, provides systems for maintaining remote copies of data at a secondary storage device, including extended remote copy (XRC®) and peer-to-peer remote copy (PPRC).
These systems provide techniques for recovering data updates between a last, safe backup and a system failure. Such data shadowing systems can also provide an additional remote copy for non-recovery purposes, such as local access at a remote site. The IBM XRC and PPRC systems are described further in z/OS V1R1.0 DFSMS Advanced Copy Services (IBM Document Number SC35-0428-00), April 2001, which is available from International Business Machines Corporation.
In such backup systems, data is maintained in volume pairs. A volume pair comprises a volume in a primary storage device and a corresponding volume in a secondary storage device that includes a consistent copy of the data maintained in the primary volume. Typically, the primary volume of the pair is maintained in a primary storage control unit, and the secondary volume of the pair is maintained in a secondary storage control unit at a different physical location than the primary storage control unit. A storage control unit is a physical hardware unit that consists of a storage server integrated with one or more storage devices to provide storage capability to a host computer. A storage server is a physical unit that provides an interface between one or more storage devices and a host computer by providing the function of one or more logical subsystems. The storage server may provide functions that are not provided by the storage device. The storage server is composed of one or more clusters of storage devices. A primary storage control unit may be provided to control access to the primary direct access storage device (DASD) and a secondary storage control unit may be provided to control access to the secondary DASD.
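By way of illustration only, the volume pair relationship described above may be sketched as follows. The sketch is not taken from the XRC or PPRC systems; the names `Volume`, `VolumePair`, and `make_pair`, the control unit labels, and the representation of a volume as a list of blocks are all hypothetical and chosen only to make the relationship concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    """Hypothetical volume: a named list of data blocks."""
    name: str
    control_unit: str  # label of the storage control unit holding the volume
    blocks: list = field(default_factory=list)


@dataclass
class VolumePair:
    """A primary volume and the secondary volume that shadows it."""
    primary: Volume
    secondary: Volume

    def write(self, index: int, data: bytes) -> None:
        # Apply each update to both volumes so that the secondary
        # remains a consistent copy of the primary.
        for volume in (self.primary, self.secondary):
            volume.blocks[index] = data

    def is_consistent(self) -> bool:
        # True when the secondary holds the same data as the primary.
        return self.primary.blocks == self.secondary.blocks


def make_pair(name: str, block_count: int) -> VolumePair:
    """Build a pair of empty volumes at two assumed control units."""
    return VolumePair(
        Volume(name, "primary-control-unit", [b""] * block_count),
        Volume(name, "secondary-control-unit", [b""] * block_count),
    )
```

In this sketch, a write made through `VolumePair.write` keeps the pair consistent, whereas an update applied to only one volume of the pair causes `is_consistent` to return false.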
It is important that all secondary data sites are synchronized and contain an exact copy of the primary site's data. Sometimes, however, errors occur that leave the system unable to determine whether the primary and secondary sites are synchronized. In typical disaster recovery solutions, if a secondary site loses certainty of synchronization with the primary site, all of the data must be copied from the primary site to the secondary site. For the large systems typical of large corporations, the time required to resynchronize the two sites is enormous due to the tremendous amount of data that must be copied. Besides the time the copy takes, the recopy also loads the data link between the two sites much more heavily than is typical. Normal processing that continues during the recopy is therefore also impacted, since the bandwidth it needs may no longer be available.
In particular, in prior art systems, when two volumes lose synchronization for any reason, it is necessary for the primary site to send the entire volume of data to the secondary site. If many volumes are affected and/or the volumes are very large, this could take a considerable amount of time. Not only does this take a long time, but all the data being sent tremendously increases the bandwidth used on the long-distance data link. If the system does not have substantial spare bandwidth, and conventional systems typically do not, then this resynchronization also impacts all other processing and disaster recovery mirroring currently happening in the system. Furthermore, in most cases of lost synchronization, very little of the data, if any, is actually out of synchronization. As a result, the entire volume of data is recopied when only a few portions of data are actually not identical.
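The cost of such a full-volume recopy may be illustrated, by way of example only, with the following sketch. It models a volume as a list of blocks, which is an assumption made for illustration; the function names, block size, and link rate are hypothetical and do not describe any particular prior art system.

```python
def full_resync(primary: list, secondary: list) -> int:
    """Copy every block from primary to secondary; return blocks sent."""
    sent = 0
    for index, block in enumerate(primary):
        secondary[index] = block  # sent whether or not it already matched
        sent += 1
    return sent


def blocks_out_of_sync(primary: list, secondary: list) -> int:
    """Count the blocks that actually differ between the two volumes."""
    return sum(1 for p, s in zip(primary, secondary) if p != s)


def transfer_seconds(blocks: int, block_bytes: int,
                     link_bytes_per_second: float) -> float:
    """Estimate the time needed to send the given number of blocks."""
    return blocks * block_bytes / link_bytes_per_second
```

For a hypothetical volume of 1000 blocks in which only 2 blocks differ, `blocks_out_of_sync` returns 2 while `full_resync` sends all 1000 blocks, so the link carries 500 times more data than the differing portions alone would require.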
Thus, there is a need for improved data synchronization.