1. Technical Field
The present invention relates to the field of disaster recovery computer systems, and more particularly to the resynchronization of primary and secondary copies of data after a disconnection and reestablishment of the Remote Copy pair.
2. Description of the Related Art
In the field of computer data processing there is increasing demand for ways to allow organizations to continue doing business even after the loss of use of data or processing capability at the main business data processing site. The technique used in typical disaster recovery solutions is known in the art as “Remote Copy”, or sometimes “Peer-to-Peer Remote Copy (PPRC)”.
In a typical Remote Copy solution, one storage controller is designated as holding the primary disk of a remote copy relationship. The primary disk of the relationship will be referred to herein as the Master. The Master is the disk normally used by a person or organization for day-to-day processing. A second storage controller holds the secondary disk of the remote copy relationship, which will be known as the Auxiliary. The Auxiliary is the disk normally not used by a person or organization for day-to-day processing, but held in reserve in case of a need for disaster recovery or business continuity operations after the loss of use of the Master. Both Master and Auxiliary are the same size. Many solutions allow multiple sets of disks to be managed in a coordinated fashion, and often a controller might hold Masters for one relationship, and Auxiliaries for others, but for clarity and conciseness the present description will focus on a single relationship comprising two disks. In normal operation, the Master is used as the primary source and target of all host I/O requests. In these circumstances, the term Master/primary will be used in this description. Similarly, in normal operation, the Auxiliary is not used as the source or target of host I/O requests, but is used to hold a copy of the data from the Master/primary and to accept changes passed on to it from the Master/primary as a result of writes directed to the Master/primary. In these circumstances, the term Auxiliary/secondary will be used in this description. The Master/primary is thus the disk that normally, in the absence of a disaster, holds the application data. The function of Remote Copy is to maintain a copy of that data on the Auxiliary/secondary disk.
To establish initial synchronization, all the data is copied from the Master/primary to the Auxiliary/secondary. Once synchronization has been established, each write I/O received at the Master/primary is applied to the Master/primary disk and is also sent to the Auxiliary/secondary disk. Under normal circumstances, the Auxiliary/secondary does not receive writes from applications directly, but only indirectly, as writes issued at the Master/primary and forwarded to it.
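The initial synchronization and write forwarding described above can be sketched as follows. This is a minimal illustration only, assuming two in-memory byte arrays stand in for the disks; the class name `RemoteCopyPair` and its methods are hypothetical and do not correspond to any particular product.

```python
class RemoteCopyPair:
    """Toy model of one Remote Copy relationship (hypothetical names)."""

    def __init__(self, size: int) -> None:
        # Master and Auxiliary are the same size, as stated in the description.
        self.master = bytearray(size)
        self.auxiliary = bytearray(size)
        self.synchronized = False

    def initial_sync(self) -> None:
        # Full copy of all data from Master/primary to Auxiliary/secondary.
        self.auxiliary[:] = self.master
        self.synchronized = True

    def host_write(self, offset: int, data: bytes) -> None:
        # Host I/O is directed at the Master/primary only.
        end = offset + len(data)
        if offset < 0 or end > len(self.master):
            raise ValueError("write falls outside the disk")
        self.master[offset:end] = data
        # Once synchronized, every write is also forwarded to the secondary.
        if self.synchronized:
            self.auxiliary[offset:end] = data
```

In this sketch a write issued before `initial_sync()` reaches only the Master, and the subsequent full copy brings the Auxiliary up to date.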
In the event of a loss of connection between the two sites, a well-known conventional technique is to use change recording at the Master/primary. This typically uses a bitmap to record which regions of the disk at the Master/primary have received write I/O. It is common to map a single bit to 32 KB of data, or some similarly small amount. Once the link is reestablished, the bitmap is used to resynchronize the Auxiliary/secondary, bringing it fully up to date with the Master/primary, by transferring the data corresponding to every bit marked as changed in the bitmap.
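The conventional change-recording scheme described above can be sketched as follows. This is an illustrative sketch under stated assumptions: disks are modeled as equal-sized byte arrays, one bit covers a 32 KB region, and the names `GRAIN_SIZE`, `ChangeRecorder`, and `resynchronize` are hypothetical.

```python
GRAIN_SIZE = 32 * 1024  # bytes of data covered by one bit (assumed granularity)


class ChangeRecorder:
    """Bitmap of regions written at the Master/primary while disconnected."""

    def __init__(self, disk_size: int) -> None:
        self.num_grains = (disk_size + GRAIN_SIZE - 1) // GRAIN_SIZE
        self.bitmap = bytearray((self.num_grains + 7) // 8)

    def record_write(self, offset: int, length: int) -> None:
        if length <= 0:
            return
        first = offset // GRAIN_SIZE
        last = (offset + length - 1) // GRAIN_SIZE
        # A write may straddle several regions; mark each one it touches.
        for grain in range(first, last + 1):
            self.bitmap[grain // 8] |= 1 << (grain % 8)

    def dirty_grains(self) -> list:
        return [g for g in range(self.num_grains)
                if self.bitmap[g // 8] & (1 << (g % 8))]


def resynchronize(master: bytearray, auxiliary: bytearray,
                  recorder: ChangeRecorder) -> None:
    """Copy only the regions marked as changed, then clear the bitmap."""
    for grain in recorder.dirty_grains():
        start = grain * GRAIN_SIZE
        end = min(start + GRAIN_SIZE, len(master))
        auxiliary[start:end] = master[start:end]
    recorder.bitmap = bytearray(len(recorder.bitmap))
```

Only the marked regions cross the link, so the cost of resynchronization in this sketch scales with the amount of data changed during the disconnection rather than with the size of the disk.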
However, there are uses of Remote Copy where this well-known scheme by itself is insufficient. One example is what is done after a disaster. Typically, if a disaster occurs at the Master/primary controller, then access to the Auxiliary/secondary controller is enabled, and the application is restarted using the storage there. This situation will be referred to in this description by using the term Auxiliary/primary.
However, the next thing that is needed is to reestablish a disaster recovery capability. In many ‘disasters’, the Master site is in fact physically intact, possibly only having suffered a power failure or a similar short-term failure. It is thus possible to use the Master (old primary) as the secondary of the relationship (thus creating a Master/secondary), and to have the Auxiliary become the primary (as an Auxiliary/primary, as defined above), essentially reversing the flow of data. While this is possible with today's products, they require that a full copy be performed from Auxiliary/primary to Master/secondary, repeating the problem faced by the user in the initial setup.
While this cost may at first appear to be acceptable because a real disaster is an infrequent occurrence, it must be borne in mind that testing the disaster recovery system is an essential part of any disaster recovery plan. Some companies and other organizations are required to demonstrate their disaster recovery capability in order to pass an audit, possibly as frequently as once a month. If the disaster recovery test involves carrying out a complete failover of the business as described above, the cost of a full copy from Auxiliary/primary to Master/secondary to reestablish synchronization is very high.
All known conventional schemes require a full copy after such a failover scenario, unless great care was taken to ensure that the application was completely halted at the old primary, with no outstanding “in-flight” updates, before switching the primary/secondary roles. This, however, is not typical of the way in which complex systems fail. Frequently, failures are of the type known as “rolling failures”, where parts of the original Master/primary system fail over a period of time before the failover is triggered. In these circumstances, there may be changes made at the old Master/primary during the rolling failure of which the original Auxiliary/secondary has not been made aware.
It might be thought that the solution to the problem would be to set up the remote copy in reverse, and simply use change recording on the Auxiliary/primary to define what must be copied back to the Master/secondary after recovery from a disaster. This is inadequate because, as described above, changes might have been made at the original Master/primary during the failure that were not change-recorded at the original Auxiliary/secondary. If these are not corrected, then the Master and the Auxiliary may never become truly synchronized.
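The inadequacy described above can be illustrated with a small simulation. It is a sketch of the problem only, not of any solution; the disks are modeled as lists of region values, and the names `copy_back` and `simulate_reverse_resync` are hypothetical.

```python
def copy_back(aux: list, master: list, aux_dirty: set) -> None:
    """Reverse resynchronization driven only by the Auxiliary's change record."""
    for region in sorted(aux_dirty):
        master[region] = aux[region]


def simulate_reverse_resync() -> list:
    regions = 8
    master = ["v0"] * regions
    aux = ["v0"] * regions  # the pair starts out synchronized

    # Rolling failure: a write lands at the old Master/primary but is never
    # forwarded, so the Auxiliary/secondary is unaware of it.
    master[2] = "in-flight"

    # Failover: the application restarts at the Auxiliary/primary, which
    # records its own changes.
    aux_dirty = set()
    aux[5] = "post-failover"
    aux_dirty.add(5)

    # The Master returns as Master/secondary and receives only the
    # regions recorded at the Auxiliary.
    copy_back(aux, master, aux_dirty)

    # Regions that still differ after the reverse resynchronization.
    return [r for r in range(regions) if master[r] != aux[r]]
```

In this model the region written during the rolling failure remains different on the two disks after the copy back, because nothing at the Auxiliary recorded that it had changed.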
A different scenario, but again one which might occur in the context of a disaster recovery or other form of test (such as an upgrade test), is where the Auxiliary/secondary is broken away from the Master/primary, and then directly receives write I/Os in its isolated state, perhaps from a test application, while the business continues to run as normal at the Master. Here, the resynchronization after reestablishment of the connection must be from Master to Auxiliary, even though the Auxiliary has been temporarily treated as an Auxiliary/primary while the Master was simultaneously being treated as a Master/primary. It is essential in this case that the real application data at the Master not be overwritten by the test data that has been applied at the Auxiliary during the period of its isolation from the Master.
It is therefore desirable to have an efficient means of Remote Copy resynchronization that alleviates the disadvantage of the costly full copies of data required by the conventional systems described above.