The reliable recovery of backup data is very important to many industries today. Businesses such as banks, airlines, and hotels face the possibility of large financial losses if data is unrecoverable due to a computer malfunction or an earthquake, flood, or some other natural disaster.
One conventional technique for recovering backup data involves the maintenance of data in "duplex pairs." In a duplex pair configuration, each time data is written on a disk or some other storage medium, a duplicate copy is written on a backup disk as well. One particular method of creating duplex pairs of data is the Peer-to-Peer Remote Copy (PPRC) procedure. Duplex pairs and PPRC are well known in the art and will not be discussed in detail here.
FIG. 1 illustrates a system which uses the PPRC procedure. The system 100 includes a primary host processor 102 which is connected to a primary subsystem 104 on which the data 106 is stored. The primary subsystem 104 is connected via cable 108 to a secondary subsystem 112 at a remote site on which a copy of the data 114 is stored. This secondary subsystem 112 could be connected to a secondary host processor but need not be. Each time data is written or changed on the primary subsystem 104, the primary subsystem 104 will transfer and write a copy of the data to the secondary subsystem 112. In this manner, the data and its duplicate are maintained in pairs. In this system, only the primary subsystem 104 can write to the secondary subsystem 112. If any other source could write to the secondary subsystem 112, the data on the primary and secondary subsystems could fall out of sync, which would compromise the reliability of the backup data.
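The write path described above can be illustrated with a minimal sketch. It is not the PPRC implementation; the class names, the method names, and the dictionary used as storage are all hypothetical and serve only to show a primary forwarding every write to its paired secondary, which refuses writes from any other source.

```python
class SecondarySubsystem:
    """Hypothetical stand-in for the secondary subsystem 112."""

    def __init__(self):
        self._tracks = {}  # address -> data (illustrative storage only)

    def receive_copy(self, address, data, sender):
        # Only the paired primary may write; any other writer could
        # leave the two copies out of sync.
        if getattr(sender, "secondary", None) is not self:
            raise PermissionError("only the paired primary may write")
        self._tracks[address] = data

    def read(self, address):
        return self._tracks[address]


class PrimarySubsystem:
    """Hypothetical stand-in for the primary subsystem 104."""

    def __init__(self, secondary):
        self.secondary = secondary
        self._tracks = {}

    def write(self, address, data):
        # Write locally, then transfer the duplicate to the secondary
        # so the data and its copy are maintained as a pair.
        self._tracks[address] = data
        self.secondary.receive_copy(address, data, sender=self)

    def read(self, address):
        return self._tracks[address]
```

In this sketch, a call such as `primary.write(7, b"payload")` leaves the same bytes readable from both subsystems, while a direct call to `receive_copy` from any other object is rejected.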
One problem with duplex pairs is the cumbersome nature of the recovery of data.
In the current state of the art, the operator 116 discovers data to be damaged or lost, typically through the issuance of a read command to the primary subsystem 104, via the primary host processor 102. Upon issuance of this command, the primary subsystem 104 attempts to fetch the data and discovers the data loss. The primary subsystem 104 terminates the job and returns a data error status to the primary host processor 102. The primary host processor 102 recognizes the data error and notifies the operator 116 of the error. The operator 116 must manually access the secondary subsystem 112 and request to read the copy of the lost data. The operator 116 then manually commands the transfer of this data from the secondary subsystem 112 to the primary subsystem 104 and commands the rewrite of the data onto the primary subsystem 104. In this way, the lost data is recovered. The operator 116 then must restart the job.
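The manual sequence described above can be sketched as follows. This is an illustrative model only: the two subsystems are represented as plain dictionaries, a missing key stands in for lost data, and every name (`DataError`, `run_job`, `operator_recover`, `run_with_manual_recovery`) is a hypothetical introduced for the example.

```python
class DataError(Exception):
    """Hypothetical data error status reported for a lost track."""


def read_primary(primary, address):
    # The primary discovers the loss only when it tries to fetch the data.
    if address not in primary:
        raise DataError(address)
    return primary[address]


def run_job(primary, addresses):
    # Any DataError propagates and terminates the whole job.
    return [read_primary(primary, a) for a in addresses]


def operator_recover(primary, secondary, address):
    # Manual step: read the copy from the secondary, rewrite it on the primary.
    primary[address] = secondary[address]


def run_with_manual_recovery(primary, secondary, addresses):
    log = []
    while True:
        try:
            result = run_job(primary, addresses)
            log.append("job completed")
            return result, log
        except DataError as err:
            address = err.args[0]
            log.append(f"job terminated: data error at {address}")
            operator_recover(primary, secondary, address)
            log.append(f"operator recopied {address}; job restarted")
```

The log produced by `run_with_manual_recovery` makes the cost visible: each lost track forces one job termination, one operator intervention, and one restart from the beginning of the job.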
The data recovery process is equally cumbersome when the secondary subsystem loses data. A secondary subsystem 112 will discover it has lost data during a typical operation, such as the management of available addressable space on a disk, commonly referred to in the field as "free space collection." This operation is well known in the art and will not be further described here. When a secondary subsystem 112 discovers it has lost data, it goes into a suspended state. It then broadcasts this changed state to the operator 116, who then must manually recopy the lost data from the primary subsystem and reestablish the duplex pair.
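The secondary-side behavior described above can be modeled as a two-state machine. The sketch below is an assumption-laden illustration, not the behavior of any actual subsystem: the `PairState` enum, the `Secondary` class, the `notify` callback, and the way loss is detected (a missing dictionary key found during a simulated free space collection pass) are all hypothetical.

```python
from enum import Enum


class PairState(Enum):
    DUPLEX = "duplex"
    SUSPENDED = "suspended"


class Secondary:
    """Hypothetical stand-in for the secondary subsystem 112."""

    def __init__(self, tracks, notify):
        self.tracks = dict(tracks)
        self.state = PairState.DUPLEX
        self.notify = notify  # callback that alerts the operator

    def free_space_collection(self, expected_addresses):
        # Simulated routine operation during which lost data is noticed.
        missing = sorted(a for a in expected_addresses if a not in self.tracks)
        if missing:
            self.state = PairState.SUSPENDED
            self.notify(missing)  # broadcast the changed state
        return missing


def operator_reestablish(primary_tracks, secondary, missing):
    # Manual step: recopy the lost data, then reestablish the duplex pair.
    for address in missing:
        secondary.tracks[address] = primary_tracks[address]
    secondary.state = PairState.DUPLEX
```

In this model the pair remains suspended until `operator_reestablish` is called, which mirrors the dependence on operator action described above.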
These methods of recovering data require the termination of the job and the extensive involvement of the operator. Therefore, there is a need for a method of recovering lost data maintained in duplex pairs which does not require the termination of the job and does not require the involvement of the operator. The present invention addresses such a need.