1. Technical Field
The present application relates generally to an improved data processing system and method. More specifically, the present application is directed to a system and method of error recovery for backup applications.
2. Description of Related Art
Modern storage controllers have several copy service features, such as point-in-time copy, e.g., FlashCopy, synchronous and asynchronous controller-to-controller remote copy, and the like. The rationale behind having these features in storage controllers is to relieve the application and/or data servers of the burden of creating backup copies of the data volumes. The creation of backup copies of data volumes using storage copy services has to be synchronized with operating system and application activities to ensure the integrity and consistency of the copies.
Several database applications, such as DB2 and Oracle, have well-defined hooks to synchronize the creation of backup volumes using copy services. This is usually achieved by executing a set of commands that flush the application buffers to disk and put the application in backup mode, thereby providing a window for the backup client or the operator to trigger the copy services operations necessary to generate the hardware copy. Recently, vendors have integrated a similar framework into the operating system itself to achieve greater consistency of the copy. Microsoft's Volume Shadow Copy Service (VSS) for Windows Server 2003 is one such framework.
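The backup window described above may be sketched as follows. This is a minimal illustration only: the class FakeDatabase and the names flush_buffers, enter_backup_mode, leave_backup_mode, backup_window, and take_copy are hypothetical and do not correspond to the actual hooks of DB2, Oracle, or VSS.

```python
from contextlib import contextmanager


class FakeDatabase:
    """Hypothetical application exposing backup hooks (not a real DB2/Oracle API)."""

    def __init__(self):
        self.events = []
        self.in_backup_mode = False

    def flush_buffers(self):
        self.events.append("flush")

    def enter_backup_mode(self):
        self.in_backup_mode = True
        self.events.append("enter_backup_mode")

    def leave_backup_mode(self):
        self.in_backup_mode = False
        self.events.append("leave_backup_mode")


@contextmanager
def backup_window(app):
    """Flush buffers, hold the application in backup mode, then release it."""
    app.flush_buffers()
    app.enter_backup_mode()
    try:
        yield app
    finally:
        # Always leave backup mode, even if the copy operation fails.
        app.leave_backup_mode()


def take_copy(app, trigger_copy):
    """Trigger the copy services operation inside the consistent window."""
    with backup_window(app):
        trigger_copy()
```

In this sketch the copy services operation is passed in as a callable, so that the window is released whether or not the copy succeeds.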
Taking a consistent copy of a set of source volumes using the copy services of the storage controllers is a complex task that involves several interactions with the storage controllers. In some situations, such as with VSS for example, the copies, once created, have to be attached to the client machine so that a backup application can transfer the data onto a tape device for permanent storage. After that transfer, the copies have to be detached from the host. This attaching and detaching of volumes requires additional interactions with the storage controller. Moreover, these operations have to be repeated for every data volume and, since modern client machines typically have several data volumes, such operations create a large overhead.
For example, a typical storage controller that supports point-in-time copies performs the following series of operations:
1) collect details of the source volume(s);
2) create or select target volume(s) for the copy;
3) create special objects, such as consistency groups, within the storage controller(s) in order to guarantee the consistency of the copy;
4) create a point-in-time copy object in the storage controller for each source volume;
5) prepare the source volume(s) for copying;
6) create the copy when the operating system and/or the application are in a consistent state;
7) attach the copy to the client machine if needed by the framework/application;
8) start copying the data from the copy to tape if necessary;
9) detach the target volume from the client machine when step 8 is complete;
10) remove all of the special copy services objects, such as the point-in-time copy objects and consistency objects created for this backup operation from the storage controller; and
11) remove the copy and reclaim the space if the backup was not for long term preservation, such as one used just to transfer the consistent data to another medium, e.g., tape.
Each of the operations listed above, with the exception of operation 8, requires one or more interactions with the storage controller. Moreover, the above operations need to be repeated for each source volume since storage controller commands operate on one volume at a time.
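The overhead of repeating these operations per volume may be illustrated with the following sketch. The class FakeController, the command names, and the function backup are hypothetical, and the sketch assumes exactly one controller interaction per operation, whereas an actual operation may require more than one.

```python
class FakeController:
    """Hypothetical stand-in for a storage controller; records each command."""

    def __init__(self):
        self.commands = []

    def send(self, command, volume):
        self.commands.append((command, volume))


# Operations 1-7 and 9-11 from the list above (names are illustrative).
# Operation 8, copying the data to tape, runs on the host and is omitted.
BEFORE_TAPE = [
    "collect_details", "select_target", "create_consistency_group",
    "create_copy_object", "prepare_source", "create_copy", "attach_copy",
]
AFTER_TAPE = ["detach_copy", "remove_copy_objects", "remove_copy"]


def backup(controller, volumes):
    """Run the whole sequence once per source volume; return the command count."""
    for volume in volumes:
        for command in BEFORE_TAPE:
            controller.send(command, volume)
        # Operation 8 would happen here, without any controller interaction.
        for command in AFTER_TAPE:
            controller.send(command, volume)
    return len(controller.commands)
```

Under these assumptions, backing up three volumes issues thirty controller commands, and the count grows linearly with the number of source volumes.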
In short, taking a backup of a system using copy services is a complex procedure that is difficult, if not impossible, to perform manually. It is therefore desirable to put the logic for performing such an operation in a separate, dedicated application which can be invoked automatically. Frameworks such as VSS already support such automation by providing application program interfaces (APIs) for storage vendors to implement these operations.
Error recovery is an important issue in developing such a backup application. If the application detects an error condition during a backup operation using copy services, all modifications performed so far by the storage controller during the backup operation need to be rolled back. Otherwise, the resources created so far would be wasted. In particular, the following error conditions need to be detected and recovered from:
1) a transient condition in the storage controller, such as lack of sufficient resources, e.g., not enough free storage space for target volumes, that makes continuing the copying operation impossible or generates an incomplete copy;
2) a transient condition in the client machine, e.g., operating system or application was not able to guarantee the consistency of the data, that requires the backup operation to be aborted;
3) communication to the storage controller is lost;
4) application host crashes; and
5) storage controller crashes.
Currently, these types of error recovery operations are implemented in the backup client application, usually by keeping a list of all the changes made in the storage controller and rolling back these changes. This approach has the following disadvantages:
1) recovery code is replicated in every instance of the backup client application, making code maintenance difficult. Moreover, every solution developer is forced to spend time and resources in providing a new backup client application with every product;
2) this approach is not easy to automate and is difficult to administer. In the case of a storage controller crash or loss of communication by the client machine, for example, the backup client application cannot roll back changes immediately. This has to be done manually when the storage controller comes back online. To make the recovery automatic, the backup client application has to take care of this situation as well, which requires maintaining the log for a longer duration until the storage controller restarts; and
3) if the backup client application host crashes during a backup operation, no recovery is possible until the host system reboots. Until recovery code is executed, the storage resources are locked up and not available for other client machines. This can have a significant impact if resources are shared among multiple clients in a rolling fashion. For example, the same storage space may be used for backing up data (and then moving the data to a tape device) from several host systems one after the other. In this scenario, a crashed host system which does not reboot immediately may prevent other client machines from backing up their data.
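The client-side recovery approach described above, in which the backup client keeps a list of changes and rolls them back, may be sketched as follows. The names ControllerError, FakeController, and backup_with_rollback are hypothetical, and the in-memory controller merely simulates a lack of free space for target volumes.

```python
class ControllerError(Exception):
    """Hypothetical error raised when a controller command fails."""


class FakeController:
    """Hypothetical controller with a limited number of free target volumes."""

    def __init__(self, free_targets):
        self.free_targets = free_targets
        self.objects = set()

    def create(self, name):
        if name.startswith("target:"):
            if self.free_targets == 0:
                raise ControllerError("not enough free space for " + name)
            self.free_targets -= 1
        self.objects.add(name)

    def delete(self, name):
        if name in self.objects:
            self.objects.remove(name)
            if name.startswith("target:"):
                self.free_targets += 1


def backup_with_rollback(controller, volumes):
    """Create copy objects per volume; on error, undo them in reverse order."""
    undo_log = []  # kept only in client memory, so it is lost if the host crashes
    try:
        for volume in volumes:
            for name in ("target:" + volume, "copy:" + volume):
                controller.create(name)
                undo_log.append(name)
        return True
    except ControllerError:
        for name in reversed(undo_log):
            controller.delete(name)
        return False
```

Because the undo log in this sketch resides in the client, the rollback cannot run if the client host crashes or loses communication with the storage controller, which corresponds to the disadvantages listed above.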