Information drives business. For businesses that increasingly depend on data and information for their day-to-day operations, unplanned downtime due to data loss or data corruption can hurt their reputations and bottom lines. Businesses are becoming increasingly aware of these costs and are taking measures to plan for and recover from data loss. Often these measures include protecting primary, or production, data, which is ‘live’ data used for operation of the business. Copies of primary data are made on different physical storage devices, and often at remote locations, to ensure that a version of the primary data is consistently and continuously available.
Typical uses of copies of primary data include backup, Decision Support Systems (DSS) data extraction and reports, testing, and trial failover (i.e., testing failure of hardware or software and resuming operations of the hardware or software on a second set of hardware or software). These copies of data are preferably updated as often as possible so that the copies can be used in the event that primary data are corrupted, lost, or otherwise need to be restored.
Two areas of concern when a hardware or software failure occurs, as well as during the subsequent recovery, are preventing data loss and maintaining data consistency between primary and backup data storage areas. One simple strategy includes backing up data onto a storage medium such as a tape, with copies stored in an offsite vault. Duplicate copies of backup tapes may be stored onsite and offsite. However, recovering data from backup tapes requires sequentially reading the tapes. Recovering large amounts of data can take weeks or even months, which can be unacceptable in today's 24×7 business environment.
More robust, but more complex, solutions include mirroring data from a primary data storage area to a backup, or “mirror,” storage area in real-time as updates are made to the primary data. FIG. 1A provides an example of a storage environment 100 in which data 110 are mirrored. Computer system 102 processes instructions or transactions to perform updates, such as update 104A, to data 110 residing on data storage area 112.
A data storage area may take form as one or more physical devices, such as one or more dynamic or static random access storage devices, one or more magnetic or optical data storage disks, or one or more other types of storage devices. With respect to backup copies of primary data, preferably the storage devices of a volume are direct access storage devices such as disks rather than sequential access storage devices such as tapes.
In FIG. 1A, two mirrors of data 110 are maintained, and corresponding updates are made to mirrors 120A and 120B when an update, such as update 104A, is made to data 110. For example, update 104B is made to mirror 120A residing on mirror data storage area 122, and corresponding update 104C is made to mirror 120B residing on mirror data storage area 124 when update 104A is made to data 110. As mentioned earlier, each mirror should reside on a separate physical storage device from the data for which the mirror serves as a backup, and therefore, data storage areas 112, 122, and 124 correspond to three physical storage devices in this example. If one of data storage areas 112, 122, and 124 is corrupted or suffers a loss of data, either of the other two data storage areas can be used to provide the data.
FIG. 1B shows a potential problem that can occur when data are mirrored. Assume that after making update 106A to region 2 (R2) of data 110, computer system 102 crashes, as shown by the X through computer system 102. Neither region 2 of mirror 120A nor region 2 of mirror 120B is updated in corresponding transactions 106B and 106C, as shown by the X through each transaction. This failure leaves region 2 of each of mirrors 120A and 120B inconsistent with region 2 of data 110. When computer system 102 returns online, data read from region 2 of data 110 differ from data read from the corresponding region 2 of mirrors 120A and 120B. Measures to recover from inconsistencies in mirrored data due to system crashes are necessary to restore data 110, mirror 120A, and mirror 120B to consistent states. Ensuring data consistency is critical to maintaining highly available data.
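The crash scenario described above can be illustrated with a minimal Python sketch. It is a toy model only: the names `MirroredStore`, `write`, `crash_after_primary`, and `inconsistent_regions` are hypothetical, and each region is modelled as a list entry rather than a range of blocks on a physical device.

```python
class MirroredStore:
    """Toy model of primary data plus its mirrors, one value per region."""

    def __init__(self, num_regions, num_mirrors=2):
        self.data = [0] * num_regions
        self.mirrors = [[0] * num_regions for _ in range(num_mirrors)]

    def write(self, region, value, crash_after_primary=False):
        """Update the primary data, then each mirror.

        crash_after_primary is an illustrative flag that simulates the
        system failing after the primary write but before the mirror writes.
        """
        self.data[region] = value
        if crash_after_primary:
            return  # the corresponding mirror updates never happen
        for mirror in self.mirrors:
            mirror[region] = value

    def inconsistent_regions(self):
        """Return indices of regions where any mirror differs from the primary."""
        return [
            r for r, value in enumerate(self.data)
            if any(mirror[r] != value for mirror in self.mirrors)
        ]


store = MirroredStore(num_regions=8)
store.write(0, "a")                            # normal mirrored update
store.write(1, "b", crash_after_primary=True)  # region 2 (index 1): crash before mirroring
print(store.inconsistent_regions())            # [1]
```

Note that `inconsistent_regions` finds the stale region only by comparing every region of every copy, which is the kind of full scan that becomes impractical for very large volumes of data.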
One method of restoring consistency between mirrors is to use one of the three sources of data (data 110, mirror 120A, or mirror 120B) as the valid copy and to copy data from the valid data source to the other two data sources. For example, data could be copied from data 110 to the two mirrors 120A and 120B. Typical prior art solutions have involved copying all of the data from the valid data source to the other data sources to ensure that all data are consistent. However, copying all of the data can be unacceptably time-consuming when dealing with very large volumes of data, such as terabytes of data. In addition, copying large volumes of data diverts resources away from maintaining current versions of primary data during the restoration. A faster way to restore and/or synchronize large volumes of data is needed.
Various techniques have been developed to speed the synchronization process of two inconsistent sets of data. One technique involves taking a snapshot of source data such as data 110 at a given point in time, and then tracking regions changed in the source data with reference to the snapshot. Only the changed regions are copied to synchronize the snapshot with the source data. Such a technique is described in further detail in related application Ser. No. 10/207,461, filed on Jul. 29, 2002, entitled “Maintaining Persistent Data Change Maps for Fast Data Synchronization and Restoration” and naming Michael E. Root, Anand A. Kekre, Arun M. Rokade, John A. Colgrove, Ronald S. Karr and Oleg Kiselev as inventors, the application being incorporated herein by reference in its entirety.
A snapshot of data can be made by “detaching” a mirror of the data so that the mirror is no longer being updated. FIG. 2 shows storage environment 100 after detaching mirror 120B. Detached mirror 120B serves as a snapshot of data 110 as it appeared at the point in time that mirror 120B was detached. When another update 106A is made to data 110, a corresponding update 106B is made to mirror 120A. However, no update is made to detached mirror 120B.
One solution to the problem of restoring data from a snapshot is to save the changes made to the data after the snapshot was taken. Saving the actual changes made to very large volumes of data can be problematic, however, introducing additional storage requirements. One way to reduce storage requirements for tracking changes is to use bitmaps, also referred to herein as maps, with the data divided into regions and each bit in the bitmap corresponding to a particular region of the data. Each bit is set to logical 1 (one) if a change to the data in the respective region has been made, and thus the bitmaps are sometimes referred to as data change maps. If the data have not changed, the respective bit is set to logical 0 (zero).
Accumulator map 210 is used to track changes made to data 110 after detached mirror (snapshot) 120B is detached. Three updates to data 110 are shown in the order in which the updates are made, including an update to region 2 (R2) in update 202, an update to region 6 (R6) in update 204, and an update to region 8 (R8) in update 206. Respective bits corresponding to respective regions R2, R6, and R8 are set to have a value of one in accumulator map 210 to indicate the regions that have changed in data 110 since detached mirror (snapshot) 120B was made.
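A data change map of this kind can be sketched in a few lines of Python. This is an illustration rather than the implementation described in the related applications: the class name `ChangeMap`, its methods, and the choice of eight regions numbered from 1 are all assumptions made for the example.

```python
class ChangeMap:
    """One bit per region; a set bit means the region changed since the snapshot."""

    def __init__(self, num_regions):
        self.num_regions = num_regions
        self.bits = 0  # all regions start as unchanged (logical 0)

    def mark(self, region):
        """Set the bit for a 1-based region number to logical 1."""
        if not 1 <= region <= self.num_regions:
            raise IndexError(f"region {region} out of range")
        self.bits |= 1 << (region - 1)

    def is_changed(self, region):
        """Return True if the bit for the given region is set."""
        return bool((self.bits >> (region - 1)) & 1)

    def changed_regions(self):
        """List the 1-based numbers of all regions whose bit is set."""
        return [r for r in range(1, self.num_regions + 1) if self.is_changed(r)]


# Record the three updates from the example: regions R2, R6, and R8.
accumulator = ChangeMap(num_regions=8)
for region in (2, 6, 8):
    accumulator.mark(region)

print(accumulator.changed_regions())  # [2, 6, 8]
```

The map costs one bit per region regardless of how much data each update writes, which is why it needs far less storage than saving the changes themselves.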
The changes tracked by accumulator map 210 can then be applied in either direction. For example, the changes can be applied to the snapshot when there is a need for the snapshot to reflect the current state of the data. Referring back to FIG. 2, after update 202 is made to region 2 of data 110, region 2 of detached mirror (snapshot) 120B is no longer “synchronized” with data 110. To be synchronized with data 110, detached mirror (snapshot) 120B can be updated by applying the change made in update 202 to region 2 of detached mirror (snapshot) 120B. This change can be accomplished by copying the contents of region 2 of data 110 to region 2 of detached mirror (snapshot) 120B.
Alternatively, to return to a previous state of the data before update 106A was made, the changed portion (region 2) of data 110 can be restored from (copied from) region 2 of detached mirror (snapshot) 120B. The change made in update 106A is thereby “backed out” without copying all of the data from the snapshot. The use of accumulator maps is described in further detail in the two related applications cited in the Cross Reference to Related Applications section of this application.
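Both directions, resynchronizing the snapshot and backing out a change, reduce to copying only the regions flagged in the map. The following Python sketch illustrates this under simplifying assumptions: `copy_changed_regions` is a hypothetical helper, regions are list entries, and the map is a list of booleans rather than a packed bitmap.

```python
def copy_changed_regions(source, target, changed):
    """Copy only the regions flagged in `changed` from source to target.

    Returns the number of regions copied.
    """
    copied = 0
    for region, is_changed in enumerate(changed):
        if is_changed:
            target[region] = source[region]
            copied += 1
    return copied


snapshot = ["a", "b", "c", "d"]   # detached mirror, frozen at detach time
data = list(snapshot)             # primary data, still being updated
changed = [False] * len(data)     # accumulator map, one flag per region

data[1] = "B"                     # update to region 2 (index 1)
changed[1] = True

# Direction 1: bring the snapshot up to date with the data.
resynced = list(snapshot)
print(copy_changed_regions(data, resynced, changed))      # 1
print(resynced == data)                                   # True

# Direction 2: back out the update by restoring from the snapshot.
restored = list(data)
print(copy_changed_regions(snapshot, restored, changed))  # 1
print(restored == snapshot)                               # True
```

In either direction only one of the four regions is copied, so the cost scales with the number of changed regions rather than with the total size of the data.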
To save physical disk space, changes can be stored in temporary data storage areas such as volatile memory, but those changes are vulnerable to computer system, hardware, and software failures. In addition, storing the changes in temporary data storage areas typically requires that the snapshot and the data are stored in a common physical storage area that can be accessed by a common volatile memory. A requirement that the snapshot and the data be stored in a common physical data storage area can limit the number of snapshots that can be made of the data in organizations having limited resources or a very large amount of data. Furthermore, many applications suffer severe performance problems when more than one snapshot of a set of data is made due to the overhead involved in writing the data to multiple places.
What is needed is the ability to quickly synchronize mirrored copies of data that have become inconsistent. The solution should enable mirrored copies of data to be synchronized following a system crash without copying all of the data from one mirrored copy to another. Changes to the data should survive computer system, hardware and software failures and require minimal storage space. The solution should have minimal impact on performance of applications using the data.