Information drives business. For businesses that increasingly depend on data and information for their day-to-day operations, unplanned downtime due to data loss or data corruption can hurt their reputations and bottom lines. Businesses are becoming increasingly aware of these costs and are taking measures to plan for and recover from data loss. Often these measures include protecting primary, or production, data, which is ‘live’ data used for operation of the business. Copies of primary data are made on different physical storage devices, and often at remote locations, to ensure that a version of the primary data is consistently and continuously available.
Typical uses of copies of primary data include backup, Decision Support Systems (DSS) data extraction and reports, testing, and trial failover (i.e., testing failure of hardware or software and resuming operations of the hardware or software on a second set of hardware or software). These copies of data are preferably updated as often as possible so that the copies can be used in the event that primary data are corrupted, lost, or otherwise need to be restored. Ensuring data consistency is critical to maintaining highly available data. The terms “consistent” and “consistency” are used herein to describe a backup copy of primary data that is either an exact copy of the primary data or an exact copy of primary data as the primary data existed at a previous point in time, which is referred to herein as a “snapshot.”
Two areas of concern when a hardware or software failure occurs, as well as during the subsequent recovery, are preventing data loss and maintaining data consistency between primary and backup data storage areas. One simple strategy to achieve these goals includes backing up data onto a storage medium such as a tape, with copies stored in an offsite vault. Duplicate copies of backup tapes may be stored onsite and offsite. However, recovering data from backup tapes requires sequentially reading the tapes. Recovering large amounts of data can take weeks or even months, which can be unacceptable in today's 24×7 business environment.
Other types of data storage areas take form as one or more physical devices, such as one or more dynamic or static random access storage devices, one or more magnetic or optical data storage disks, or one or more other types of storage devices. With respect to backup copies of primary data, preferably the backup storage devices are direct access storage devices such as disks rather than sequential access storage devices such as tapes. Because disks are often grouped to form a logical storage volume that is used to store backup copies of primary data, the term “storage area” is used interchangeably herein with “storage volume;” however, one of skill in the art will recognize that the systems and processes described herein are also applicable to other types of storage areas and that the use of the term “storage volume” is not intended to be limiting. A storage volume is considered to be made up of regions. A storage volume storing the primary data is referred to herein as a primary volume, and a storage area storing a backup copy of the primary data is referred to herein as a backup volume or a secondary volume. A storage volume storing a snapshot of the primary data is referred to herein as a snapshot volume. A node in a network managing the primary data/volume is referred to herein as a primary node, and a node in the network maintaining backup copies of the primary data but not the primary data itself is referred to herein as a secondary node.
One way to achieve consistency and avoid data loss is to ensure that every update made to the primary data is also made to the backup copy, preferably in real time. However, when a primary volume becomes corrupted and the result of the update corrupting the primary data is propagated to backup volumes, “backing out” the corrupted data and restoring the primary data to a previous state is required on every copy of the data that has been made. Previously, this problem has been solved by restoring the primary volume from a snapshot volume made before the primary data were corrupted. Once the primary volume hosting the primary data is restored, the entire primary volume is copied to each backup volume to ensure consistency between the primary data and backup copies. Only then can normal operations, such as updates and replication, of the primary volume resume.
One reason that the entire primary volume is copied to each backup location is that some applications, such as database applications, require that the updates made to the primary data be made to the backup copy of the primary data in the same order. For example, consider a database maintaining an inventory of 20 items. Assume that an order is received for 15 items, updating the number of items in inventory to 5. Assume then that an order is received for 7 items; 5 items are shipped to partially fulfill the order, updating the number of items in inventory to 0, and the remaining 2 items are placed on back order. If the backup copy of the inventory also starts with 20 items, and the order for 7 items is applied first, the backup copy is updated to reflect an inventory of 13 items, which is a state never reached in the primary data. If, at this point, the primary data were corrupted and the backup copy showing an inventory of 13 items were used to restore the primary data, data about the correct number of items in inventory would be lost.
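The inventory example can be replayed in a few lines of code. The following is a minimal sketch only; the function names `apply_order` and `replay` are hypothetical and are not drawn from any particular database application. It assumes that an order ships as many items as are in stock and places the remainder on back order.

```python
def apply_order(inventory, quantity):
    """Fill an order from stock; return (new_inventory, shipped, back_ordered)."""
    shipped = min(inventory, quantity)
    return inventory - shipped, shipped, quantity - shipped


def replay(initial, orders):
    """Apply orders in sequence and record every inventory state reached."""
    states = [initial]
    inventory = initial
    for quantity in orders:
        inventory, _shipped, _back_ordered = apply_order(inventory, quantity)
        states.append(inventory)
    return states


# Order in which the primary data received the updates.
primary_states = replay(20, [15, 7])
# Same updates applied to the backup copy in the wrong order.
backup_states = replay(20, [7, 15])

print("primary states:", primary_states)
print("backup states: ", backup_states)
# Any backup state that the primary never passed through is inconsistent.
print("states never reached on primary:",
      [s for s in backup_states if s not in primary_states])
```

Under these assumptions, the backup copy passes through an inventory of 13 items, which the primary data never held, so a restore taken from the backup at that moment would not correspond to any point in time of the primary data.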
To maintain a backup copy that ensures write ordering without copying the entire primary volume to each backup location, one technique is to send each update to another instance of the database application on the secondary node, and the secondary instance of the database application can apply the updates in order to the copy of the primary data maintained on the secondary node. However, maintaining duplicate application resources at the secondary nodes can be inefficient, particularly when the secondary nodes serve only as backup storage locations for the primary data.
The previously described technique of copying the entire primary volume solves the write-ordering problem and enables the restored primary data to be re-established on every backup copy without requiring that secondary nodes be used to re-process the updates to the data. However, copying the entire primary volume to each secondary volume uses network bandwidth unnecessarily when only a small subset of the primary data has changed. Furthermore, copying the entire primary volume across a network requires a significant amount of time to establish a backup copy of the data, especially when large amounts of data, such as terabytes of data, are involved. All of these factors delay the resumption of normal operations and can cost companies a large amount of money due to downtime.
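To make the bandwidth cost concrete, the sketch below models each volume as a list of regions and compares the number of regions a full copy transfers with the number of regions that actually differ between a restored primary volume and a backup volume. The region count, the region contents, and the name `changed_regions` are all hypothetical values chosen for illustration; the sketch only quantifies the waste and is not presented as the method of any particular system.

```python
def changed_regions(restored_primary, backup):
    """Return the indices of regions whose contents differ between two volumes."""
    if len(restored_primary) != len(backup):
        raise ValueError("volumes must have the same number of regions")
    return [
        index
        for index, (p, b) in enumerate(zip(restored_primary, backup))
        if p != b
    ]


# Hypothetical volume of 1000 regions, restored from a snapshot.
REGION_COUNT = 1000
restored_primary = [b"clean"] * REGION_COUNT

# The backup received the corrupting update in two regions only.
backup = list(restored_primary)
backup[3] = b"corrupt"
backup[17] = b"corrupt"

# A full copy transfers every region, regardless of what changed.
full_copy_regions = len(restored_primary)
differing = changed_regions(restored_primary, backup)

print("regions sent by a full copy:", full_copy_regions)
print("regions that actually differ:", len(differing), differing)
```

In this illustration a full copy sends all 1000 regions even though only two differ, and the same cost is paid again for each additional backup volume.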
What is needed is the ability to quickly synchronize copies of a single source of data that have diverged over time. The solution should enable copies of data to be synchronized without copying all of the data from one valid copy to each invalid copy, and yet maintain consistency of data without requiring duplicate resources at each secondary node. The solution should use minimal resources to maintain data consistency and have minimal effect on performance of applications using the data and on network usage.