Information drives business. For businesses that increasingly depend on data and information for their day-to-day operations, unplanned downtime due to data loss or data corruption can hurt their reputations and bottom lines. Businesses are becoming increasingly aware of the costs imposed by data corruption and loss and are taking measures to plan for and recover from such events. Often these measures include making backup copies of primary, or production, data, which is ‘live’ data used for operation of the business. Backup copies of primary data are made on different physical storage devices, and often at remote locations, to ensure that a version of the primary data is consistently and continuously available.
Two areas of concern when a hardware and/or software failure occurs, as well as during the subsequent recovery, are preventing data loss and maintaining data consistency between primary and backup data storage. Consistency ensures that, even if the backup copy of the primary data is not identical to the primary data (e.g., updates to the backup copy may lag behind updates to the primary data), the backup copy represents a state of the primary data that actually existed at a previous point in time. If an application completes a sequence of write operations A, B, and C to the primary data, consistency can be maintained by applying those write operations to the backup copy in the same order. The backup copy should not reflect a state that never actually occurred in the primary data, such as would have occurred if write operation C were completed before write operation B. Some write operations in the sequence may occur concurrently, and some or all of the write operations may be committed atomically to achieve a consistent state of the data on the secondary node.
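The ordering requirement described above can be illustrated with a minimal sketch. The block names `x`, `y`, and `z`, the dictionary representation of the data, and the helper functions are hypothetical and chosen only for illustration. The sketch treats a backup copy as consistent when it equals some state that the primary data actually held.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a write is (block name, new value).
Write = Tuple[str, str]


def states_after_each_write(writes: List[Write]) -> List[Dict[str, str]]:
    """Return a snapshot of the data before any write and after each write."""
    data: Dict[str, str] = {}
    history = [dict(data)]  # the initial, empty state
    for block, value in writes:
        data[block] = value
        history.append(dict(data))
    return history


def is_consistent(backup: Dict[str, str],
                  primary_history: List[Dict[str, str]]) -> bool:
    """A backup is consistent if it matches a state the primary actually held."""
    return backup in primary_history


# Write operations A, B, and C, completed on the primary in that order.
primary_writes = [("x", "A"), ("y", "B"), ("z", "C")]
primary_history = states_after_each_write(primary_writes)

# The backup lags by one write but preserved the order: A, then B.
lagging_backup = states_after_each_write(primary_writes[:2])[-1]

# The backup received C before B: it holds A and C, but not B.
reordered_backup = states_after_each_write(
    [primary_writes[0], primary_writes[2]]
)[-1]

print(is_consistent(lagging_backup, primary_history))    # True
print(is_consistent(reordered_backup, primary_history))  # False
```

The lagging backup is out of date yet consistent, whereas the reordered backup reflects a state that never occurred in the primary data.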
One way to achieve consistency and avoid data loss is to ensure that every update made to the primary data is also made to the backup copy, preferably in real time. Often such “duplicate” updates are made locally on one or more “mirror” copies of the primary data by the same application program that manages the primary data. Mirrored copies of the data are typically maintained on devices attached to or immediately accessible by the primary node, and thus are subject to failure of the primary node or corruption of data accessible via the primary node.
Therefore, making mirrored copies locally does not by itself prevent data loss, and primary data are often also replicated to secondary sites. Maintaining copies of data at remote sites, however, introduces another problem. When primary data become corrupted and the result of the update corrupting the primary data is propagated to backup copies of the data through replication, “backing out” the corrupted data and restoring the primary data to a previous state is required on every copy of the data that has been made. Previously, this problem has been solved by restoring the primary data from a backup copy made before the primary data were corrupted. Backup copies are commonly made on storage devices having the same access speed as the storage devices storing the primary data. Once the primary data are restored, the entire set of primary data is copied to each backup copy to ensure consistency between the primary data and backup copies. Only then can normal operations that use the primary data, such as updates and replication, resume.
The previously described technique of copying the entire set of primary data to each backup copy ensures that the data are consistent between the primary and secondary sites. However, copying the entire set of primary data to each backup copy at secondary sites uses network bandwidth unnecessarily when only a small subset of the primary data has changed. Furthermore, copying the entire set of primary data across a network requires a significant amount of time to establish a backup copy of the data, especially when large amounts of data, such as terabytes of data, are involved. Both of these factors delay the resumption of normal operations and can cost companies a large amount of money due to downtime.
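The time cost described above can be made concrete with a rough calculation. The figures are assumptions chosen only for illustration: a primary data set of 1 terabyte, a changed subset of 1 gigabyte, and a dedicated 1 gigabit-per-second link, with protocol overhead and contention ignored. The function name `transfer_seconds` is likewise hypothetical.

```python
def transfer_seconds(num_bytes: int, link_bits_per_second: float) -> float:
    """Idealized transfer time: payload bits divided by link rate."""
    return num_bytes * 8 / link_bits_per_second


TERABYTE = 10**12  # assumed size of the full primary data set
GIGABYTE = 10**9   # assumed size of the subset that actually changed
LINK_RATE = 1e9    # assumed link speed: 1 gigabit per second

full_copy = transfer_seconds(TERABYTE, LINK_RATE)
changed_only = transfer_seconds(GIGABYTE, LINK_RATE)

print(f"full copy:    {full_copy:.0f} s (about {full_copy / 3600:.1f} hours)")
print(f"changed data: {changed_only:.0f} s")
```

Under these assumptions, copying the full data set to a single backup copy takes 8000 seconds, or more than two hours, while the changed data alone would occupy the link for 8 seconds. The full-copy cost is incurred once for each backup copy.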
What is needed is the ability to maintain consistent, up-to-date copies of primary data that enable quick resumption of operations upon discovery of corruption of the primary data or failure of the primary node.