Data backup has been used for decades to protect users from loss of data due to disk storage failure. Recently, loss of data in a network server due to disk storage failure has become infrequent due to techniques such as redundant arrays of inexpensive disks (RAID) and remote data mirroring. Unfortunately, data backup is still needed for recovery from data corruption due to software bugs, user error, malicious software, and unauthorized access to storage.
Data corruption is often detected a significant amount of time after it occurs. Therefore, if a data object is modified by user read-write access over an extended period of time, it is desirable to make a series of point-in-time copies of this data object over the extended period of time. The data object that is modified by the user read-write access is referred to as a “production” data object, and the point-in-time copies are referred to as “snapshot” copies. If data corruption is detected in the production data object, then the snapshot copies are inspected to find the most recent snapshot copy that has not been corrupted. If such a most recent snapshot copy is found, then the production data object is restored with data read from this most recent snapshot copy.
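The recovery procedure just described can be illustrated with a minimal sketch. The names `Snapshot`, `find_restore_source`, `restore`, and the caller-supplied `is_corrupted` check are hypothetical and are introduced only for illustration; how corruption is actually detected is outside the scope of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Snapshot:
    taken_at: int  # hypothetical timestamp; a larger value means more recent
    data: bytes    # point-in-time copy of the production data object


def find_restore_source(
    snapshots: Sequence[Snapshot],
    is_corrupted: Callable[[bytes], bool],
) -> Optional[Snapshot]:
    """Return the most recent snapshot that passes the check, or None."""
    # Inspect the snapshot copies from newest to oldest.
    for snap in sorted(snapshots, key=lambda s: s.taken_at, reverse=True):
        if not is_corrupted(snap.data):
            return snap
    return None


def restore(
    production: bytearray,
    snapshots: Sequence[Snapshot],
    is_corrupted: Callable[[bytes], bool],
) -> bool:
    """Overwrite the production object from the newest clean snapshot."""
    source = find_restore_source(snapshots, is_corrupted)
    if source is None:
        return False  # no uncorrupted snapshot copy exists
    production[:] = source.data
    return True
```

Scanning from newest to oldest means the restored object loses as little recent work as possible, which is why a series of snapshots over an extended period is more useful than a single backup.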
Snapshot copies have been created and maintained at various levels in a data processing system. For example, in the open source software Concurrent Versions System (CVS), snapshot copies have been created and maintained at the application level, on top of the file system level provided by a file server. In the CVS, the storage server stores a current version of a project and its history, and clients connect to the server in order to “check out” a complete copy of the project, work on this copy, and then later “check in” their changes. To maintain consistency, the CVS only accepts changes made to the most recent version of a file. If the check-in operation succeeds, then the version numbers of all files involved are automatically incremented, and the CVS server writes a user-supplied description line, the date, and the author's name to its log files. Clients can also compare versions, request a complete history of changes, or check out a historical snapshot of the project as of a given date or revision number. Clients can also use an “update” command in order to bring their local copies up-to-date with the newest version on the server. CVS can also maintain different “branches” of a project, and uses delta compression for efficient storage of different versions of the same file.
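The consistency rule for check-in can be illustrated with a toy model. This is not the CVS implementation or its interface; the class `ToyRepository`, its methods, and the shape of the log record are assumptions made only to show how a server might reject changes based on a stale version.

```python
from typing import Dict, List, Tuple


class ToyRepository:
    """Toy model of a version server; all names here are hypothetical."""

    def __init__(self) -> None:
        # file name -> (revision number, content)
        self.versions: Dict[str, Tuple[int, str]] = {}
        # One log record per successful check-in.
        self.log: List[Tuple[str, int, str, str, str]] = []

    def check_out(self, name: str) -> Tuple[int, str]:
        """Return the current revision number and content of a file."""
        return self.versions.get(name, (0, ""))

    def check_in(self, name: str, base_revision: int, content: str,
                 author: str, description: str, date: str) -> bool:
        """Accept a change only if it was made to the most recent version."""
        current_revision, _ = self.versions.get(name, (0, ""))
        if base_revision != current_revision:
            return False  # the client's copy is stale; it must update first
        new_revision = current_revision + 1
        self.versions[name] = (new_revision, content)
        self.log.append((name, new_revision, date, author, description))
        return True
```

A client whose check-in is rejected would bring its local copy up to date and then retry against the new revision number.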
Snapshot copies have been created and maintained at the logical block or track level, below the file system level provided by a file server, in a fashion that is concurrent with and transparent to client access. For example, as described in Kedem U.S. Pat. No. 6,076,148 issued Jun. 13, 2000, a backup bitmap indicates the backup status of each track during a backup operation. The backup bitmap is used to generate one snapshot. A snapshot table can be used to generate a number of snapshots on an overlapping basis. Each snapshot table entry is associated with a corresponding-indexed track, and includes a plurality of snapshot flags. The information from each track that is transferred to the backup system is accompanied by a copy of the snapshot entry associated therewith so that the information associated with each snapshot copy can be compactly aggregated into one or more backup cartridges, rather than being randomly distributed thereover, so that information from only a few cartridges need be retrieved if a restoration is required.
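A snapshot table of this general kind might be sketched as follows. This is not the data structure of the cited patent; the class `SnapshotTable`, its methods, and the assumed meaning of a set flag (the track has not yet been transferred for that snapshot) are hypothetical choices made for illustration.

```python
from typing import List, Tuple


class SnapshotTable:
    """One entry per track; bit i of an entry is the flag for snapshot i."""

    def __init__(self, num_tracks: int, max_snapshots: int) -> None:
        self.max_snapshots = max_snapshots
        self.entries: List[int] = [0] * num_tracks

    def begin_snapshot(self, snapshot_id: int) -> None:
        """Mark every track as pending for the given snapshot."""
        if not 0 <= snapshot_id < self.max_snapshots:
            raise ValueError("snapshot_id out of range")
        mask = 1 << snapshot_id
        self.entries = [entry | mask for entry in self.entries]

    def transfer_track(self, track: int, track_data: bytes) -> Tuple[int, bytes]:
        """Send a track to backup together with a copy of its table entry."""
        entry = self.entries[track]
        self.entries[track] = 0  # the track is no longer pending for any snapshot
        return entry, track_data

    def pending_tracks(self, snapshot_id: int) -> List[int]:
        """List the tracks not yet transferred for the given snapshot."""
        mask = 1 << snapshot_id
        return [t for t, entry in enumerate(self.entries) if entry & mask]
```

Because the entry travels with the track data, the backup system can tell which snapshots a transferred track belongs to and group the data for each snapshot accordingly.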
The creation of snapshot copies at the logical block level can also be done in such a way as to avoid the backup of logical blocks that are not actually used. See, for example, Armangau et al. U.S. Pat. No. 6,792,518 B2 issued Sep. 14, 2004, and Tummala et al. U.S. Pat. No. 7,035,881 B2 issued Apr. 25, 2006.
The creation of snapshot copies can also be done at the file system level, so that file system indirect blocks and data blocks are shared among versions of a file. See, for example, Bixby et al., U.S. Patent Application Publication 2005/0065986 A1, published Mar. 24, 2005.
The decreasing cost of storage and processing resources is creating a problem of managing backup copies. It is becoming more costly to manage the backup copies than it is to create and store them. For example, it is convenient to back up everything that might be useful, and to defer the problem of finding the relevant information upon the unlikely but possible occurrence of data corruption. See, for example, Tzelnic et al. U.S. Pat. No. 6,366,987 B1 issued Apr. 2, 2002, in which a data storage system has an application interface that responds to a backup request by creating a catalog of information about the content of a physical storage unit that would be needed for restoring logical data structures from the backup version of the physical storage unit. Later, if and when the backup agent requests the restoration of a logical data structure, the application interface routine looks up the logical data structure in the catalog, issues a physical restore request to the data storage system to retrieve a backup version of the physical storage unit from backup data storage and load it into spare data storage, extracts the logical data structure from the physical storage unit in the spare data storage by performing a logical-to-physical translation, and restores the logical data structure into the current version of data storage.
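The physical-backup, logical-restore sequence described above can be sketched in simplified form. This is a toy model under stated assumptions, not the system of the cited patent: the class `ToyStorageSystem`, the representation of a physical storage unit as a single byte image, and the catalog layout of (offset, length) pairs are all hypothetical.

```python
from typing import Dict, Tuple


class ToyStorageSystem:
    """Toy model: physical backup with a catalog, followed by logical restore."""

    def __init__(self) -> None:
        self.current: Dict[str, bytes] = {}       # current data storage
        self.backup_units: Dict[str, bytes] = {}  # backup id -> physical image
        self.catalogs: Dict[str, Dict[str, Tuple[int, int]]] = {}

    def backup(self, backup_id: str) -> None:
        """Back up the physical unit and catalog where each structure lies."""
        image = bytearray()
        catalog: Dict[str, Tuple[int, int]] = {}
        for name, data in self.current.items():
            catalog[name] = (len(image), len(data))  # (offset, length)
            image += data
        self.backup_units[backup_id] = bytes(image)
        self.catalogs[backup_id] = catalog

    def restore_logical(self, backup_id: str, name: str) -> bytes:
        """Restore one logical data structure from a physical backup."""
        # 1. Look up the logical data structure in the catalog.
        offset, length = self.catalogs[backup_id][name]
        # 2. Physical restore: load the backup image into spare storage.
        spare = self.backup_units[backup_id]
        # 3. Logical-to-physical translation: extract the structure's bytes.
        data = spare[offset:offset + length]
        # 4. Restore the structure into the current version of data storage.
        self.current[name] = data
        return data
```

The catalog is what lets the restore touch only the requested logical data structure while leaving the rest of the current data storage unchanged.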