Despite the overall improvement in the reliability of data storage devices (e.g., disk drives), storage devices fail and it remains necessary to implement backup systems to protect against data loss. In a typical backup system, a backup agent executing on a primary computer system identifies data to be backed up, and then communicates a copy of the identified data to a secondary computer system where it is stored. Accordingly, if data loss occurs as a result of a failed data storage device at the primary system, the data can be restored by copying data from the secondary system back to a new data storage device at the primary system.
One of the first requirements in a process for backing up data is to identify the data to be backed up. One way of identifying data to be backed up is simply to analyze a list of directories and/or files, from one or more volumes, selected by a user (i.e., the target data). For example, utilizing this method, a backup agent executing at the primary computer system may periodically copy certain target files and/or directories from one or more volumes of the primary system to the secondary system. This method is inefficient because it does not take into consideration whether a target file or directory has changed since the last backup operation was performed. For instance, a target file may be copied from the primary computer system to the secondary computer system each time a backup operation is initiated, even if the target file has not changed in the time period between backup operations. As a result, the secondary system may store multiple copies of the same file, thereby wasting valuable storage space.
To avoid this inefficiency, many backup systems utilize some form of an incremental backup procedure. With an incremental backup procedure, an initial backup operation is performed to copy all user-selected directories and/or files, from one or more volumes, from a primary computer system to a secondary computer system. This initial backup is sometimes referred to as a baseline backup. After the baseline backup operation, a periodic incremental backup operation is performed to copy only those directories and/or files, within the user-selected set, that have changed since the most recent backup operation, whether that was the baseline backup or a subsequent incremental backup. Incremental backup operations may be file-based, in which case the entire file that has changed is included in the incremental backup, or block-based, in which case only the individual blocks (of the file) that have changed are included in the incremental backup. Because many files will have only minor changes from one backup operation to the next, a block-level incremental backup scheme generally conserves more storage at the secondary computer system than a file-based backup scheme.
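The difference between the two incremental schemes can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny `BLOCK_SIZE`, the in-memory dictionaries standing in for the primary system's files, and the helper names `file_based_increment` and `block_based_increment`. It compares blocks directly and does not handle files that shrink or are deleted.

```python
BLOCK_SIZE = 4  # hypothetical, deliberately tiny so the example is easy to follow


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def file_based_increment(old: dict[str, bytes], new: dict[str, bytes]) -> dict[str, bytes]:
    """Return, in full, every file whose content differs from the previous backup."""
    return {name: data for name, data in new.items() if old.get(name) != data}


def block_based_increment(
    old: dict[str, bytes], new: dict[str, bytes], block_size: int = BLOCK_SIZE
) -> dict[str, dict[int, bytes]]:
    """Return only the changed blocks, keyed by file name and block index."""
    increment: dict[str, dict[int, bytes]] = {}
    for name, data in new.items():
        old_blocks = split_blocks(old.get(name, b""), block_size)
        changed = {}
        for index, block in enumerate(split_blocks(data, block_size)):
            # A block is new or changed if it has no counterpart or differs from it.
            if index >= len(old_blocks) or old_blocks[index] != block:
                changed[index] = block
        if changed:
            increment[name] = changed
    return increment


# Baseline contents and current contents; one byte of a.txt has changed.
baseline = {"a.txt": b"AAAABBBBCCCC", "b.txt": b"unchanged"}
current = {"a.txt": b"AAAAXBBBCCCC", "b.txt": b"unchanged"}

file_inc = file_based_increment(baseline, current)
block_inc = block_based_increment(baseline, current)

file_bytes = sum(len(data) for data in file_inc.values())
block_bytes = sum(len(b) for blocks in block_inc.values() for b in blocks.values())
print(file_bytes, block_bytes)
```

In this sketch the file-based increment carries the whole 12-byte file, while the block-based increment carries only the single 4-byte block that changed.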
With a block-level incremental backup scheme, there are several methods of identifying the particular blocks that have changed since the last backup operation and are therefore to be included in the current incremental backup. One way to identify the changed blocks is to perform a checksum operation on the individual data blocks. For example, the number of set bits in a particular data block may be calculated and compared to the number of set bits calculated for that block in a previous checksum operation. A checksum value that changes over a period of time for a particular data block indicates that the data block has changed. The problem with this approach is that it does not scale well to large sets of data. The checksum operation is processor intensive because it must be performed on all data blocks in the monitored set of directories and files, including those data blocks that have not changed. That is, the processing time required to identify the changed data blocks is a function of the number of directories and files to be backed up, and the overall size of the changed files, without regard for how small a change is made to any particular file. Consequently, from the time a backup operation is initially requested, there may be a significant and undesirable processing delay as the backup agent attempts to identify the data blocks that have changed since the most recently completed backup operation.
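The checksum comparison described above can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not any particular backup agent's implementation: the names `set_bit_count`, `checksum_all`, and `changed_blocks` are hypothetical, and blocks are modeled as an in-memory list of byte strings.

```python
def set_bit_count(block: bytes) -> int:
    """Checksum used in this sketch: the number of set bits in the block."""
    return sum(bin(byte).count("1") for byte in block)


def checksum_all(blocks: list[bytes]) -> list[int]:
    """Compute the checksum of every block; every block must be read."""
    return [set_bit_count(block) for block in blocks]


def changed_blocks(previous: list[int], blocks: list[bytes]) -> tuple[list[int], list[int]]:
    """Return (indices of blocks whose checksum changed, new checksum list)."""
    current_sums = checksum_all(blocks)  # cost grows with the total number of blocks
    changed = [
        index
        for index, checksum in enumerate(current_sums)
        if index >= len(previous) or previous[index] != checksum
    ]
    return changed, current_sums


# Checksums recorded at the previous backup operation.
blocks = [b"\x00\x00", b"\x0f\x00", b"\xff\xff"]
previous_sums = checksum_all(blocks)

# A single block is modified before the next backup operation.
blocks[1] = b"\x0f\x01"
changed, current_sums = changed_blocks(previous_sums, blocks)
blocks_scanned = len(current_sums)
print(changed, blocks_scanned)
```

Although only one block changed, all three blocks had to be read and checksummed to find it, which is the scaling problem noted above. A set-bit count is also a weak checksum: two different blocks with the same number of set bits (for example, `b"\x01"` and `b"\x02"`) are indistinguishable to it.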
Another way to identify the changed blocks is to analyze an attribute associated with each block. For instance, each block may have associated with it a modification time indicating the time at which the block was last modified, or an archive or backup bit indicating, when set (or cleared), that the block is to be included in the next backup operation. This requires that the operation for writing data to the data block also include logic for setting the appropriate attribute (e.g., the archive or backup bit). This approach has drawbacks for many of the same reasons that the checksum operation is problematic. Like the checksum method, it may cause a significant delay between the time that a backup operation is requested and the time that the data to be included in the backup are identified. The attribute that indicates whether a particular block is to be included in a backup must be read for each block in the monitored data set, regardless of whether the block is to be included in the backup. Accordingly, like the checksum method, this approach does not scale well to large data sets.
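The attribute-based approach can be sketched in a similar way. The `Block` and `Volume` classes below are hypothetical stand-ins, assuming a simple in-memory volume whose write path sets an archive bit and whose backup scan reads the bit of every block.

```python
from dataclasses import dataclass


@dataclass
class Block:
    data: bytes
    archive_bit: bool = False  # set: block must be included in the next backup


class Volume:
    """Hypothetical in-memory volume of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 4) -> None:
        self.blocks = [Block(bytes(block_size)) for _ in range(num_blocks)]

    def write(self, index: int, data: bytes) -> None:
        """The write path must also set the attribute, as the text notes."""
        self.blocks[index].data = data
        self.blocks[index].archive_bit = True

    def collect_for_backup(self) -> tuple[dict[int, bytes], int]:
        """Return (blocks to back up, number of attributes inspected)."""
        selected = {}
        inspected = 0
        for index, block in enumerate(self.blocks):
            inspected += 1  # the attribute is read for every block
            if block.archive_bit:
                selected[index] = block.data
                block.archive_bit = False  # clear once captured
        return selected, inspected


volume = Volume(num_blocks=1000)
volume.write(7, b"new!")

selected, inspected = volume.collect_for_backup()
second_selected, second_inspected = volume.collect_for_backup()
print(len(selected), inspected, len(second_selected), second_inspected)
```

Only one block is selected by the first scan, yet all 1000 attributes are inspected; the second scan inspects all 1000 again and selects nothing.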
Yet another method for identifying changed data blocks involves logging the changes made as they occur. For instance, a logging process may generate a sequential change log indicating all the blocks in a monitored set of data that have changed. As a result, a long list of block information may be included in the change log. Here again, processing the change log may cause a significant delay from the time a backup operation is initially requested, particularly when a single sequential log is used for a large set of data. A sequential log file often results in duplicate log entries for the same block of data. For instance, a particular data block of a file may be changed several times between two backup operations, and only the last change made to the block is required to be included in the backup. In the log file, the block change information for two separate changes may be separated by several other log entries. Consequently, to determine the actual data for a particular block that has been changed multiple times, the entire log must be processed. Therefore, a more efficient mechanism for identifying data to be included in a backup is desirable.
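The duplicate-entry problem of a single sequential change log can be illustrated with a short Python sketch. The log format (a list of block index and data pairs) and the function name `coalesce` are assumptions made for illustration only.

```python
def coalesce(log: list[tuple[int, bytes]]) -> dict[int, bytes]:
    """Walk the entire sequential log, keeping only the last change per block."""
    latest: dict[int, bytes] = {}
    for block_index, data in log:
        # A later entry for the same block replaces any earlier one.
        latest[block_index] = data
    return latest


# Block 3 is changed three times between two backup operations, with
# entries for other blocks interleaved between those changes.
change_log = [
    (3, b"v1"),
    (8, b"x"),
    (5, b"y"),
    (3, b"v2"),
    (8, b"z"),
    (3, b"v3"),
]

to_back_up = coalesce(change_log)
print(len(change_log), len(to_back_up))
```

Six log entries reduce to three blocks, but the final state of block 3 is known only after the last entry has been read, so the whole log must be processed before the backup set is determined.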