1. Field of the Invention
This invention relates generally to data backup software for computer systems. More particularly, the invention relates to backup software which operates to efficiently back up files in a de-duplication storage system.
2. Description of the Related Art
Large organizations often use backup storage systems which back up files used by a plurality of client computer systems. The backup storage system may utilize data de-duplication techniques to reduce the amount of data that has to be stored. For example, it is possible that a file changes little or not at all from one backup to the next. De-duplication techniques can be utilized so that portions of the file data which have already been backed up do not need to be backed up again. The file may be split into multiple segments, and the file segments may be individually stored in the backup storage system as segment objects. When a new version of the file is backed up, the backup software may check whether or not segment objects representing the current file segments are already stored in the backup storage system. Each segment object which is already stored may be referenced again without storing a new duplicate of the segment object.
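By way of illustration only, the segment-level de-duplication described above may be sketched as follows. The fixed segment size, the use of SHA-256 fingerprints to identify segment objects, the in-memory dictionary standing in for the backup storage system, and the names split_into_segments and backup_file are all assumptions made for this sketch and are not taken from any particular backup product.

```python
import hashlib

# Hypothetical segment size, kept tiny so the example is easy to follow.
SEGMENT_SIZE = 4


def split_into_segments(data: bytes, size: int = SEGMENT_SIZE) -> list:
    """Split file data into fixed-size segments (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def backup_file(data: bytes, store: dict) -> tuple:
    """Back up one file version into `store`.

    `store` maps a segment fingerprint to the stored segment object.
    Returns the ordered list of fingerprints that make up this version
    and the number of segment objects that had to be newly stored.
    """
    fingerprints = []
    newly_stored = 0
    for segment in split_into_segments(data):
        fp = hashlib.sha256(segment).hexdigest()
        if fp not in store:
            # Segment object not yet present: store it once.
            store[fp] = segment
            newly_stored += 1
        # Present or not, this version references the segment object.
        fingerprints.append(fp)
    return fingerprints, newly_stored


store = {}
v1, new1 = backup_file(b"AAAABBBBCCCC", store)  # first version: 3 segments stored
v2, new2 = backup_file(b"AAAABBBBDDDD", store)  # second version: only 1 new segment
restored_v2 = b"".join(store[fp] for fp in v2)
```

In this sketch the second version shares its first two segments with the first version, so only the changed segment is stored again, and either version can be reassembled from its list of fingerprints.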
If a particular version of a file is deleted from the backup storage system, the underlying segment objects referenced by the version also need to be deleted, but only if they are not referenced by other versions of the file (or by other files). The backup software may store reference information for each segment object to determine when the segment object can be deleted. When each respective version of the file is added to the system, the reference information for each segment object used by the respective version may be updated to indicate that it is used by the respective version. Similarly, when each respective version of the file is deleted from the system, the reference information for each segment object used by the respective version may be updated to indicate that it is no longer used by the respective version. When the reference information for a given segment object indicates that it is no longer used by any version of any file, the given segment object can be deleted.
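The reference bookkeeping described above may be sketched, purely as an example, with a set of referencing versions kept for each segment object. The class name ReferenceTracker, its method names, and the update_count field are hypothetical and are introduced only for this illustration; an actual system may store reference information quite differently.

```python
from collections import defaultdict


class ReferenceTracker:
    """Hypothetical per-segment reference information."""

    def __init__(self):
        # segment id -> set of version ids that currently use the segment
        self.refs = defaultdict(set)
        # number of per-segment reference updates performed so far
        self.update_count = 0

    def add_version(self, version_id, segment_ids):
        """Record that `version_id` uses every segment in `segment_ids`."""
        for seg in set(segment_ids):
            self.refs[seg].add(version_id)
            self.update_count += 1

    def delete_version(self, version_id, segment_ids):
        """Remove the version's references; return segments now safe to delete."""
        deletable = []
        for seg in set(segment_ids):
            self.refs[seg].discard(version_id)
            self.update_count += 1
            if not self.refs[seg]:
                # No version of any file uses this segment any longer.
                del self.refs[seg]
                deletable.append(seg)
        return sorted(deletable)


tracker = ReferenceTracker()
tracker.add_version("v1", ["a", "b", "c"])
tracker.add_version("v2", ["a", "b", "d"])
freed_after_v1 = tracker.delete_version("v1", ["a", "b", "c"])
freed_after_v2 = tracker.delete_version("v2", ["a", "b", "d"])
```

Deleting the first version frees only the segment that no other version references; the shared segments remain until the second version is also deleted. Note that every add or delete touches the reference information of every segment the version uses, whether or not the segment data itself changed.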
Unfortunately, updating the reference information for each segment object can be inefficient. For example, consider a large database file several hundred gigabytes in size. It is likely that only a small percentage, e.g., 10%, of the segments of the file change from one backup to the next. Although the 90% of the segments which are unchanged can be re-used, the reference information for each one still needs to be updated, which adds significant performance overhead to the backup operation.
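To give a rough sense of scale, the following sketch counts the reference updates implied by the example above. The 300 GiB file size and the 128 KiB segment size are assumed values chosen only for illustration; the 10% change rate is the example figure given above.

```python
def reference_update_counts(file_size_bytes, segment_size_bytes, changed_percent):
    """Return (total, changed, unchanged) segment counts for one backup."""
    # Ceiling division: a partial trailing segment still counts as a segment.
    total = -(-file_size_bytes // segment_size_bytes)
    changed = total * changed_percent // 100
    unchanged = total - changed
    return total, changed, unchanged


GIB = 2 ** 30
KIB = 2 ** 10

# Assumed sizes: 300 GiB file, 128 KiB segments, 10% of segments changed.
total, changed, unchanged = reference_update_counts(300 * GIB, 128 * KIB, 10)
```

Under these assumptions the file has 2,457,600 segments, of which 245,760 must be stored anew, while the remaining 2,211,840 unchanged segments each still require a reference update for that single backup.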
Some backup storage systems need to update the reference information for existing segment objects tens or hundreds of millions of times each day. In some systems, the time needed to update the reference information accounts for the majority of the overall time needed to perform the backup operations. Thus, updating the reference information is a limiting factor in the scalability of some de-duplication storage systems.