In order to preserve data, an organization will routinely perform backup and archiving operations of its file systems. A backup operation may either be a full backup session, where every file is copied to archive media, or it may be an incremental backup session, where less than every file is copied. An organization may schedule periodic full backup sessions, and may include incremental backups between successive full backups. Since an incremental backup only includes the changes to the file system since the previous backup session, it may take less time than a full backup.
During a backup session, data files are transmitted, or “streamed,” from a file system to a backup server. Backed up files may be stored in the backup server, or the backup server may coordinate writing the streamed data files to one or more archive media, such as a tape or disk. The resulting collection of backed up files produced after a backup operation is known as a backup image. Backup images are stored on tape, disk or other archive media and may be kept in a vault or other off-site location for disaster recovery purposes. Over time, an organization may collect large quantities of archive media containing these backup images.
In a large organization that uses a network of computers and servers, it is not uncommon for computers to share files or to store exact copies of files. For example, two or more computers may contain copies of certain operating system files, templates, emails or other data. Users may transmit data files to other users, who then keep these files without making any changes to them. As a result, the computers and servers on the network may have large amounts of data redundancies. As a further result, backups of these network file systems may also contain large amounts of data redundancies. Because of the space required for archive media, and the cost to retain long-term storage of backup images, needless copies of data files waste valuable storage space and drive up the cost of archiving data.
Data redundancies may also occur between scheduled backup sessions, especially in cases when data files do not change between those backup operations. In other words, a file may be backed up multiple times in successive backup sessions, even though each backed up file is exactly the same in each subsequent backup image. This may occur between full backup sessions, or even between incremental backup sessions. This is because any change to a data file, even an insubstantial change to a data file's metadata, may cause the file to be backed up even though that file's actual content may not be any different from the last backup session. The result may be a backup image that is substantially the same as a previous backup image. Such data redundancies between backup images also waste valuable storage space.
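The effect described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a backup application selects files by comparing modification timestamps, and that content identity can be approximated with a SHA-256 digest; neither detail is specified in the text. The function names (content_digest, selected_by_mtime, selected_by_content, demo) are hypothetical.

```python
import hashlib
import os
import tempfile


def content_digest(path):
    """Return the SHA-256 hex digest of a file's content (assumed identity check)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def selected_by_mtime(path, last_backup_time):
    """Metadata-based selection: any newer modification time triggers a backup."""
    return os.stat(path).st_mtime > last_backup_time


def selected_by_content(path, last_backup_digest):
    """Content-based selection: only a change in the file's bytes triggers a backup."""
    return content_digest(path) != last_backup_digest


def demo():
    """Touch a file without changing its content and compare both selection rules."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "report.txt")
        with open(path, "wb") as handle:
            handle.write(b"quarterly figures")

        info = os.stat(path)
        last_backup_time = info.st_mtime
        last_backup_digest = content_digest(path)

        # Simulate a metadata-only change: move the modification time forward
        # by 60 seconds while leaving the file's bytes untouched.
        os.utime(path, (info.st_atime, last_backup_time + 60))

        return (
            selected_by_mtime(path, last_backup_time),
            selected_by_content(path, last_backup_digest),
        )
```

Under these assumptions, demo() returns a pair in which the timestamp rule selects the file for backup while the content rule does not, which is the redundancy between successive backup images that the paragraph describes.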
In response, some backup applications perform checks of file system data prior to backup or archive to ensure that unchanged files are not backed up more than once. Such efforts are called data deduplication or single instance storage. In other words, only a single copy, or “instance,” of a file is backed up. For example, during a backup session, a backup application enabled with deduplication software or hardware will read streamed data files to see if an instance of a file has already been backed up during that session. This may occur before the backed up version of that file is written to media. As a result, the deduplication software or hardware will permit storage of a single instance of a data file each time that file is encountered for the first time. Once that data file has been added to the backup image or written to media, the deduplication software or hardware will disregard any additional instances of the file, and will only retain a single instance. It does not matter if the other instances of a file exist on other file systems; if that file system has been included in the backup operation, only a single instance is stored. The deduplication software or hardware will track where the multiple instances of that data file occurred, so that during recovery, those multiple instances will be restored even though only a single instance was stored.
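The single instance behavior described above can be sketched as follows. This is an illustrative model only: it assumes that instances are identified by a SHA-256 digest of file content and that the backup image is held in memory. The class name SingleInstanceStore and its methods are hypothetical and do not describe any particular deduplication product.

```python
import hashlib


class SingleInstanceStore:
    """Toy model of single instance storage across one or more file systems."""

    def __init__(self):
        # digest -> content; holds exactly one copy of each unique content
        self.blobs = {}
        # (file_system, path) -> digest; records every place an instance occurred
        self.catalog = {}

    def backup(self, file_system, path, content):
        """Record a streamed file; store its bytes only on first encounter.

        Returns True if the content was newly stored, False if it was a duplicate.
        """
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = content
        # Track the location even for duplicates so it can be restored later.
        self.catalog[(file_system, path)] = digest
        return is_new

    def restore(self):
        """Rebuild every tracked instance from the single stored copies."""
        return {
            location: self.blobs[digest]
            for location, digest in self.catalog.items()
        }
```

For example, backing up the same template bytes from two hypothetical file systems, "hostA" and "hostB", stores one entry in blobs and two entries in catalog, and restore() returns the content for both locations. A production system would also need to guard against digest collisions and persist both tables; those concerns are outside this sketch.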
Because deduplication preserves single instances of files for backup, archives are more streamlined and require less archive media, and therefore less storage space. In addition, deduplication reduces network traffic during streaming and writing to the archive media, as well as during restoration of the backed up data objects. Once implemented, deduplication reduces the amount of memory required during backup.
However, deduplication is only available once a deduplication engine has been installed to work with a backup application. In other words, prior art deduplication methods only benefit future backup sessions. Currently, there is no way to perform deduplication on previously stored backup images. Once file system data objects have been backed up or archived to create a backup image and/or written to media, prior art deduplication utilities cannot determine whether multiple instances of data objects exist in the backup image. Further, there is no way to deduplicate between backup images, such that a file that does not change between successive backup sessions is stored as a single instance. What is therefore needed is a way to extend the benefit of deduplication to previously stored backup images, thereby reducing the size of legacy archives while still preserving the integrity of the backed up data.