It is common for enterprises to back up their data from time to time. For instance, a business may have one or more storage volumes that are backed up daily or weekly in order to preserve records and/or to provide data recovery in the event that one or more of the storage volumes becomes inoperable or inaccessible.
Backup devices may store very large amounts of data, and therefore it may be desirable in some instances to perform deduplication before backing up data from a primary volume to a backup volume. To the extent that data can be deduplicated, the removal of the duplicate data may in some cases provide significant storage space savings.
Some conventional techniques for network storage implement file systems that employ pointers to point to the underlying data. The underlying data is arranged in data blocks. A given file may point to multiple blocks, and a block may be associated with multiple files. Furthermore, a given file may include data that is duplicated in another file. For instance, a storage volume may include multiple email inboxes, each inbox including a particular email attachment. In most scenarios it would be undesirable to back up multiple copies of the email attachment because doing so would waste storage resources. Some conventional deduplication operations avoid saving multiple copies of a piece of data by keeping only a single copy of the data and replacing the duplicate copies with pointers to the single copy. As a result, multiple files are associated with the same data, but duplicate copies of the data are avoided.
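By way of illustration only, the pointer-based arrangement described above may be sketched as follows. The sketch is written in Python and is not drawn from any particular file system; the names `BlockStore`, `write_file`, `read_file`, and `BLOCK_SIZE` are hypothetical, and the block size is kept artificially small so that the behavior is easy to follow.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # hypothetical, unrealistically small block size for illustration


@dataclass
class BlockStore:
    """Toy volume in which files are lists of pointers to shared data blocks."""
    blocks: list = field(default_factory=list)  # block id -> block contents
    index: dict = field(default_factory=dict)   # block contents -> block id
    files: dict = field(default_factory=dict)   # file name -> list of block ids

    def write_file(self, name: str, data: bytes) -> None:
        pointers = []
        for offset in range(0, len(data), BLOCK_SIZE):
            chunk = data[offset:offset + BLOCK_SIZE]
            block_id = self.index.get(chunk)
            if block_id is None:
                # First time this content is seen: keep a single copy.
                block_id = len(self.blocks)
                self.blocks.append(chunk)
                self.index[chunk] = block_id
            # A duplicate is recorded only as a pointer to the existing copy.
            pointers.append(block_id)
        self.files[name] = pointers

    def read_file(self, name: str) -> bytes:
        # Follow the pointers to reassemble the file contents.
        return b"".join(self.blocks[i] for i in self.files[name])
```

In this sketch, writing the same attachment into two inboxes stores its blocks once; both files then hold identical pointer lists, and either file can still be read back in full.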
Deduplication operations may include a change logging function to indicate the data that is being added to a backup volume. To implement the change logging function, a particular deduplication operation may include a fingerprinting process to create an identifier for each data block that is to be backed up. In some examples, the fingerprinting process includes a hash operation to create a data string for each data block. If two blocks have the same data string (i.e., the same fingerprint), it is an indication that the blocks are probably the same. The change log includes the fingerprints of the data that is to be backed up, and a backup manager application can compare the fingerprints to each other to determine whether any of the data blocks listed in the change log are duplicates.
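As a non-limiting sketch, the fingerprinting and change log comparison described above might look like the following Python code. SHA-256 is assumed here as the hash operation purely for illustration; the text does not name a specific hash. The function names `fingerprint`, `build_change_log`, and `find_duplicates` are hypothetical. Because a matching fingerprint only indicates that two blocks are probably the same, the sketch also compares the block contents before treating a block as a duplicate.

```python
import hashlib
from collections import defaultdict


def fingerprint(block: bytes) -> str:
    """Hash a data block into a fixed-length data string (assumed: SHA-256)."""
    return hashlib.sha256(block).hexdigest()


def build_change_log(blocks):
    """Return one (block number, fingerprint) entry per block to be backed up."""
    return [(number, fingerprint(block)) for number, block in enumerate(blocks)]


def find_duplicates(change_log, blocks):
    """Map each duplicate block number to the earlier block it duplicates."""
    by_fingerprint = defaultdict(list)
    for number, fp in change_log:
        by_fingerprint[fp].append(number)

    duplicates = {}
    for numbers in by_fingerprint.values():
        keeper = numbers[0]
        for number in numbers[1:]:
            # Equal fingerprints suggest equal blocks; confirm byte for byte.
            if blocks[number] == blocks[keeper]:
                duplicates[number] = keeper
    return duplicates


pending = [b"aaaa", b"bbbb", b"aaaa", b"cccc", b"bbbb"]
log = build_change_log(pending)
print(find_duplicates(log, pending))  # prints {2: 0, 4: 1}
```

Grouping change log entries by fingerprint avoids comparing every block against every other block; only blocks that share a fingerprint are compared directly.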
However, as backup operations become more complex and sophisticated, such simple, conventional deduplication operations may benefit from updating.