Data deduplication may reduce the amount of storage space used in a storage system by detecting redundant copies of data and preventing those copies from being stored to the storage system. For example, if multiple instances of a file exist in a deduplicated data system, the deduplicated data system may store a single instance of the file and link all instances of the file to the single stored instance. If one of the instances of the file is later modified, the deduplicated data system may break the link between the modified instance and the single stored instance and store the modified instance of the file separately.
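By way of illustration only, the single-instance behavior described above may be sketched in Python as follows. The class name `SingleInstanceStore` and its methods are hypothetical and are not drawn from any particular product; the sketch holds all data in memory and matches instances by comparing their contents directly, which a practical system would be unlikely to do at scale.

```python
class SingleInstanceStore:
    """Toy in-memory store that keeps one stored copy per distinct content.

    All names here are illustrative assumptions, not an actual system's API.
    """

    def __init__(self):
        self.blobs = {}        # blob id -> stored bytes (the single instances)
        self.refs = {}         # blob id -> number of file names linked to it
        self.links = {}        # file name -> blob id
        self._by_content = {}  # stored bytes -> blob id (direct content match)
        self._next_id = 0

    def write(self, name, data):
        """Create or modify a file, linking it to a single stored instance."""
        if name in self.links:
            # The instance is being modified: break its existing link first.
            self._unlink(name)
        blob_id = self._by_content.get(data)
        if blob_id is None:
            # No stored instance has this content, so store it separately.
            blob_id = self._next_id
            self._next_id += 1
            self.blobs[blob_id] = data
            self._by_content[data] = blob_id
            self.refs[blob_id] = 0
        self.refs[blob_id] += 1
        self.links[name] = blob_id

    def read(self, name):
        """Return the content of a file by following its link."""
        return self.blobs[self.links[name]]

    def _unlink(self, name):
        """Drop a file's link and discard the instance if nothing else uses it."""
        blob_id = self.links.pop(name)
        self.refs[blob_id] -= 1
        if self.refs[blob_id] == 0:
            data = self.blobs.pop(blob_id)
            del self._by_content[data]
            del self.refs[blob_id]


store = SingleInstanceStore()
store.write("a.txt", b"report")
store.write("b.txt", b"report")      # linked to the same stored instance
shared = len(store.blobs)            # one stored instance for two files
store.write("b.txt", b"report v2")   # link broken; modified copy stored separately
separate = len(store.blobs)          # two stored instances
```

In this sketch, writing identical content under two names results in one stored instance, and modifying one of the names leaves the other name's content unchanged.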
Data deduplication involves identifying redundant copies of the same data. Because of the processing requirements involved in comparing each incoming unit of data with each unit of data that is already stored in a single-instance data storage system, redundant copy identification is usually performed by generating and comparing smaller data signatures (“fingerprints”) of each data unit instead of comparing the data units themselves. The detection of redundant copies generally involves generation of a new fingerprint for each unit of data to be stored to the single-instance data storage system and comparison of the new fingerprint to existing fingerprints of data units already stored by the single-instance data storage system. If the new fingerprint matches an existing fingerprint, a copy of the unit of data is likely already stored in the single-instance data storage system.
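The fingerprint comparison described above may be sketched, under stated assumptions, as follows. The use of SHA-256 from Python's standard `hashlib` module is an assumption made for illustration, as the description above does not specify a fingerprint function; the names `fingerprint` and `FingerprintIndex` are likewise hypothetical. Because a matching fingerprint indicates only that a copy is likely already stored, the sketch also confirms a match by comparing the data units themselves, which is an optional step.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a small signature of a data unit (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


class FingerprintIndex:
    """Toy single-instance store indexed by fingerprint (illustrative only)."""

    def __init__(self):
        self.units = {}  # existing fingerprint -> stored data unit

    def store(self, data: bytes) -> bool:
        """Store a data unit unless a copy is already present.

        Returns True if a new copy was stored and False if the unit was
        identified as redundant.
        """
        new_fp = fingerprint(data)          # generate the new fingerprint
        existing = self.units.get(new_fp)   # compare with existing fingerprints
        if existing is None:
            self.units[new_fp] = data       # no match: store the unit
            return True
        if existing != data:
            # Matching fingerprints with different data: a collision, which
            # this toy sketch does not attempt to handle.
            raise ValueError("fingerprint collision")
        return False                        # match confirmed: redundant copy


index = FingerprintIndex()
first = index.store(b"block of data")    # stored
second = index.store(b"block of data")   # detected as redundant
third = index.store(b"another block")    # stored
```

Comparing fixed-size fingerprints rather than whole data units keeps the cost of each lookup independent of the size of the data units, which is the motivation given above for the fingerprint approach.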
Unfortunately, traditional data deduplication techniques may perform poorly when some files within a deduplicated data system are archived. An archival system may archive a file by moving the file to an archival storage system and leaving a placeholder file (e.g., a “stub” file) in the place of the archived file. When the archival system later identifies an attempt to access the archived file (i.e., an attempt to access the placeholder file), the archival system may retrieve the archived file from the archival storage system, overwriting the placeholder file. When the archival system replaces a deduplicated file with a placeholder file, a deduplication system may determine that the deduplicated file has changed, and the deduplicated file may lose its deduplication links. When the archival system subsequently retrieves the previously deduplicated file, the deduplication system may again detect a change and reprocess the previously deduplicated file (e.g., by generating a new fingerprint). Accordingly, the deduplication system may lose deduplication information about archived files and may perform redundant operations on those files. Therefore, the instant disclosure identifies a need for additional and improved systems and methods for deduplicating archived data.