A computer system typically includes a computer file-system. The file-system could be a de-duplicated file-system.
Problems with Backing Up De-Duplicated File-Systems
Computer systems (e.g., server computer systems) need the ability to perform efficient data de-duplication. Backup solutions for computer file-systems have come to include some form of data “de-duplication” or data “redundancy elimination” algorithm. These algorithms can be applied at the whole-file level or at the sub-file level.
One of the most common approaches to sub-file de-duplication is to first break data streams (files) into chunks using a data fingerprinting algorithm, such as Rabin fingerprinting. Data fingerprinting algorithms can be set to produce chunks of an “expected size” based on parameters of the algorithm. Once the files are in chunks, a hashing algorithm is used to uniquely identify the content of each chunk. These unique identifiers are then placed into a queryable index. When a chunk is found that already exists in the file-system (detected by querying the index, or by attempting an insert and getting a collision), the new chunk can be replaced by a reference to the existing copy, and “de-duplication” occurs. For each file that is chunked, a “blueprint” or chunk list is produced identifying how to reconstruct the file from its constituent chunks.
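The chunk, hash, index, and blueprint steps described above can be illustrated with a minimal sketch. It rests on several assumptions: a simple polynomial rolling hash stands in for Rabin fingerprinting, SHA-256 serves as the chunk identifier, an in-memory dictionary plays the role of the queryable index, and the names `chunk`, `DedupStore`, `WINDOW`, and `MASK` are hypothetical. It is not the implementation of any particular product.

```python
import hashlib

WINDOW = 16          # bytes in the rolling window (illustrative value)
MASK = (1 << 6) - 1  # cut when the low 6 bits are zero: roughly 64-byte chunks
BASE = 257
MOD = (1 << 31) - 1


def chunk(data: bytes):
    """Split data at content-defined boundaries using a rolling hash.

    Assumption: a polynomial rolling hash is used here as a simple
    stand-in for Rabin fingerprinting.
    """
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # Drop the oldest byte so the hash covers only the last WINDOW bytes.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


class DedupStore:
    """Toy de-duplicated store: one copy per unique chunk, one blueprint per file."""

    def __init__(self):
        self.index = {}       # chunk identifier -> chunk bytes
        self.blueprints = {}  # file name -> ordered list of chunk identifiers

    def put(self, name: str, data: bytes):
        blueprint = []
        for c in chunk(data):
            cid = hashlib.sha256(c).hexdigest()
            # If the identifier is already indexed, the existing copy is kept
            # and only a reference is recorded in the blueprint.
            self.index.setdefault(cid, c)
            blueprint.append(cid)
        self.blueprints[name] = blueprint

    def get(self, name: str) -> bytes:
        # Reconstruct the file by following its blueprint.
        return b"".join(self.index[cid] for cid in self.blueprints[name])
```

In this sketch the expected chunk size is governed by `MASK`: widening the mask by one bit roughly doubles the average chunk length. Storing the same content under two names adds a second blueprint but no new chunks to the index.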
One issue with this type of de-duplicated file-system is that its data storage format makes it very difficult to maintain the de-duplicated state when backing up to disjoint storage media (e.g., tape systems). The data is interconnected: object “blueprints” refer to multiple chunks, and de-duplicated chunks point back to multiple objects. As a result, reading a single object may require mounting multiple storage media in order to read the data for the object.
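A small sketch can make the multiple-mount problem concrete. The chunk identifiers, tape labels, file names, and the helper `media_needed` are all hypothetical, and the placement of chunks on media is assumed purely for illustration.

```python
def media_needed(blueprint, chunk_location):
    """Return the set of media that must be mounted to read one object."""
    return {chunk_location[cid] for cid in blueprint}


# Assumed placement: each unique chunk lives on exactly one medium,
# namely the one that was mounted when the chunk was first stored.
chunk_location = {
    "c1": "tape-A",
    "c2": "tape-A",
    "c3": "tape-B",
    "c4": "tape-C",
}

# Blueprints for two objects that share chunk c1.
blueprints = {
    "report.doc": ["c1", "c2"],
    "report-v2.doc": ["c1", "c3", "c4"],
}

for name, blueprint in blueprints.items():
    print(name, sorted(media_needed(blueprint, chunk_location)))
```

Under these assumptions, `report.doc` can be read from a single tape, while `report-v2.doc` reuses a chunk stored earlier and therefore needs three tapes mounted to reconstruct one object.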