1. Problems with De-Duplicated File-Systems
Computer systems (e.g., server computer systems) need the ability to perform efficient data de-duplication. Backup solutions for computer file systems have increasingly included some form of data “de-duplication” or data “redundancy elimination” algorithm. These algorithms can operate at the whole-file or at the sub-file level.
2. Prior Art Systems
Referring to FIG. 1, a common prior art approach to sub-file de-duplication is to first break data streams (files) into chunks using a rolling-hash algorithm, such as Rabin fingerprinting. Rolling-hash fingerprinting algorithms can be tuned to produce chunks of an “expected size” based on parameters of the algorithm. Once the files are in chunks, a collision-resistant hashing algorithm is used to uniquely identify the content of each of those chunks by generating a unique identifier per chunk.
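The chunking step above can be sketched as follows. This is a simplified illustration, not the Rabin fingerprinting scheme itself: it uses a plain additive rolling hash over a fixed window, and the `WINDOW`, `DIVISOR`, and `MIN_CHUNK` parameters are hypothetical values chosen for demonstration. A chunk boundary is declared wherever the rolling hash hits a designated value, so boundaries depend on content rather than position, and the expected chunk size is governed by the divisor.

```python
import hashlib

WINDOW = 16     # bytes in the rolling window (illustrative value)
DIVISOR = 64    # expected chunk size is roughly DIVISOR bytes (small, for illustration)
MIN_CHUNK = 16  # avoid degenerate tiny chunks

def chunk_stream(data: bytes) -> list[bytes]:
    """Split data into variable-size chunks at content-defined boundaries."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # maintain a simple additive rolling hash over the last WINDOW bytes
        h += byte
        if i - start >= WINDOW:
            h -= data[i - WINDOW]
        # declare a boundary when the hash takes on the designated value
        if i - start + 1 >= MIN_CHUNK and h % DIVISOR == DIVISOR - 1:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def chunk_id(chunk: bytes) -> str:
    """Collision-resistant hash (here SHA-256) uniquely identifying the chunk's content."""
    return hashlib.sha256(chunk).hexdigest()
```

Because boundaries are determined by the content inside the window, identical regions of two different files tend to produce identical chunks even when their byte offsets differ, which is what makes sub-file duplicate detection possible.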
These unique identifiers are then placed into an index that can be queried. When a chunk is found that already exists in the system (found by querying the index, or by attempting an insert and getting a collision), the duplicate can be replaced by a reference to the existing chunk, and “de-duplication” occurs. For each file that is chunked, a “blueprint” (a sequence of references to unique chunks) is produced identifying how to reconstruct the file from its constituent parts or chunks.
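The index-and-blueprint mechanism can be illustrated with a minimal sketch. The `DedupStore` class and its methods are hypothetical names invented for this example; the index here is an in-memory dictionary mapping content identifiers to chunk bytes, standing in for whatever persistent index a real system would use.

```python
import hashlib

class DedupStore:
    """Toy chunk store: an index of unique chunks plus per-file blueprints."""

    def __init__(self):
        self.index = {}  # content identifier -> chunk bytes

    def store_file(self, chunks: list[bytes]) -> list[str]:
        """Insert each chunk at most once; return the file's blueprint."""
        blueprint = []
        for chunk in chunks:
            cid = hashlib.sha256(chunk).hexdigest()
            # an existing entry means the chunk is a duplicate: keep one copy
            self.index.setdefault(cid, chunk)
            blueprint.append(cid)
        return blueprint

    def reconstruct(self, blueprint: list[str]) -> bytes:
        """Rebuild the original file by following the blueprint's references."""
        return b"".join(self.index[cid] for cid in blueprint)
```

For example, storing a file whose chunk sequence repeats a chunk keeps only one physical copy of that chunk, while the blueprint records the reference twice, so reconstruction still yields the original bytes.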
The size of the data chunks has a dramatic effect on the de-duplication rates. As the size of the chunks decreases, the de-duplication rate increases, but the unique content identifier index and the “blueprints” used to reconstruct objects from their parts grow, resulting in slower object “reconstruction” from the “blueprints”. Conversely, as the size of the chunks increases, the unique content identifier index and the “blueprints” shrink and object “reconstruction” from the “blueprints” becomes faster, but the de-duplication rate decreases.
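The growth of the index with shrinking chunk size can be made concrete with some illustrative arithmetic, assuming 32-byte (SHA-256) content identifiers and ignoring any per-entry index overhead:

```python
def index_overhead(total_bytes: int, chunk_size: int, id_bytes: int = 32) -> int:
    """Approximate bytes of content identifiers needed to index total_bytes of data."""
    num_chunks = total_bytes // chunk_size
    return num_chunks * id_bytes

TB = 1 << 40
small = index_overhead(TB, 4 * 1024)    # 4 KiB chunks over 1 TiB
large = index_overhead(TB, 64 * 1024)   # 64 KiB chunks over 1 TiB
# a 16x larger chunk size shrinks the identifier index by 16x
```

Under these assumptions, 4 KiB chunks over 1 TiB of data require 8 GiB of identifiers, while 64 KiB chunks require only 512 MiB, illustrating why chunk size is a first-order design decision.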
One problem with this prior art de-duplication system relates to certain types of files (e.g., JPEGs, MPEGs, and other similarly compressed media files) that are very unlikely to share duplicate sub-file chunks with other files. Such files could use a very large chunk size, whereas other files in the system would benefit from a smaller chunk size. Other factors may also influence whether a certain file type is a candidate for smaller or larger chunking. For example, file types that are typically “consumed” by the user(s) of the system containing these files are generally poor candidates for sub-file data de-duplication and, thus, would benefit from larger-block or whole-file de-duplication.
File types that are typically “produced” by the user(s) of the system containing those files are generally good candidates for data de-duplication. This is because such files will often be found in different stages of preparedness (e.g., drafts, modifications, alterations). For example, movie files (such as those available from YouTube, LLC) often exist in multiple versions of the same movie and would benefit from smaller-block data de-duplication.
A solution is required that can achieve high speeds without sacrificing de-duplication rates. Therefore, a method and system for performing de-duplication of at least one computer file in a computer system is needed.