1. Field of the Invention
The present invention relates generally to de-duplication, and in particular to reducing identification of chunk portions in data de-duplication.
2. Background Information
De-duplication processes partition data objects into smaller parts (called “chunks”) and retain only the unique chunks in a dictionary (repository) of chunks. To be able to reconstruct an object, a list of hashes (indexes or metadata) of its chunks is stored in place of the original object. The size of this list of hashes is customarily ignored in the de-duplication compression ratios reported by de-duplication product vendors. That is, vendors typically report only the unique chunk data size versus the original size.
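The partitioning and reconstruction described above can be illustrated with a minimal sketch. It assumes fixed-size chunks and SHA-1 hashes, neither of which is specified in the text; the names `deduplicate` and `reconstruct` are hypothetical and chosen only for illustration.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks (an assumed chunking scheme).

    Returns (dictionary, hash_list): the dictionary maps each hash to one
    unique chunk, and hash_list records the hash of every chunk in order.
    """
    dictionary = {}  # hash -> unique chunk bytes
    hash_list = []   # ordered hashes needed to rebuild the object
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha1(chunk).digest()  # SHA-1 is an assumption
        dictionary.setdefault(digest, chunk)   # keep only the first copy
        hash_list.append(digest)
    return dictionary, hash_list


def reconstruct(dictionary, hash_list) -> bytes:
    """Rebuild the original object by looking up each hash in order."""
    return b"".join(dictionary[h] for h in hash_list)
```

In this sketch the stored size is the sum of the unique chunk sizes plus the size of `hash_list`; the second term is the one that reported compression ratios customarily leave out.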
The list of hashes becomes larger relative to the data as smaller chunks are employed. Smaller chunks are more likely to match and can therefore be used to achieve higher compression ratios. Known de-duplication systems try to diminish the significance of the index metadata by using large chunk sizes and therefore accept lower overall compression ratios. In addition, standard compression methods (LZ, Gzip, Compress, Bzip2, etc.) perform poorly when applied to the list of hashes.
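The trade-off can be made concrete with a short sketch. It assumes 20-byte SHA-1 hashes and uses zlib as a stand-in for the standard compression methods named above; both function names are hypothetical, and the hash list is simplified to contain only distinct digests.

```python
import hashlib
import zlib

HASH_SIZE = 20  # bytes per SHA-1 digest (assumed hash function)


def metadata_overhead(chunk_size: int, hash_size: int = HASH_SIZE) -> float:
    """Bytes of hash-list metadata per byte of original data."""
    return hash_size / chunk_size


def hash_list_compressibility(num_hashes: int = 10000) -> float:
    """Compressed size divided by raw size for a list of distinct digests.

    Simplification: every digest is distinct, so the list contains no
    repeated entries for the compressor to exploit.
    """
    raw = b"".join(
        hashlib.sha1(i.to_bytes(8, "big")).digest() for i in range(num_hashes)
    )
    return len(zlib.compress(raw, 9)) / len(raw)
```

Under these assumptions, `metadata_overhead(8192)` is about 0.24% of the original data while `metadata_overhead(512)` is about 3.9%, and the zlib ratio stays close to 1 because the digests are effectively random bytes.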