1. Field of the Invention
The present invention relates generally to de-duplication, and in particular to optimizing data chunking segment size.
2. Background Information
De-duplication methods partition an input object (or stream) into smaller parts, such as blocks or segments, known as “chunks”, and retain only the unique chunks in a repository. Conventionally, an object can be chunked in different ways, such as into fixed-size chunks or into content-dependent chunks whose boundaries are determined using fingerprints. Regardless of the chunking method employed, de-duplication performance (compression ratio) is generally better when the chunk sizes are smaller.
Smaller chunks, however, require more accesses to the repository (e.g., disk drive) when reconstructing an object (a problem known as “fragmentation”), and require relatively more entries in the repository of chunks. Conventional de-duplication systems normally use a “one size fits all” approach, failing to adapt the chunk sizes to variations in the compressibility of a given workload.
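The trade-off described above can be illustrated with a minimal sketch of fixed-size chunking. The sketch is illustrative only and is not taken from any particular de-duplication system: the function names (dedup_fixed, reconstruct, stats), the use of SHA-256 as the chunk fingerprint, the in-memory dictionary standing in for the repository, and the sample data are all assumptions made for this example.

```python
import hashlib


def dedup_fixed(data: bytes, chunk_size: int):
    """Split data into fixed-size chunks and keep one copy of each unique chunk.

    Returns (repository, recipe). The repository maps a fingerprint to the
    chunk bytes; the recipe is the ordered list of fingerprints needed to
    rebuild the object. Both structures are hypothetical stand-ins.
    """
    repository = {}
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).digest()
        repository.setdefault(fingerprint, chunk)  # store only the first copy
        recipe.append(fingerprint)
    return repository, recipe


def reconstruct(repository, recipe) -> bytes:
    """Rebuild the object; each recipe entry costs one repository access."""
    return b"".join(repository[fingerprint] for fingerprint in recipe)


def stats(data: bytes, chunk_size: int) -> dict:
    """Summarize storage cost and access cost for a given chunk size."""
    repository, recipe = dedup_fixed(data, chunk_size)
    stored_bytes = sum(len(chunk) for chunk in repository.values())
    return {
        "chunk_size": chunk_size,
        "unique_chunks": len(repository),  # entries held in the repository
        "lookups": len(recipe),            # accesses needed to reconstruct
        "stored_bytes": stored_bytes,
        "ratio": len(data) / stored_bytes,
    }


# Assumed sample input: two short patterns, each repeated 16 times (256 bytes).
sample = b"ABCDEFGH" * 16 + b"12345678" * 16

for size in (8, 64, 128):
    print(stats(sample, size))
```

On this sample, the smallest chunk size stores the fewest bytes but requires the most repository accesses to reconstruct the object, while the largest chunk size requires the fewest accesses and achieves no de-duplication at all. Because this contrived input contains only two distinct patterns, it does not show the growth in repository entries that smaller chunks typically cause on realistic workloads.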