In a deduplicating storage system, content is typically divided into variable-sized “chunks” based on characteristics of the data. If a hash of a chunk, also known as a fingerprint, matches that of a chunk already stored in the system, the chunk is known to be a duplicate. The goal of using variable-sized chunks is to isolate changes so that a modification that shifts data up or down in a file will not cause all subsequent pieces of the file to be different from the earlier version. Chunks have a target average size, such as 8 KB, with minimum and maximum sizes constraining the size of any specific chunk.
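The chunking and fingerprint-matching steps described above can be illustrated with a minimal sketch. It is not the method of any particular system: the shift-and-add rolling hash, the SHA-1 fingerprint, the 2 KB and 64 KB limits, and the names `split_chunks`, `fingerprint`, and `store` are all assumptions chosen for illustration. Only the 8 KB target average comes from the text.

```python
import hashlib
import random

# Illustrative parameters (assumed; only the 8 KB target comes from the text).
MIN_SIZE = 2 * 1024      # no chunk boundary is allowed before this many bytes
AVG_SIZE = 8 * 1024      # target average; a power of two so a bit mask works
MAX_SIZE = 64 * 1024     # a boundary is forced once a chunk reaches this size
MASK = AVG_SIZE - 1

# Fixed table of pseudo-random 32-bit values, one per possible byte value.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def split_chunks(data: bytes) -> list:
    """Split data into variable-sized chunks at content-defined boundaries."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # The low bits of h depend only on the most recent bytes, so the
        # boundary decision follows the content rather than the file offset.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if length < MIN_SIZE:
            continue
        if (h & MASK) == 0 or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])   # trailing chunk may be under MIN_SIZE
    return chunks


def fingerprint(chunk: bytes) -> str:
    """Hash of a chunk, used to detect duplicates (SHA-1 is an assumption)."""
    return hashlib.sha1(chunk).hexdigest()


def store(data: bytes, index: set) -> tuple:
    """Add the chunks of data to index; return (new_chunks, duplicate_chunks)."""
    new = dup = 0
    for chunk in split_chunks(data):
        fp = fingerprint(chunk)
        if fp in index:
            dup += 1
        else:
            index.add(fp)
            new += 1
    return new, dup
```

Because each boundary decision depends on nearby content, inserting a few bytes near the start of a file typically changes only the chunk containing the insertion, and later chunks keep their fingerprints. In this simple scheme the realized mean chunk size is somewhat above `AVG_SIZE`, since no boundary is tested before `MIN_SIZE` bytes.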
By varying the chunk size, a system can trade off deduplication effectiveness against overhead cost. When there are long regions of unchanged data, a smaller chunk size has little effect, since any chunk size will deduplicate equally well. Similarly, when changes are frequent and spaced closer together than a chunk, all chunks will differ and fail to deduplicate. But when the changes are sporadic relative to a given chunk size, smaller chunks can help to isolate the parts that have changed from the parts that have not, improving the overall compression achieved from deduplication.
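The three cases above can be made concrete with a deliberately idealized model. It assumes fixed-size chunks on a regular grid and in-place, single-byte overwrites (so no boundary shifts), which is simpler than the variable-sized chunking the text describes. The function name `unchanged_fraction`, the 1 MiB file size, and the change spacings are hypothetical values chosen for illustration.

```python
def unchanged_fraction(file_size: int, chunk_size: int, change_offsets) -> float:
    """Fraction of chunks left untouched by the given in-place changes."""
    total_chunks = -(-file_size // chunk_size)          # ceiling division
    dirty = {offset // chunk_size for offset in change_offsets}
    return (total_chunks - len(dirty)) / total_chunks


FILE_SIZE = 1024 * 1024                                  # 1 MiB (assumed)
SCENARIOS = {
    "no changes": [],
    "dense (every 1 KiB)": range(0, FILE_SIZE, 1024),
    "sporadic (every 64 KiB)": range(0, FILE_SIZE, 64 * 1024),
}

for name, offsets in SCENARIOS.items():
    for chunk_size in (4 * 1024, 16 * 1024):
        frac = unchanged_fraction(FILE_SIZE, chunk_size, offsets)
        print(f"{name:24s} chunk={chunk_size // 1024:2d} KiB  unchanged={frac:.4f}")
```

Under these assumptions, both chunk sizes leave every chunk unchanged when there are no changes, and no chunk unchanged when changes fall every 1 KiB. Only in the sporadic case does the chunk size matter: 93.75% of the 4 KiB chunks remain unchanged, compared with 75% of the 16 KiB chunks.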
At the same time, every chunk requires certain metadata to track its location, and the mapping of files to chunks must enumerate more chunks if the chunks are smaller, so the total per-chunk overhead scales inversely with the chunk size. More metadata must be stored, and more chunks must be looked up in the system; i.e., smaller chunks incur additional storage overhead and computational overhead. There has been a lack of an efficient mechanism to determine a chunk size that provides the best balance between deduplication effectiveness and overhead. Further, when data is replicated from a source storage system to a target storage system that uses a different chunk size, the data chunks are typically replicated without considering the average chunk size of the target storage system. Such replication may adversely affect the performance of the target storage system.
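Returning to the inverse relationship between chunk size and overhead noted above, a short calculation shows its scale. The figure of 64 bytes of metadata per chunk, the 1 TiB data size, and the name `chunk_overhead` are hypothetical values chosen for illustration. Real systems will differ.

```python
PER_CHUNK_METADATA = 64   # bytes of metadata per chunk (hypothetical figure)


def chunk_overhead(data_bytes: int, avg_chunk_size: int,
                   per_chunk_metadata: int = PER_CHUNK_METADATA) -> tuple:
    """Return (number_of_chunks, total_metadata_bytes) for the given data."""
    n_chunks = -(-data_bytes // avg_chunk_size)   # ceiling division
    return n_chunks, n_chunks * per_chunk_metadata


TIB = 2 ** 40
GIB = 2 ** 30

for size_kib in (4, 8, 16):
    n_chunks, metadata = chunk_overhead(TIB, size_kib * 1024)
    print(f"{size_kib:2d} KiB chunks: {n_chunks:,} chunks, "
          f"{metadata / GIB:.0f} GiB of metadata")
```

Under these assumptions, halving the average chunk size from 8 KiB to 4 KiB doubles the chunk count from 2^27 to 2^28 and the metadata from 8 GiB to 16 GiB, with a matching increase in the number of fingerprint lookups.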