Data deduplication (sometimes referred to as data optimization) refers to detecting, uniquely identifying, and eliminating redundant data in storage systems, thereby reducing the number of bytes that must be stored on disk or transmitted across a network without compromising the fidelity or integrity of the original data. By reducing the resources needed to store and/or transmit data, data deduplication leads to savings in hardware and power costs (for storage), data management costs (e.g., backup costs), and network bandwidth costs. As the amount of digitally stored data grows, these cost savings become significant.
There are a variety of techniques and granularity levels for eliminating redundancy within and between persistently stored files. Fixed-size chunking, in which a file is divided into fixed-size blocks or chunks that are each deduplicated, is an improvement over file-level chunking, in which an entire file is treated as a single chunk. However, fixed-size chunking fails to handle certain conditions, such as an insertion or deletion of data at the beginning or in the middle of a file: because the edit shifts all subsequent data relative to the fixed chunk boundaries, the unchanged portions of the data after the edit can no longer be detected. Variable-size chunking addresses these failures, but at the cost of additional processing. Most variable-size chunking techniques employ content-aware chunking, in which chunk boundaries are determined by the data content itself; this is a useful feature of many high-efficiency storage and communication protocols.
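The data shifting effect described above can be illustrated with a minimal Python sketch. It is not drawn from any particular system: the function names, the 64-byte chunk size, the 16-byte window, and the 6-bit mask are all assumptions chosen for illustration, and the window hash is recomputed at every position for clarity where a practical implementation would use a rolling hash.

```python
import hashlib
import random
from typing import List


def fixed_chunks(data: bytes, size: int = 64) -> List[bytes]:
    """Split at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> List[bytes]:
    """Cut wherever a hash of the preceding `window` bytes has its low bits zero.

    Illustrative only: the hash is recomputed per position rather than rolled.
    With mask 0x3F, roughly 1 in 64 positions is expected to be a cut point.
    """
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        digest = hashlib.blake2b(data[i - window:i], digest_size=4).digest()
        if int.from_bytes(digest, "big") & mask == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder after the last cut
    return chunks


def shared_fraction(old: List[bytes], new: List[bytes]) -> float:
    """Fraction of the new version's bytes that lie in chunks already stored."""
    stored = {hashlib.sha256(c).digest() for c in old}
    reused = sum(len(c) for c in new if hashlib.sha256(c).digest() in stored)
    return reused / sum(len(c) for c in new)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(8192))
edited = b"X" + original  # a single byte inserted at the beginning

print("fixed-size:   ", shared_fraction(fixed_chunks(original), fixed_chunks(edited)))
print("content-aware:", shared_fraction(content_chunks(original), content_chunks(edited)))
```

Under these assumptions, the one-byte insertion misaligns every fixed-size chunk, so almost nothing is recognized as already stored, whereas the content-defined boundaries move with the data and only the chunk containing the edit is new. The sketch imposes no limits on chunk size, so individual chunks may be arbitrarily small or large.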
It is highly desirable that any system implementing content-aware chunking achieve extremely high throughput (e.g., be capable of processing one or more Gbps per CPU core, and ten or more Gbps with hardware assistance) as well as a desired chunk size distribution. Further, both very small chunks and very large chunks are undesirable. Very small chunks incur high per-chunk overhead during indexing and/or communicating, which reduces the net deduplication savings. Very large chunks may exceed the allowed unit cache/memory size, which leads to implementation difficulties. Very large chunks also make it more difficult to find matching chunks and may likewise result in reduced deduplication savings. Moreover, it is desirable to have a smooth probability distribution of chunk sizes to optimize savings while maintaining low processing complexity.
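One simple way to keep chunk sizes within bounds, sketched below in Python, is to ignore candidate boundaries that would produce a chunk shorter than a minimum size and to force a cut once a maximum size is reached. This is a generic illustration rather than a method described here: the names `is_boundary` and `bounded_chunks` and the parameter values (32-byte minimum, 256-byte maximum, 16-byte window, 6-bit mask) are assumptions, and the window hash is again recomputed per position instead of rolled.

```python
import hashlib
import random
from typing import List


def is_boundary(data: bytes, i: int, window: int, mask: int) -> bool:
    """True if a hash of the `window` bytes ending at position i has zero low bits."""
    digest = hashlib.blake2b(data[i - window:i], digest_size=4).digest()
    return int.from_bytes(digest, "big") & mask == 0


def bounded_chunks(data: bytes, min_size: int = 32, max_size: int = 256,
                   window: int = 16, mask: int = 0x3F) -> List[bytes]:
    """Content-defined chunking with hypothetical minimum and maximum sizes."""
    chunks, start = [], 0
    while start < len(data):
        limit = min(start + max_size, len(data))  # forced cut if no boundary is found
        cut = limit
        first = max(start + min_size, window)     # skip boundaries that are too early
        for i in range(first, limit):
            if is_boundary(data, i, window, mask):
                cut = i
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks


rng = random.Random(1)
data = bytes(rng.getrandbits(8) for _ in range(20000))
chunks = bounded_chunks(data)
sizes = [len(c) for c in chunks]

print("chunks:", len(sizes), "min:", min(sizes), "max:", max(sizes))
print("forced cuts at max size:", sum(1 for s in sizes if s == 256))
```

Every chunk except possibly the last falls between the minimum and maximum sizes. The trade-off is that forced cuts are not content-defined, so they are exposed to the same shifting effect as fixed-size chunks, and they concentrate chunk sizes at the maximum rather than producing a smooth distribution.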