Over the last few years, data deduplication has become one of the most widely researched topics in the field of storage systems. It enables significant savings, as the required storage space can be reduced by up to 20 times, especially for backup workloads. Beyond capacity optimization, deduplication can also improve write bandwidth. If a system provides inline deduplication (performed while data is being written) and verifies the equality of chunks by comparing only their hashes, the data of duplicated chunks need not be stored on disk or even transmitted over the network. However, providing an effective way to identify duplicates is not simple.
Consider a sample single-node disk-based storage system with reliable, inline deduplication. We assume a 2U storage node with twelve 1 TB disks, for a total of 12 TB of disk space per node. Deduplication is performed at the chunk level by comparing hashes of chunk contents. Related work indicates a chunk size of 8 kB as a reasonable choice. To provide deduplication with this chunk size, we need a dictionary of 1.5 billion entries. Keeping only the hashes for them will consume 30 GB for SHA-1 or about 50 GB for SHA-256, which will not fit into RAM of a reasonable size.
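The sizing above follows directly from the stated assumptions (12 TB of raw capacity, 8 kB chunks, 20-byte SHA-1 and 32-byte SHA-256 digests). A quick back-of-the-envelope calculation:

```python
# Dictionary sizing for the example node; all figures come from the text.
TB = 10**12
KB = 10**3
GB = 10**9

capacity = 12 * TB          # twelve 1 TB disks
chunk_size = 8 * KB         # chunk size suggested by related work

entries = capacity // chunk_size   # number of dictionary entries
sha1_bytes = entries * 20          # SHA-1 digest is 20 bytes
sha256_bytes = entries * 32        # SHA-256 digest is 32 bytes

print(entries)             # 1500000000  (1.5 billion)
print(sha1_bytes / GB)     # 30.0
print(sha256_bytes / GB)   # 48.0  (~50 GB)
```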
Current systems implement the dictionary as a disk-resident hash table. However, hashes of data chunks are uniformly distributed, and accesses to them exhibit no locality. This makes straightforward caching ineffective and causes random disk reads during lookup. NPL 1 and NPL 2 suggest a combination of two optimization techniques.
1. To avoid disk access when looking up chunks not present in the system, all hashes are summarized in an in-memory Bloom filter. This speeds up negative answers.
2. Prefetching assumes that duplicates will be written in the same order as the original chunks. Hashes are additionally kept in dedicated files that reflect the order in which the chunks were initially written. This speeds up positive answers, but only if the order is preserved.
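The Bloom filter in technique 1 can be illustrated with a minimal sketch. This is not the cited systems' implementation; the sizes and the salted-SHA-1 hashing scheme are our own illustrative choices. The key property is that a negative answer is definitive, so lookups of absent chunks never touch the disk:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch (illustrative, not NPL 1/2 code).

    might_contain() returning False guarantees the key was never added,
    so no disk access is needed for such negative answers. True answers
    may be false positives and must be confirmed on disk.
    """

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions from the key via salted SHA-1 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(i.to_bytes(4, "big") + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

In a real deduplicating system, the key would be the chunk's SHA-1 or SHA-256 digest, and the filter would be sized for the 1.5 billion expected entries.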
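Technique 2 can likewise be sketched. The class below is a hypothetical simulation: in-memory lists stand in for the on-disk files of write-ordered hashes, and `window` and `capacity` are assumed tuning parameters. When a lookup misses the cache but is found on disk, the following hashes from the same stream are prefetched, betting that duplicates arrive in the original write order:

```python
from collections import OrderedDict

class PrefetchCache:
    """Sketch of order-preserving hash prefetch (hypothetical).

    hash_streams maps a stream id to the list of chunk hashes in the
    order they were originally written; it simulates the on-disk files.
    """

    def __init__(self, hash_streams, capacity=1024, window=64):
        self.streams = hash_streams
        self.index = {h: (sid, i)            # hash -> (stream, offset)
                      for sid, hashes in hash_streams.items()
                      for i, h in enumerate(hashes)}
        self.cache = OrderedDict()           # LRU cache of prefetched hashes
        self.capacity = capacity
        self.window = window

    def lookup(self, chunk_hash: bytes) -> bool:
        if chunk_hash in self.cache:         # fast positive answer from RAM
            self.cache.move_to_end(chunk_hash)
            return True
        loc = self.index.get(chunk_hash)     # stands in for a random disk read
        if loc is None:
            return False
        sid, i = loc
        # Prefetch the next hashes from the same stream into the cache.
        for h in self.streams[sid][i:i + self.window]:
            self.cache[h] = True
            self.cache.move_to_end(h)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)
        return True
```

If duplicates arrive in the original order, every lookup after the first in a window is served from memory; if the order is not preserved, the prefetched hashes are wasted, which is exactly the limitation noted above.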