Data deduplication is a data compression technique that eliminates redundant data and is particularly useful for improving storage utilization, for example when large amounts of data are backed up on a regular basis. In chunk-based inline deduplication for backup, a chunking algorithm breaks the data stream to be backed up into smaller pieces called “chunks” (typically on the order of a few kilobytes each), and a hash is computed for each chunk (e.g., using an MD5 hash function or SHA hash function). The hash of each incoming chunk is then looked up in one or more indexes and/or other data structures maintained by the system. If the lookup shows that the system already holds a chunk with that hash, the incoming chunk is a duplicate of data already stored in the system and need not be stored again.
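The chunk, hash, and lookup steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular system: it uses fixed-size 4 KB chunks as a stand-in for a real chunking algorithm, SHA-1 as the hash, and an in-memory dictionary as the only index. The names `ChunkStore`, `chunk_stream`, and `CHUNK_SIZE` are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # assumption: fixed-size chunks of a few kilobytes


def chunk_stream(data: bytes, size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks (a stand-in for a real chunking algorithm)."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


class ChunkStore:
    """Hypothetical store that keeps one copy of each distinct chunk."""

    def __init__(self):
        self.index = {}  # chunk hash -> chunk bytes

    def backup(self, data: bytes):
        """Store a stream; return (recipe of hashes, number of newly stored chunks)."""
        recipe = []
        new_chunks = 0
        for chunk in chunk_stream(data):
            digest = hashlib.sha1(chunk).hexdigest()
            if digest not in self.index:  # hash lookup decides duplicate or new
                self.index[digest] = chunk
                new_chunks += 1
            recipe.append(digest)
        return recipe, new_chunks

    def restore(self, recipe):
        """Rebuild the original stream from its recipe of chunk hashes."""
        return b"".join(self.index[d] for d in recipe)


store = ChunkStore()
data = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE + b"A" * CHUNK_SIZE
recipe, new_chunks = store.backup(data)
```

In this sketch the first and third chunks are identical, so only two chunks are stored for a three-chunk stream, and backing up the same stream a second time stores nothing new.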
To limit the amount of expensive memory required while maintaining performance, more elaborate indexing methods may be used in practice. In one example (the “sparse indexing method”), the system maintains a “sparse index,” which maps a small subset of hashes called “hooks” (e.g., one out of every 32 or 64 unique hashes) to an index with information about the chunk with that hook as well as chunks that occurred near that chunk in the past.
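One way to picture the sparse indexing method is the sketch below. It rests on several assumptions that the text does not specify: hooks are chosen by testing the first four bytes of the SHA-1 hash modulo a sample rate, the per-hook information is modeled as a "manifest" (the set of hashes seen together in one incoming segment), and each hook points only to the most recent manifest containing it. The names `SparseIndexStore`, `is_hook`, and `store_segment` are hypothetical.

```python
import hashlib


def chunk_hash(chunk: bytes) -> bytes:
    return hashlib.sha1(chunk).digest()


def is_hook(digest: bytes, sample_rate: int) -> bool:
    """Assumed sampling rule: about one hash in `sample_rate` qualifies as a hook."""
    return int.from_bytes(digest[:4], "big") % sample_rate == 0


class SparseIndexStore:
    def __init__(self, sample_rate: int = 32):
        self.sample_rate = sample_rate
        self.sparse_index = {}  # hook -> manifest id (small, kept in memory)
        self.manifests = []     # manifest id -> hashes of chunks seen together

    def store_segment(self, chunks) -> int:
        """Deduplicate one segment; return how many chunks were treated as new."""
        digests = [chunk_hash(c) for c in chunks]
        hooks = [d for d in digests if is_hook(d, self.sample_rate)]
        # Consult only the manifests that share at least one hook with this segment.
        candidate_ids = {self.sparse_index[h] for h in hooks if h in self.sparse_index}
        known = set()
        for manifest_id in candidate_ids:
            known |= self.manifests[manifest_id]
        new = {d for d in digests if d not in known}
        # Record this segment so later segments can find it through its hooks.
        self.manifests.append(set(digests))
        for h in hooks:
            self.sparse_index[h] = len(self.manifests) - 1
        return len(new)


segment = [str(i).encode() for i in range(2000)]
store = SparseIndexStore(sample_rate=32)
first = store.store_segment(segment)
second = store.store_segment(segment)
```

The in-memory index holds only the hooks, so it is much smaller than a full index. The trade-off in this sketch is that a duplicate chunk is missed whenever its segment shares no hook with a previously stored segment.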
In another example (the “Bloom filter method”), a Bloom filter tracks the hashes of chunks that are stored by the system, and a full chunk index maps the hash of every chunk stored by the system to an index with information about the chunk with that hash as well as chunks that occurred near that chunk in the past. To reduce the number of accesses to the full chunk index, the full chunk index is consulted only when the Bloom filter indicates that the input chunk may already have been stored. Because a Bloom filter can report false positives but never false negatives, a negative answer means the chunk is definitely new and the full chunk index need not be accessed.
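The interplay between the Bloom filter and the full chunk index might look like the sketch below. It is illustrative only: the filter size, the number of hash functions, the double-hashing scheme built on SHA-256, and the names `BloomFilter`, `BloomDedupStore`, and `index_lookups` are all assumptions, and the full chunk index is modeled as an in-memory dictionary that stores the chunk itself rather than information about it.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; bit positions come from double hashing a SHA-256 digest."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        d = hashlib.sha256(key).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1  # odd step avoids a zero stride
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


class BloomDedupStore:
    def __init__(self):
        self.bloom = BloomFilter()
        self.full_index = {}    # chunk hash -> chunk (stands in for the full chunk index)
        self.index_lookups = 0  # counts accesses to the full chunk index

    def store_chunk(self, chunk: bytes) -> bool:
        """Return True if the chunk was new and has been stored."""
        digest = hashlib.sha1(chunk).digest()
        if self.bloom.might_contain(digest):
            # Possible duplicate: only now is the full index consulted.
            self.index_lookups += 1
            if digest in self.full_index:
                return False
        # Definitely new (or a Bloom filter false positive): store and record it.
        self.full_index[digest] = chunk
        self.bloom.add(digest)
        return True


store = BloomDedupStore()
stored_first = store.store_chunk(b"example chunk")
lookups_after_first = store.index_lookups
stored_second = store.store_chunk(b"example chunk")
```

In this sketch the first chunk never touches the full index because the empty filter rules it out, while the repeated chunk triggers exactly one lookup that confirms the duplicate.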
A single node using one of the above-described methods provides acceptable performance for applications where the amount of data being backed up is small, or where high throughput is not needed. However, for enterprise-type applications where data backup requirements are much higher, employing multiple-node storage systems may be beneficial. One way to do this is to route incoming data among a number of mostly autonomous back-end nodes, each of which may employ one of the exemplary methods described above. Under this architecture, each batch of incoming data is deduplicated against only one back-end node, so it is important to route similar batches of data to the same back-end node; otherwise duplicate chunks end up stored on more than one node and backups consume more storage space than necessary.
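The text does not say how batches are routed, so the following is only one hypothetical policy, shown to make the idea concrete: each batch goes to the node that already holds the most of its chunk hashes, with ties broken toward the node storing the least data. Each back-end node is reduced to a set of known hashes, and the names `BackendNode`, `count_known`, and `route_batch` are assumptions.

```python
import hashlib


class BackendNode:
    """Hypothetical back-end node, reduced to the set of chunk hashes it holds."""

    def __init__(self, name: str):
        self.name = name
        self.hashes = set()

    def count_known(self, digests) -> int:
        return sum(1 for d in digests if d in self.hashes)

    def store(self, digests) -> int:
        """Deduplicate a batch against this node only; return the number of new chunks."""
        new = [d for d in set(digests) if d not in self.hashes]
        self.hashes.update(new)
        return len(new)


def route_batch(nodes, chunks):
    """Send a batch to the node with the most overlap; ties go to the emptiest node."""
    digests = [hashlib.sha1(c).digest() for c in chunks]
    best = max(nodes, key=lambda n: (n.count_known(digests), -len(n.hashes)))
    return best, best.store(digests)


nodes = [BackendNode("n0"), BackendNode("n1")]
node_a, new_a = route_batch(nodes, [b"a1", b"a2", b"a3"])
node_b, new_b = route_batch(nodes, [b"b1", b"b2", b"b3"])
node_c, new_c = route_batch(nodes, [b"a1", b"a2", b"a3", b"a4"])
```

In this sketch the third batch resembles the first, so it is routed to the same node and only its one unseen chunk is stored. Had it gone to the other node, all four chunks would have been stored again.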