Most data de-duplication techniques focus on full backups, in which all logical blocks of a logical volume are de-duplicated against existing stored blocks even if only a small portion of the logical blocks has changed.
In one known sparse indexing scheme, a sampled fingerprint index is used instead of a whole fingerprint index so that the fingerprint index fits into RAM for a D2D data backup system. The insight behind the sampled fingerprint index is that duplicated data blocks tend to occur in consecutive ranges of non-trivial length. A match of sampled fingerprint values in a range therefore indicates, with high probability, a match of the whole range. Among all matched ranges, a champion is chosen to be the de-duplication target. The sparse indexing scheme samples the fingerprints at a fixed sampling rate. In another known scheme, fingerprints are sampled based on the temporal locality of a segment: entries of a segment in the sparse chunk index are deleted or pruned if a criterion is met (for example, a period of time or a fixed count). Each entry in the sampled fingerprint index points to up to R containers. In other words, this scheme leverages a least recently used (LRU) list to reduce the sampling rate of containers.
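The fixed-rate sampling and champion selection described above can be sketched as follows. This is only an illustrative sketch, not the implementation of any particular known scheme: the class name `SparseIndex`, the use of SHA-1 as the fingerprint function, the modulo-based sampling rule, and the "most shared samples wins" champion rule are all assumptions made for the example.

```python
import hashlib
from collections import Counter
from typing import Optional


def fingerprint(block: bytes) -> int:
    """Fingerprint of a block as an integer (SHA-1 is an assumption)."""
    return int.from_bytes(hashlib.sha1(block).digest(), "big")


class SparseIndex:
    """Hypothetical in-RAM index that keeps sampled fingerprints only."""

    def __init__(self, sample_mod: int = 4) -> None:
        # Fixed sampling rate: roughly 1 in sample_mod fingerprints is kept.
        self.sample_mod = sample_mod
        self.samples = {}  # sampled fingerprint -> set of stored range ids

    def _is_sample(self, fp: int) -> bool:
        return fp % self.sample_mod == 0

    def add_range(self, range_id: str, blocks: list) -> None:
        """Record the sampled fingerprints of a stored range of blocks."""
        for block in blocks:
            fp = fingerprint(block)
            if self._is_sample(fp):
                self.samples.setdefault(fp, set()).add(range_id)

    def champion(self, blocks: list) -> Optional[str]:
        """Return the stored range sharing the most sampled fingerprints."""
        votes = Counter()
        for fp in {fingerprint(b) for b in blocks}:
            if self._is_sample(fp):
                # Each stored range containing this sample gets one vote.
                votes.update(self.samples.get(fp, ()))
        return votes.most_common(1)[0][0] if votes else None


# Usage: two stored ranges, then an incoming range that mostly repeats "A".
index = SparseIndex(sample_mod=4)
index.add_range("A", [b"a%d" % i for i in range(64)])
index.add_range("B", [b"b%d" % i for i in range(64)])
incoming = [b"a%d" % i for i in range(60)] + [b"new%d" % i for i in range(4)]
print(index.champion(incoming))
```

Only the sampled fingerprints are held in memory, so the index shrinks by roughly the sampling factor; the full fingerprints of the champion range would then be loaded to de-duplicate the incoming blocks.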
In another approach, a Bloom filter is employed to determine whether a fingerprint is absent from the fingerprint index, which blindly records all fingerprints of all blocks regardless of the history of these blocks. In another known scheme, each file has a whole-file hash, so that a match of the whole-file hash indicates that the whole file is a duplicate and the individual fingerprints do not need to be checked. In this approach, de-duplication is based on files, and each file has a representative fingerprint. Entries in the fingerprint index are distributed to K nodes one by one based on a modulo operation or other distributed hash table functionality. The whole container corresponding to a fingerprint index entry is distributed to the same node as the fingerprint index entry. If two fingerprint index entries happen to refer to the same container but are distributed to two different nodes, the same container is duplicated on both nodes. When a file is backed up, only the representative fingerprint is used to route the file to a node; all other fingerprints of the file are not used for routing purposes.
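The modulo-based distribution and the resulting container duplication can be illustrated with a short sketch. The function names are hypothetical, SHA-1 stands in for the fingerprint function, and choosing the minimum chunk fingerprint as the representative fingerprint is an assumption; the text above does not say how the representative is selected.

```python
import hashlib


def fingerprint(chunk: bytes) -> int:
    """Fingerprint of a chunk as an integer (SHA-1 is an assumption)."""
    return int.from_bytes(hashlib.sha1(chunk).digest(), "big")


def representative(chunks: list) -> int:
    """Representative fingerprint of a file (minimum is an assumption)."""
    return min(fingerprint(c) for c in chunks)


def route_file(chunks: list, k: int) -> int:
    """Route a file to one of k nodes using only its representative fingerprint."""
    return representative(chunks) % k


def place_entries(entries: dict, k: int) -> dict:
    """Distribute fingerprint -> container entries to k nodes by modulo.

    Each container travels with its index entry, so a container referenced
    by entries on different nodes ends up stored on every one of them.
    """
    nodes = {n: set() for n in range(k)}
    for fp, container in entries.items():
        nodes[fp % k].add(container)
    return nodes


# Small integer fingerprints keep the arithmetic easy to follow.
placement = place_entries({8: "C1", 9: "C1", 12: "C2"}, k=4)
print(placement[0])  # node 0 holds C1 and C2
print(placement[1])  # node 1 holds a second copy of C1
```

Here fingerprints 8 and 9 both refer to container C1 but fall on nodes 0 and 1, so C1 is stored twice, which is the duplication noted above.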
Some known US patents provide de-duplication techniques. For example, in one prior-art approach, fingerprint records of data blocks are distributed based on a portion of the fingerprint value. The fingerprint records have the full information for data de-duplication, and there is no concept of a fingerprint container. In another prior-art approach, a modified version of a fixed prefix network (FPN) is used to distribute data blocks with a predefined data redundancy level. Because the storage is content-addressable storage, there is neither a fingerprint index nor containers. In yet another prior-art approach, blocks are distributed based on fingerprints. In a first step, blocks are divided into partitions based on the k least significant bits of the fingerprint key. In a second step, the partitions are mapped to a physical node using a distributed hash table (DHT). Each data node has a queue of encrypted keys, and a request searches the queue for duplicate discovery.
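The two-step placement in the last approach above can be sketched as follows. The first step follows the description directly; for the second step, a simple consistent-hashing ring is used as a stand-in for the DHT, which is an assumption, as are the names `Ring`, `partition_of`, and `locate`. The per-node queue of encrypted keys is not modeled.

```python
import bisect
import hashlib


def partition_of(fp: int, k: int) -> int:
    """Step 1: the partition id is the k least significant bits of the key."""
    return fp & ((1 << k) - 1)


def _ring_hash(label: str) -> int:
    """Position of a label on the ring (first 8 bytes of SHA-1, an assumption)."""
    return int.from_bytes(hashlib.sha1(label.encode()).digest()[:8], "big")


class Ring:
    """Step 2 stand-in: a consistent-hashing ring mapping partitions to nodes."""

    def __init__(self, nodes: list) -> None:
        self._points = sorted((_ring_hash(n), n) for n in nodes)
        self._keys = [h for h, _ in self._points]

    def node_for(self, partition: int) -> str:
        h = _ring_hash("partition-%d" % partition)
        # First node clockwise from the partition's position, wrapping around.
        i = bisect.bisect_right(self._keys, h) % len(self._points)
        return self._points[i][1]


def locate(fp: int, k: int, ring: Ring) -> str:
    """Physical node responsible for the block with fingerprint fp."""
    return ring.node_for(partition_of(fp, k))


ring = Ring(["node-a", "node-b", "node-c"])
print(partition_of(0b101101, 3))  # 5, the low three bits 0b101
print(locate(0b101101, 3, ring))
```

Because only the low k bits select the partition, any two fingerprints that agree in those bits land on the same node, regardless of how the ring assigns partitions.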