Conventional data de-duplication (dedupe) involves identifying whether two chunks of data are identical. Identical data does not need to be stored or transmitted. Instead, information (e.g., a reference) identifying identical data can be stored or transmitted. When the information about the data consumes less space or transmission bandwidth than the data, then space or transmission bandwidth is saved.
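As an illustration only, the conventional approach described above might be sketched as follows. The sketch assumes fixed-size chunking and SHA-256 fingerprints, neither of which is specified by the text, and the names `dedupe` and `restore` are hypothetical.

```python
import hashlib


def dedupe(data: bytes, chunk_size: int):
    """Split data into fixed-size chunks and store each unique chunk once.

    Assumptions for illustration: fixed-size chunking and SHA-256
    fingerprints. Returns the chunk store and an ordered list of
    references (fingerprints) from which the input can be rebuilt.
    """
    store = {}    # fingerprint -> chunk bytes
    recipe = []   # one reference per chunk, in original order
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).digest()
        if fingerprint not in store:
            store[fingerprint] = chunk   # unique chunk: store it
        recipe.append(fingerprint)       # duplicate or not: keep a reference
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data by following the stored references."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


data = b"ABCDABCDEFGHABCD"
store, recipe = dedupe(data, chunk_size=4)
rebuilt = restore(store, recipe)
```

In this toy input the 4-byte chunks are smaller than their 32-byte references, so nothing is actually saved; the saving described above appears only when a reference is smaller than the chunk it replaces.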
Conventional dedupe tends to operate in a binary manner: a chunk either is a duplicate or is not a duplicate. Duplicate chunks are not stored; unique chunks are stored. Additionally, conventional dedupe tends to rely on strong, wide cryptographic hashes to determine whether chunks are duplicates. Storing or transmitting strong, wide cryptographic hashes consumes at least a part of the memory and/or bandwidth that dedupe is trying to save. Furthermore, indexing chunks based on strong, wide cryptographic hashes can consume limited random access memory (RAM). When a large number of chunks are indexed with wide cryptographic hashes, the index can consume more memory than is available in an indexing machine.
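The memory pressure described above can be estimated with simple arithmetic. The following is a rough sketch under stated assumptions: every chunk is unique, each index entry holds one hash plus a fixed-size location pointer, and data structure overhead is ignored. The function name and all of the numbers are hypothetical and chosen only for illustration.

```python
def index_size_bytes(data_bytes: int, chunk_bytes: int,
                     hash_bytes: int, pointer_bytes: int = 8) -> int:
    """Estimate the size of a hash index, assuming every chunk is unique.

    Each entry is modeled as one hash plus one location pointer;
    container overhead is deliberately ignored.
    """
    n_chunks = -(-data_bytes // chunk_bytes)   # ceiling division
    return n_chunks * (hash_bytes + pointer_bytes)


# Hypothetical workload: 100 TiB of data, 8 KiB chunks, 32-byte hashes.
TIB = 2 ** 40
GIB = 2 ** 30
estimate = index_size_bytes(100 * TIB, 8 * 1024, hash_bytes=32)
estimate_gib = estimate / GIB   # 500.0 GiB under these assumptions
```

Under these assumed figures the index alone would need about 500 GiB, which may exceed the RAM available in a single indexing machine.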
Storage space for storing chunks and for storing indexing material is limited. While plentiful storage (e.g., disk, tape) may be suitable for storing chunks, less plentiful storage (e.g., RAM) may be suitable for storing indexing material and/or fingerprints (e.g., cryptographic hashes). Conventional indexes may have grown so large that they overflowed memory and required portions of the indexing material to be stored elsewhere (e.g., on disk). Storing indexing material on disk can slow down duplicate determinations. Attempts to store larger chunks may have led to fewer duplicate chunks being found. Attempts to store wide cryptographic hashes may have increased the amount of memory and/or disk space required to store indexing material. Attempts to store smaller chunks may have led to more duplicate chunks being found, but at the expense of storing more cryptographic hashes. Conventional systems may have wrestled with the competing goals of quickly determining whether a chunk is a duplicate using in-memory indexing material while at the same time storing less data and less indexing material.
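The chunk-size trade-off described above can be seen in a small, contrived example. The sketch below assumes fixed-size chunking and SHA-256 fingerprints; the function name `chunk_stats` and the sample data are hypothetical.

```python
import hashlib


def chunk_stats(data: bytes, chunk_size: int) -> dict:
    """Count duplicates found and hash bytes indexed for one chunk size.

    Assumes fixed-size chunking and SHA-256 fingerprints (32 bytes each).
    """
    seen = set()
    total = 0
    stored_bytes = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        total += 1
        fingerprint = hashlib.sha256(chunk).digest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            stored_bytes += len(chunk)   # only unique chunks are stored
    return {
        "total_chunks": total,
        "duplicates_found": total - len(seen),
        "chunk_bytes_stored": stored_bytes,
        "index_hash_bytes": len(seen) * hashlib.sha256().digest_size,
    }


# Contrived data: the 4-byte run "AAAA" repeats, but no 8-byte run does.
data = b"AAAABBBB" + b"AAAACCCC" + b"AAAADDDD"
large = chunk_stats(data, chunk_size=8)
small = chunk_stats(data, chunk_size=4)
```

With 8-byte chunks no duplicates are found and three hashes are indexed. With 4-byte chunks two duplicates are found and fewer chunk bytes are stored, but four hashes must be indexed.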