Conventional hash-based de-duplication makes a binary duplicate/unique decision for a sub-block in one logical step using a hash-based index. A sub-block of data to be analyzed is hashed, and the hash is used either to find the stored duplicate block of data or to determine that the block is not a duplicate. This conventional hash-based de-duplication addresses the facts that storing data takes time and space, that transmitting data takes time and bandwidth, and that both storing and transmitting data cost money. Yet more data to be stored and/or transmitted is constantly being generated. The rate at which the amount of data is expanding may be exceeding the rate at which storage space and transmission bandwidth are growing. Furthermore, while the amount of data to be stored and/or transmitted is growing, the amount of time available to store and/or transmit data remains constant. Therefore, efforts, including hash-based data de-duplication, have been made to reduce the time, space, and bandwidth required to store and/or transmit data. Data de-duplication may be referred to as “dedupe”.
Conventional dedupe has included chunking a larger data item (e.g., object, file) into sub-blocks and processing hashes of the sub-blocks. Chunking includes selecting boundary locations for fixed-length and/or variable-length sub-blocks, while hashing includes computing a hash of the resulting chunk. A chunk may also be referred to as a sub-block. Comparing relatively smaller hashes (e.g., 128-bit cryptographic hash) can be more efficient than comparing relatively larger chunks (e.g., 1 kB, 128 kB, 1 MB) of data using a byte-by-byte approach. Therefore, conventional data reduction involved chunking larger pieces of data into smaller chunks, computing fingerprints (e.g., hashes) for the smaller chunks, and then comparing the fingerprints of the smaller chunks so that duplicate chunks did not have to be stored. The fingerprint of a chunk was compared to fingerprints of known, stored sub-blocks to determine whether the chunk was unique and should be stored and indexed or whether the chunk was a duplicate and could be accounted for without storing and indexing. Fingerprints were even transmitted between computers in lieu of actual sub-blocks or blocks to reduce bandwidth consumption.
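The chunk-then-fingerprint flow described above can be illustrated with a minimal sketch. It assumes fixed-length chunking and a 128-bit BLAKE2b digest as the fingerprint. The function names (chunk_fixed, fingerprint, store_unique) and the chunk size are hypothetical choices made for illustration only, not details taken from any particular conventional system.

```python
import hashlib


def chunk_fixed(data: bytes, size: int) -> list[bytes]:
    """Place boundaries every `size` bytes (fixed-length chunking)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(chunk: bytes) -> bytes:
    """Compute a 128-bit fingerprint of a chunk (assumed: BLAKE2b, 16-byte digest)."""
    return hashlib.blake2b(chunk, digest_size=16).digest()


def store_unique(data: bytes, size: int) -> list[bytes]:
    """Return only the chunks whose fingerprints have not been seen before."""
    seen = set()   # fingerprints of known, stored chunks
    stored = []    # stand-in for the chunks that would actually be stored
    for chunk in chunk_fixed(data, size):
        fp = fingerprint(chunk)
        if fp not in seen:      # compare small fingerprints, not whole chunks
            seen.add(fp)
            stored.append(chunk)
    return stored


# Three 1 kB chunks go in; the third repeats the first, so only two are kept.
sample = b"A" * 1024 + b"B" * 1024 + b"A" * 1024
kept = store_unique(sample, 1024)
```

In this sketch the duplicate is detected by comparing 16-byte fingerprints rather than 1 kB chunks, which is the efficiency argument made in the paragraph above.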
Conventionally, after boundaries for a chunk were placed, the data between the boundaries was hashed, often with a strong cryptographic hash. The hash was then used both to identify the chunk and to determine whether the chunk was a duplicate of an already stored chunk. Unique chunks were stored, and index entries were updated to record that the sub-block was now known, that its hash had been stored, and where the physical sub-block was actually stored. Dedupe based on absolute uniqueness as determined by strong cryptographic hashes provides benefits for data reduction but may become expensive, or even infeasible, with respect to data structure storage space and processing time.
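As an illustrative sketch of the index update described above, the following assumes SHA-256 as the strong cryptographic hash and an in-memory byte array as a stand-in for physical storage. The class name DedupeStore, its methods, and the (offset, length) form of the storage location are hypothetical and are used only to make the one-step duplicate/unique decision concrete.

```python
import hashlib


class DedupeStore:
    """Toy store: one index entry per unique chunk, keyed by the chunk's hash."""

    def __init__(self) -> None:
        self.index = {}          # hash -> (offset, length) of the stored chunk
        self.blob = bytearray()  # assumed stand-in for physical storage

    def put(self, chunk: bytes) -> bytes:
        """Hash the chunk, store it only if unique, and return its hash."""
        fp = hashlib.sha256(chunk).digest()
        if fp not in self.index:                     # binary decision in one lookup
            self.index[fp] = (len(self.blob), len(chunk))
            self.blob.extend(chunk)                  # unique: store and index
        return fp                                    # duplicate: just reference it

    def get(self, fp: bytes) -> bytes:
        """Use the index entry to locate and read back the stored chunk."""
        offset, length = self.index[fp]
        return bytes(self.blob[offset:offset + length])


store = DedupeStore()
recipe = [store.put(c) for c in (b"alpha", b"beta", b"alpha")]
restored = b"".join(store.get(fp) for fp in recipe)
```

The list of returned hashes acts as a recipe for rebuilding the original data, while the repeated chunk occupies storage only once.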
Fingerprints are indexed to facilitate retrieval, comparison, and searching. Dedupe indexes and other dedupe data structures have traditionally been populated with data that facilitated storing unique chunks, locating unique chunks, and making duplicate determinations. The traditional dedupe index has maintained a one-to-one relationship between unique chunks and their hashes and related data. Over time, as the amount of indexed data has grown, conventional hash-based processing and indexing have experienced challenges with respect to processing time and storage size. An index may grow so large that it is infeasible and/or impossible to hold it in the memory of a computer. Thus, conventional dedupe may suffer significant time delays when even a portion of the index is stored on disk instead of in memory. Additionally, as the index grows, it becomes increasingly difficult, or impossible, to distribute the index across collaborating nodes. These challenges are new because indexes and stored data sets of the size, nature, and complexity of those processed by de-duplication applications were previously unknown.
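A rough calculation shows how a one-to-one index can outgrow memory. The figures below are assumptions chosen only for illustration and do not come from any particular system: 1 PiB of unique data, an average chunk size of 8 KiB, and 32 bytes per index entry (a 16-byte hash plus 16 bytes of location and related data). The helper name index_size_bytes is likewise hypothetical.

```python
def index_size_bytes(unique_data_bytes: int, avg_chunk_bytes: int, entry_bytes: int) -> int:
    """Estimate index size when every unique chunk has exactly one index entry."""
    entries = -(-unique_data_bytes // avg_chunk_bytes)  # ceiling division
    return entries * entry_bytes


PIB = 2 ** 50
TIB = 2 ** 40

# Assumed figures: 1 PiB unique data, 8 KiB chunks, 16-byte hash + 16 bytes of metadata.
estimate = index_size_bytes(PIB, 8 * 1024, 16 + 16)
print(estimate / TIB)  # 4.0, i.e. about 4 TiB of index under these assumptions
```

Under these assumed figures the index alone would far exceed the memory of a typical single computer, which is consistent with the concern that part of the index must spill to disk.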