Conventional hash-based de-duplication relies on a one-to-one hash-based index, which maintains a one-to-one relationship between hashes and blocks previously processed and stored by a de-duplication process. The index supports making a binary duplicate/unique decision for a sub-block in one logical step. The hash of a block is used as a key into the index. If there is a value in the entry at the location identified by the key (e.g., hash), then the block that generated the key is a duplicate; otherwise, the block is unique. Data de-duplication may be referred to as “dedupe”.
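The one-step lookup described above can be illustrated with a minimal sketch. The class name `OneToOneIndex`, its methods, the integer storage location, and the choice of BLAKE2b with a 16-byte digest are all assumptions made for illustration; no particular hash function or entry layout is specified here.

```python
import hashlib


class OneToOneIndex:
    """Hypothetical one-to-one index: one entry per unique block hash."""

    def __init__(self):
        # Maps a 16-byte (128-bit) hash to an illustrative storage location.
        self._entries = {}

    @staticmethod
    def key_for(block: bytes) -> bytes:
        # Assumption: BLAKE2b with a 16-byte digest stands in for the hash.
        return hashlib.blake2b(block, digest_size=16).digest()

    def is_duplicate(self, block: bytes) -> bool:
        # One logical step: the block is a duplicate if its key has an entry.
        return self.key_for(block) in self._entries

    def add(self, block: bytes, location: int) -> None:
        # Keep the first location recorded for a given hash.
        self._entries.setdefault(self.key_for(block), location)


index = OneToOneIndex()
first = index.is_duplicate(b"hello")   # no entry has been stored yet
index.add(b"hello", location=0)
second = index.is_duplicate(b"hello")  # an entry now exists for this key
```

Because the index holds exactly one entry per unique block, the number of entries grows in step with the amount of unique data stored.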
Conventional dedupe includes chunking a larger data item (e.g., object, file) into sub-blocks, computing hashes for the sub-blocks, and processing the hashes instead of the sub-blocks. Chunking includes selecting boundary locations for fixed and/or variable length sub-blocks, while hashing includes computing a hash of each resulting chunk. A chunk may also be referred to as a sub-block. Comparing relatively smaller hashes (e.g., 128-bit cryptographic hash) to make a unique/duplicate decision can be more efficient than comparing relatively larger chunks (e.g., 1 kB, 128 kB, 1 MB) of data using a byte-by-byte approach. Dedupe that makes unique/duplicate determinations using strong cryptographic hashes provides benefits for data reduction but may become expensive, or even infeasible, with respect to data structure storage space and processing time.
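The chunk-then-hash flow above might be sketched as follows. This is an illustrative example only: it uses fixed-length chunking (variable-length boundary selection is omitted), and the function names `fixed_chunks` and `dedupe`, the 1 kB default chunk size, and the BLAKE2b 16-byte digest are assumptions rather than details of any particular implementation.

```python
import hashlib


def fixed_chunks(data: bytes, size: int):
    """Yield consecutive fixed-length sub-blocks; the last may be shorter."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def dedupe(data: bytes, chunk_size: int = 1024):
    """Return (store, recipe) for a data item.

    store  maps each unique 128-bit hash to its chunk (stored once).
    recipe lists the hashes in order, enough to rebuild the item.
    """
    store = {}
    recipe = []
    for chunk in fixed_chunks(data, chunk_size):
        # Compare small hashes rather than the chunks themselves.
        digest = hashlib.blake2b(chunk, digest_size=16).digest()
        if digest not in store:
            store[digest] = chunk  # unique: keep the chunk
        recipe.append(digest)      # duplicate or not, record the reference
    return store, recipe


# Three 1 kB chunks, two of which are identical.
item = b"A" * 1024 + b"B" * 1024 + b"A" * 1024
store, recipe = dedupe(item, chunk_size=1024)
rebuilt = b"".join(store[d] for d in recipe)
```

In this sketch only two chunks are stored for the three-chunk item, and the ordered list of hashes is sufficient to reconstruct the original data.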
The traditional dedupe index has maintained a one-to-one relationship between unique chunks and their hashes and related data. Over time, as the amount of indexed data has grown, conventional hash-based processing and indexing have experienced challenges with respect to processing time and storage size. An index may grow so large that it is infeasible and/or impossible to hold it in the memory of a computer. Thus, conventional dedupe may suffer significant time delays when even a portion of the index is stored on disk instead of in memory. These challenges were previously unknown because indexes and stored data sets of the size, nature, and complexity of those processed by de-duplication applications had not previously been encountered.
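A rough back-of-the-envelope calculation shows how a one-to-one index can outgrow memory. The figures below (1 PiB of unique data, an 8 KiB average chunk, and 32 bytes per index entry) are purely hypothetical inputs chosen for illustration, as is the function name `index_size_bytes`.

```python
def index_size_bytes(unique_data_bytes: int,
                     avg_chunk_bytes: int,
                     entry_bytes: int) -> int:
    """Estimate the size of a one-to-one index: one entry per unique chunk."""
    # Ceiling division so a trailing partial chunk still gets an entry.
    unique_chunks = -(-unique_data_bytes // avg_chunk_bytes)
    return unique_chunks * entry_bytes


PIB = 2 ** 50
TIB = 2 ** 40

# Hypothetical inputs: 1 PiB unique data, 8 KiB chunks, 32-byte entries
# (e.g., a 16-byte hash plus 16 bytes of location and related data).
estimate = index_size_bytes(1 * PIB, 8 * 1024, 32)
estimate_tib = estimate / TIB
```

Under these assumed inputs the index alone would occupy about 4 TiB, which is why part of it may end up on disk rather than in memory.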