Data deduplication aims to eliminate data redundancy and thereby reduce the data storage footprint and/or data transfer volume. It has become an indispensable feature in almost all storage archive/backup systems and many front-end computing/storage systems. The basic principle of data deduplication can be described as follows. First, files are split into multiple chunks, where the chunks may have the same or different sizes (typically a few kB or larger) depending on whether content awareness is incorporated into the chunking process. In general, content-aware chunking (hence variable chunk size) tends to achieve better data deduplication efficiency. Content-aware chunking is typically realized by using a rolling hash scheme such as Rabin fingerprinting.
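As an illustration of content-aware chunking, the following is a minimal sketch that places chunk boundaries wherever a rolling hash over the last few bytes matches a bit pattern. It uses a simple polynomial rolling hash as a stand-in for Rabin fingerprinting; the window size, mask, and size limits (`WINDOW`, `MASK`, `MIN_SIZE`, `MAX_SIZE`) are illustrative assumptions, not values taken from any particular system.

```python
WINDOW = 48            # bytes covered by the rolling hash (assumed value)
BASE = 257             # polynomial base (assumed value)
MOD = (1 << 61) - 1    # large prime modulus
MASK = (1 << 11) - 1   # boundary when the low 11 bits are zero (~2 KiB apart on average)
MIN_SIZE = 512         # smallest chunk allowed before a boundary is accepted
MAX_SIZE = 8192        # force a boundary if no match is found by this length


def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of variable-size, content-defined chunks."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    start = 0
    h = 0
    for i, b in enumerate(data):
        pos = i - start  # offset of this byte inside the current chunk
        if pos >= WINDOW:
            # Slide the window: drop the contribution of the oldest byte.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        length = pos + 1
        if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
            yield start, i + 1
            start = i + 1
            h = 0  # restart the hash for the next chunk
    if start < len(data):
        yield start, len(data)  # trailing chunk, possibly shorter than MIN_SIZE
```

Because a boundary depends only on the bytes inside the hash window, a small insertion near the start of a file tends to shift only nearby boundaries, whereas fixed-size chunking would shift every later chunk.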
For each data chunk, data deduplication aims to discover and exploit the redundancy between the current chunk and the other chunks that have been stored or transferred in the system. This is realized in either a locality-based or a similarity-based manner, where the former focuses only on exactly identical chunks and the latter considers both identical and highly similar chunks.
In the context of locality-based data deduplication, for each individual chunk, a locality-oriented signature (or fingerprint) with a reasonably small size (e.g., 20 bytes or 32 bytes) is calculated using a hash function (e.g., SHA-1 or SHA-256). The signature is used to determine whether two chunks are identical (i.e., if the signatures of two same-sized chunks are identical, then the two chunks are considered to be identical).
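The fingerprint comparison can be sketched as follows, using SHA-256 from Python's standard `hashlib` module. The `ChunkStore` class, its fields, and the per-file "recipe" list are hypothetical names introduced only for this illustration; a real system would persist these structures rather than keep them in memory.

```python
import hashlib


class ChunkStore:
    """Toy in-memory store that keeps one copy per distinct chunk fingerprint."""

    def __init__(self):
        self.chunks = {}   # fingerprint (32 bytes) -> chunk contents
        self.recipes = {}  # file name -> ordered list of fingerprints

    def put(self, chunk: bytes) -> bytes:
        """Store the chunk if its fingerprint is new; return the fingerprint."""
        fingerprint = hashlib.sha256(chunk).digest()
        # Chunks with equal fingerprints are treated as identical,
        # so only the first copy is physically kept.
        self.chunks.setdefault(fingerprint, chunk)
        return fingerprint

    def write_file(self, name: str, chunks) -> None:
        """Record a file as the sequence of fingerprints of its chunks."""
        self.recipes[name] = [self.put(c) for c in chunks]

    def read_file(self, name: str) -> bytes:
        """Rebuild a file by concatenating its chunks in recipe order."""
        return b"".join(self.chunks[fp] for fp in self.recipes[name])
```

For example, writing two files whose chunk lists are `[b"alpha", b"beta", b"alpha"]` and `[b"beta", b"gamma"]` leaves three stored chunks, not five, while both files can still be read back in full.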
In the context of similarity-based data deduplication, more complicated hashing schemes are used to calculate similarity-oriented signatures for each chunk. The signatures are used to determine whether two chunks are identical or highly similar.
Once multiple identical data chunks are identified, the system can physically store or transfer only a single copy through appropriate data structure management. For similarity-based data deduplication, once multiple highly similar data chunks are identified, the system can physically store or transfer only a single copy plus the inter-chunk differences through appropriate data structure management.
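The text does not name a specific similarity-oriented hashing scheme. As one possible illustration, the sketch below uses MinHash over overlapping byte substrings ("shingles"), which is an assumption chosen for this example and not necessarily what any given system uses. The function names, the shingle width, and the number of hash functions are likewise hypothetical.

```python
import hashlib


def shingles(chunk: bytes, width: int = 8) -> set:
    """Return the set of overlapping byte substrings of the given width."""
    if len(chunk) <= width:
        return {chunk}
    return {chunk[i:i + width] for i in range(len(chunk) - width + 1)}


def similarity_signature(chunk: bytes, num_hashes: int = 16) -> tuple:
    """MinHash signature: for each salted hash, keep the minimum over all shingles."""
    parts = shingles(chunk)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(2, "big")  # a different salt gives a different hash function
        signature.append(
            min(hashlib.blake2b(s, digest_size=8, salt=salt).digest() for s in parts)
        )
    return tuple(signature)


def estimated_similarity(sig_a: tuple, sig_b: tuple) -> float:
    """Fraction of matching positions, an estimate of shingle-set overlap."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two chunks that differ in only a few bytes share almost all shingles, so most signature positions match. A system could then store the new chunk as a difference against the similar base chunk; the threshold for "highly similar" is a tuning choice.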
The system maintains a signature index that covers all, or a portion, of the data chunks that have been stored or transferred. One critical operation is to determine, given the signature of a new data chunk, whether this signature already exists in the current signature index. Such a signature index look-up can be very time- and resource-consuming and hence degrade the overall data deduplication speed, especially in large-scale systems. In practical implementations, a Bloom filter is typically used to eliminate unnecessary signature index look-ups and hence speed up data deduplication. The objective of the Bloom filter is to quickly identify, without carrying out any index look-up, those signatures that are definitely not in the current signature index. This can obviate a significant number of costly and unnecessary signature index look-up operations. The core operation of a Bloom filter is to apply several (e.g., k) hash functions to the signature in order to obtain k integers, h1, ..., hk, whose values all fall into a given range [0, m-1]. If any one of the k bits at positions h1, ..., hk in the m-bit summary vector is 0, then it is guaranteed that this signature is not in the current signature index. If all k bits are 1, the signature may be in the index (false positives are possible), so the index look-up is still required. For each signature added to the signature index, the corresponding k bits in the summary vector are set to 1.
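The Bloom filter operation described above can be sketched as follows. The class and function names are hypothetical, and deriving the k positions from two halves of a single SHA-256 digest (double hashing) is an assumption made to keep the example short; any k hash functions with outputs in [0, m-1] would serve.

```python
import hashlib


class BloomFilter:
    """m-bit summary vector with k hash positions per signature."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all bits start at 0

    def _positions(self, signature: bytes) -> list:
        """Return k integers h1..hk in [0, m-1] for the given signature."""
        digest = hashlib.sha256(signature).digest()
        a = int.from_bytes(digest[:8], "big")
        b = int.from_bytes(digest[8:16], "big") | 1  # odd step so positions vary
        return [(a + i * b) % self.m for i in range(self.k)]

    def add(self, signature: bytes) -> None:
        """Set the k bits for a signature being added to the index."""
        for h in self._positions(signature):
            self.bits[h >> 3] |= 1 << (h & 7)

    def might_contain(self, signature: bytes) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[h >> 3] & (1 << (h & 7)) for h in self._positions(signature))


def is_duplicate(signature: bytes, bloom: BloomFilter, index: set) -> bool:
    """Consult the (costly) index only when the filter cannot rule the signature out."""
    if not bloom.might_contain(signature):
        return False  # definitely new: the index look-up is skipped
    return signature in index
```

Here a Python `set` stands in for the signature index. In a large-scale system the index would typically reside on slower storage, which is where skipping look-ups for new signatures pays off.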