A data de-duplication system removes redundant data from data storage so that more data can be stored in the existing storage capacity. The total storage overhead is reduced by replacing each piece of redundant data with a pointer or link.
Existing data de-duplication systems may employ a data chunk-based technique for deleting redundant data. In the data chunk partition stage, a sliding window is used to determine the boundaries between data chunks. For example, a data fingerprint of the data within the sliding window may be calculated with the Rabin fingerprint algorithm. If the calculated result satisfies a certain condition, the start of the window is flagged as the end of a data chunk. A data object is partitioned into data chunks by repeatedly sliding the window and calculating data fingerprints. A HASH value is then calculated for each data chunk. By comparing the HASH values of the current data chunks with those of recorded data chunks, it can be determined whether redundant data chunks exist.
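The partition and comparison stages described above can be illustrated with a minimal sketch. It is not the implementation of any particular system. It assumes a simple polynomial rolling hash in place of a true Rabin fingerprint and SHA-256 as the per-chunk HASH. For simplicity it places the cut after the last byte of the window. The names `WINDOW`, `MASK`, `BASE`, `MOD`, `chunk`, `deduplicate`, and `restore` are hypothetical, and the parameter values are deliberately small so that short inputs produce several chunks.

```python
import hashlib

# All constants below are illustrative assumptions, not values from the text.
WINDOW = 16            # sliding-window width in bytes
MASK = (1 << 6) - 1    # boundary when the low 6 bits of the rolling hash are zero
BASE = 257             # polynomial base of the stand-in rolling hash
MOD = (1 << 61) - 1    # modulus of the stand-in rolling hash


def chunk(data: bytes) -> list:
    """Split data into content-defined chunks using a sliding window."""
    chunks = []
    start = 0                           # first byte of the chunk being built
    h = 0                               # rolling hash of the current window
    top = pow(BASE, WINDOW - 1, MOD)    # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # Slide the window: drop the contribution of the oldest byte.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        # Test the boundary condition only once the window is full.
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])     # trailing chunk without a boundary
    return chunks


def deduplicate(objects):
    """Store each distinct chunk once; describe each object by chunk hashes."""
    store = {}      # chunk HASH -> chunk bytes (the recorded data chunks)
    recipes = []    # per object: ordered list of chunk HASH values (the links)
    for obj in objects:
        recipe = []
        for c in chunk(obj):
            digest = hashlib.sha256(c).hexdigest()
            store.setdefault(digest, c)   # a redundant chunk is not stored again
            recipe.append(digest)
        recipes.append(recipe)
    return store, recipes


def restore(store, recipe) -> bytes:
    """Rebuild an object by following its chunk links."""
    return b"".join(store[d] for d in recipe)
```

In this sketch, passing the same object twice to `deduplicate` adds no new entries to `store` on the second pass, since every chunk HASH is already recorded. Only the list of links grows.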
When a data object is processed through a fingerprint algorithm, for example the Rabin fingerprint algorithm, a set of bytes (also called a byte string) would theoretically have a unique 64-bit Rabin fingerprint HASH value. When the last 18 bits of the calculated HASH value (called the residual value) are all zero, a boundary between data chunks is considered to have been found in the set of bytes, and the corresponding set of bytes is called “a data chunk.” In other words, a boundary is expected about once in every 2^18 HASH calculations, so a data chunk contains 256K bytes on average, i.e., the size of a standard data chunk is 256K bytes. Therefore, the predetermined residual value determines the average size of data chunks and the de-duplication ratio that the data de-duplication system can reach. A fingerprint mask may be selected for use in searching for the residual value of a set of bytes. The fingerprint mask is a random value within a predetermined range.
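The relationship between the residual value and the average chunk size can be checked with a short sketch. It assumes that the fingerprint mask simply selects the low 18 bits and that fingerprint values are uniformly distributed. The text itself only states that the mask is a random value within a predetermined range. The names `RESIDUE_BITS`, `FINGERPRINT_MASK`, `is_boundary`, and `expected_chunk_size` are hypothetical.

```python
RESIDUE_BITS = 18                            # number of bits tested, as in the text
FINGERPRINT_MASK = (1 << RESIDUE_BITS) - 1   # assumed mask: the low 18 bits


def is_boundary(fingerprint: int, mask: int = FINGERPRINT_MASK,
                residue: int = 0) -> bool:
    """Return True when the masked fingerprint equals the residual value."""
    return (fingerprint & mask) == residue


def expected_chunk_size(mask: int) -> int:
    """Average bytes per chunk, assuming uniformly distributed fingerprints.

    Each set bit in the mask halves the chance that a position is a boundary,
    so a mask with k set bits yields one boundary per 2**k positions on average.
    """
    return 1 << bin(mask).count("1")


print(expected_chunk_size(FINGERPRINT_MASK))   # 262144, i.e. 256 * 1024 bytes
```

Under these assumptions, widening the mask by one bit doubles the average chunk size and narrowing it by one bit halves it. This is why the choice of residual value affects the achievable de-duplication ratio.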
The prior art attempts to improve the de-duplication ratio by changing the data chunk partition algorithm. However, since the data chunk distribution derived for a given data object is unique, the capability of finding redundant chunks is limited. Further, since the distribution of repetitive data in a data object generally cannot be known in advance, it is impossible to devise a data chunk partition algorithm that achieves a higher de-duplication ratio for various kinds of data objects.
Therefore, it is desirable to provide a novel data de-duplication solution so as to at least partially solve the technical problems existing in the prior art.