The present invention relates to data deduplication, and more specifically, to data deduplication links in separate Hamming circles having a predetermined Hamming link-separation-distance to prevent erroneous data deduplication linking.
Data deduplication typically provides up to a 100:1 reduction of backed-up data by identifying repetitive storage of identical data and eliminating the duplicate copies. The data deduplication operation identifies the duplicate data and then replaces it with a link that points to the original copy of the duplicated block of data (for block-based deduplication) or the duplicated file (for file-based deduplication). The data is evaluated by conventional methods, such as hashing or delta differencing, to identify duplicate data. Conventional hash algorithms used to calculate the hash code include Message Digest 5 (MD5) and SHA-256. Duplicate data may also be identified by conducting a cyclic redundancy check (CRC).
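By way of illustration only, conventional block-based deduplication by hashing may be sketched as follows. This is a minimal sketch rather than any particular product's implementation: the function names `deduplicate` and `restore`, the in-memory dictionary used as the block store, and the use of the hex digest itself as the link are all assumptions made for the example.

```python
import hashlib


def deduplicate(blocks):
    """Store each unique block once and return (store, links).

    Hypothetical helper: `store` maps a SHA-256 hex digest to the first
    (parent) copy of a block, and `links` holds one digest per input block.
    """
    store = {}
    links = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            # First occurrence: keep the data as the parent copy.
            store[digest] = block
        # Every block, duplicate or not, is represented by a link.
        links.append(digest)
    return store, links


def restore(store, links):
    """Rebuild the original block sequence by following each link."""
    return [store[link] for link in links]


blocks = [b"alpha", b"beta", b"alpha", b"alpha"]
store, links = deduplicate(blocks)
restored = restore(store, links)
```

In this sketch, the three copies of `b"alpha"` occupy a single entry in `store`, and correct restoration depends entirely on each link resolving to the right parent block.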
There are several problems associated with the conventional methods. One problem is that hash collisions may occur, where two different pieces of data have identical hash digests and hence identical links. In addition, conventional methods do not address the possibility of hash digests differing by only 1 or 2 bits. Such “nearly identical” hash digests may present a serious problem when the error correction code (ECC) can correct more bits than the number of bits by which the hash digests differ. Thus, in the conventional methods, a deduplication link may erroneously point to the wrong parent data, thereby causing a subsequent loss of customer data.
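The "nearly identical" condition can be made concrete by counting the bits in which two digests differ, that is, their Hamming distance. The following is a minimal sketch under stated assumptions: `hamming_distance` and `link_is_ambiguous` are hypothetical names, the digests are short made-up byte strings rather than real hash outputs, and the comparison simply mirrors the condition described above (the ECC can correct more bits than the number of bits by which the digests differ).

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length digests differ."""
    if len(a) != len(b):
        raise ValueError("digests must have equal length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def link_is_ambiguous(a: bytes, b: bytes, ecc_correctable_bits: int) -> bool:
    """Return True if two distinct digests differ by fewer bits than the
    ECC can correct (hypothetical check mirroring the stated condition)."""
    distance = hamming_distance(a, b)
    return 0 < distance < ecc_correctable_bits


# Illustrative 2-byte "digests" that differ in exactly one bit.
digest_a = bytes.fromhex("00ff")
digest_b = bytes.fromhex("01ff")

distance = hamming_distance(digest_a, digest_b)
ambiguous = link_is_ambiguous(digest_a, digest_b, ecc_correctable_bits=3)
```

Here the two digests differ in a single bit, so an ECC assumed to correct up to 3 bits could turn a corrupted copy of one link into the other, which is the erroneous-linking scenario described above.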
There is a desire to provide a method for data deduplication which prevents the above-mentioned problems associated with the conventional methods.