The present invention relates to deduplication techniques, and more specifically, to a method for deduplication of a file and an apparatus thereof.
Deduplication is a data reduction technique widely used in backing up and archiving data. It eliminates redundancy by deleting duplicate data within a data set (for example, a file), thereby reducing the storage capacity consumed.
Typically, redundant data can be divided into three types: File Level, Block Level and Byte Level. At the File Level, the redundant data is an entire file, which means one file is a duplicate of at least one other file. At the Block Level, the redundant data is blocks within files, which means there are identical data blocks among different files. At the Byte Level, the redundant data is finer-grained, down to individual bytes.
Deduplication methods exist for each of these three types of redundant data. Single Instance Storage is a deduplication technique for the File Level, which can effectively detect duplicate files. When a file is stored onto a content addressable storage (CAS) device, the CAS device generates a hash of the file content. If a file having an identical hash already exists in the CAS device, the CAS device creates a pointer that represents the duplicate and points to the already existing file, without saving the duplicate file. The deduplication technique for the Byte Level may effectively eliminate redundant bytes by data compression technology.
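The File Level behaviour described above can be illustrated with a minimal sketch. The class name `SingleInstanceStore`, its methods, and the choice of SHA-256 as the content hash are assumptions made for illustration only; they are not taken from any particular CAS device.

```python
import hashlib


class SingleInstanceStore:
    """Toy file-level deduplicating store (hypothetical, for illustration)."""

    def __init__(self):
        self.blobs = {}     # content hash -> file content, stored only once
        self.pointers = {}  # file name -> content hash (the "pointer")

    def put(self, name: str, content: bytes) -> bool:
        """Store a file; return True only if new content was actually written."""
        digest = hashlib.sha256(content).hexdigest()  # assumed hash function
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = content
        # A duplicate file is recorded only as a pointer to the existing content.
        self.pointers[name] = digest
        return is_new

    def get(self, name: str) -> bytes:
        """Follow the pointer for `name` back to the stored content."""
        return self.blobs[self.pointers[name]]
```

In this sketch, storing a second file with identical content adds one entry to `pointers` and nothing to `blobs`.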
A commonly used deduplication method is the hash-based method. In this method, the data file is first segmented into a set of data blocks, and a fingerprint (i.e., a hash value) is calculated for each data block. The fingerprint is then used as a key for a hash lookup. If a match is found, the data block is a duplicate data block, and only an index number of the data block is stored. If no match is found, the data block is a new data block; the data block is stored and its associated metadata is created.
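The hash-based method may be sketched as follows, under stated assumptions: blocks are cut at a fixed size, SHA-256 serves as the fingerprint, and the names `dedup_blocks`, `restore`, `index` and `store` are hypothetical. The "associated metadata" is reduced here to the mapping from fingerprint to index number.

```python
import hashlib


def dedup_blocks(data: bytes, block_size: int, index: dict, store: list) -> list:
    """Segment `data` into blocks and return the list of block index numbers.

    `index` maps fingerprint -> index number; `store` holds each unique block.
    """
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).digest()  # assumed fingerprint function
        if fingerprint not in index:
            # No match: a new data block, so store it and record its metadata.
            index[fingerprint] = len(store)
            store.append(block)
        # Match or not, the file itself only records the block's index number.
        recipe.append(index[fingerprint])
    return recipe


def restore(recipe: list, store: list) -> bytes:
    """Rebuild the original data from the stored index numbers."""
    return b"".join(store[i] for i in recipe)
```

For example, deduplicating `b"ABCDABCDEFGH"` with a block size of 4 stores two unique blocks and records three index numbers.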
There are three kinds of chunking methods, namely, fixed-size partition (FSP), content-defined chunking (CDC) and sliding-window blocking (SB). The FSP method partitions a file into data blocks of a fixed length, so that redundant data can be detected quickly. The CDC method and the SB method partition a file into data blocks of variable size based on the content of the file, so that redundant data can be found effectively.
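The difference between fixed-size and content-defined boundaries can be sketched as below. This is an illustrative simplification, not a description of any specific CDC or SB implementation: the boundary test (CRC-32 of a short trailing window, taken modulo a divisor) and all parameter names and default values are assumptions, and a practical implementation would normally use a rolling hash rather than recomputing the window hash at every position.

```python
import zlib


def fsp_chunks(data: bytes, size: int) -> list:
    """Fixed-size partition: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, window: int = 8, divisor: int = 16,
               min_size: int = 8, max_size: int = 64) -> list:
    """Content-defined chunking sketch: cut where the trailing window's hash
    satisfies a condition, subject to minimum and maximum chunk sizes."""
    if min_size < window:
        raise ValueError("min_size must be at least the window length")
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        # Hash only the last `window` bytes, so the decision depends on local
        # content rather than on the absolute offset within the file.
        fingerprint = zlib.crc32(data[i - window + 1:i + 1])
        if fingerprint % divisor == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder, may be short
    return chunks
```

Because each boundary decision in `cdc_chunks` looks only at nearby bytes, boundaries tend to follow the content when data is inserted or deleted, at the cost of evaluating a hash at many positions.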
However, the FSP method is very sensitive to file modification and becomes very inefficient when deduplication is performed on a file that was previously deduplicated and later modified in a way that changes its length, for example, when new data is inserted into the file or original data is deleted from the file. This is because the FSP method partitions the file into fixed-size blocks. One or more of the data blocks obtained by partitioning the modified file will contain the modified data. Compared with the data blocks obtained by partitioning the file before the modification, the contents of these data blocks have changed, so they become new data blocks. Moreover, since the length of each data block is fixed, the change in the length of the file shifts the contents of all subsequent data blocks, and these data blocks also become new data blocks. As a result, the redundant data blocks in the file are not actually detected, and the deduplication rate for the file is reduced.
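The boundary-shift effect can be reproduced with a small sketch. The function names, the use of SHA-256 as the fingerprint, and the sample data are assumptions chosen for illustration.

```python
import hashlib


def fsp_fingerprints(data: bytes, block_size: int) -> list:
    """Fingerprint every fixed-size block of `data` (assumed hash: SHA-256)."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def duplicate_ratio(old: bytes, new: bytes, block_size: int) -> float:
    """Fraction of blocks in `new` whose fingerprint already occurs in `old`."""
    known = set(fsp_fingerprints(old, block_size))
    new_fps = fsp_fingerprints(new, block_size)
    return sum(fp in known for fp in new_fps) / len(new_fps)


original = bytes(range(256)) * 4   # 1024 bytes of sample data
appended = original + b"X"         # modification at the end of the file
inserted = b"X" + original         # modification at the start of the file

print(duplicate_ratio(original, appended, 64))
print(duplicate_ratio(original, inserted, 64))
```

With this sample data and a block size of 64, appending one byte leaves 16 of the 17 blocks detectable as duplicates, whereas inserting one byte at the start shifts every block boundary so that no block matches, even though 1024 of the 1025 bytes are unchanged.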
Although the CDC method and the SB method can solve this problem, determining the size of the data blocks is relatively difficult in these methods, which makes their deduplication speed lower than that of the FSP method.