1. Technical Field
The present invention relates in general to the field of data deduplication, and more particularly to a method of and system for adaptively selecting an optimum deduplication chunking method for files of a particular type.
2. Description of the Related Art
A goal in data storage is to reduce the amount of space required to store the data. One method of reducing required storage space is data deduplication. In data deduplication, a data object, which may be a file, a data stream, or some other form of data, is broken down into one or more chunks using a chunking method. A hash is calculated for each chunk using any of several known hashing techniques. The hashes of all chunks are compared for duplicates. Duplicate hashes mean either that the data chunks are identical or that there has been a hash collision. A hash collision occurs when different chunks produce the same hash. To detect hash collisions, so that different chunks are not mistaken for duplicates, other techniques such as bit-by-bit comparison may be performed. After the hashes have been compared and the uniqueness of the chunks has been established, unique chunks are stored. Chunks that are duplicates of already stored chunks are not stored; rather, such chunks are referenced by pointers to the already stored chunks.
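By way of illustration only, the steps described above might be sketched in Python as follows. The sketch assumes fixed-size chunking and SHA-256 hashing, neither of which is prescribed by the description above, and the names fixed_size_chunks, deduplicate, and restore are hypothetical. A byte-wise comparison stands in for the bit-by-bit comparison used to detect hash collisions.

```python
import hashlib


def fixed_size_chunks(data: bytes, size: int):
    """Split data into consecutive chunks of at most `size` bytes (assumed chunking method)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def deduplicate(data: bytes, chunk_size: int = 4):
    """Return (store, pointers) for the given data object.

    store maps each hash to the list of distinct chunks that produced it,
    so that colliding chunks are kept apart. pointers holds one
    (hash, index) reference per chunk of the original data, in order.
    """
    store = {}
    pointers = []
    for chunk in fixed_size_chunks(data, chunk_size):
        digest = hashlib.sha256(chunk).hexdigest()  # assumed hashing technique
        bucket = store.setdefault(digest, [])
        for idx, stored in enumerate(bucket):
            if stored == chunk:  # byte-wise check: a true duplicate, not a collision
                break
        else:
            # No identical chunk is stored under this hash: store the chunk.
            bucket.append(chunk)
            idx = len(bucket) - 1
        pointers.append((digest, idx))
    return store, pointers


def restore(store, pointers):
    """Rebuild the original data object by following the pointers."""
    return b"".join(store[digest][idx] for digest, idx in pointers)
```

For example, calling deduplicate(b"ABCDABCDEFGHABCD", 4) stores only the two unique chunks b"ABCD" and b"EFGH", while four pointers suffice for restore to rebuild the original sixteen bytes.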
Data deduplication can yield storage space reductions of 20:1 or more. However, the deduplication ratio is highly dependent upon the method used to chunk the data. Several chunking techniques have been developed. Each chunking method is thought to be optimum for a set of file types. However, a particular chunking method may not in fact be optimum for a particular file type.
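The dependence of the deduplication ratio on the chunking method can be illustrated with a minimal Python sketch. The sketch makes several assumptions that are not part of the description above: fixed-size chunking with two different chunk sizes stands in for two different chunking methods, SHA-256 is used as the hash, hash collisions are ignored, and the ratio is taken as original size divided by the total size of the unique chunks, without counting pointer overhead. The name dedup_ratio and the sample data are hypothetical.

```python
import hashlib


def dedup_ratio(data: bytes, chunk_size: int) -> float:
    """Original size divided by the total size of unique chunks.

    Assumes fixed-size chunking and ignores pointer and metadata overhead.
    """
    if not data:
        return 1.0
    unique = {}
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        # Keep the first chunk seen for each hash (collisions are ignored here).
        unique.setdefault(hashlib.sha256(chunk).digest(), chunk)
    stored = sum(len(chunk) for chunk in unique.values())
    return len(data) / stored


# Sample data: an 8-byte pattern repeated 8 times (64 bytes in total).
data = b"ABCDEFGH" * 8

# Chunk boundaries aligned with the pattern: every chunk is identical.
ratio_aligned = dedup_ratio(data, 8)

# Chunk boundaries that drift across the pattern: fewer chunks repeat.
ratio_misaligned = dedup_ratio(data, 6)
```

In this sketch the same 64 bytes reduce to 8 stored bytes with a chunk size of 8 and to 28 stored bytes with a chunk size of 6, so the choice of chunking alone changes the ratio from 8:1 to roughly 2.3:1.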