Data deduplication is a data management technique that eliminates the need to store duplicate copies of data on a data storage medium. Data deduplication involves scanning data units (e.g., data chunks) on a target storage medium, identifying a data chunk that has duplicate copies, maintaining a single copy of the data chunk, and eliminating the duplicate copies by replacing each of them with a reference (i.e., a pointer) to the retained copy. Data deduplication can be performed at the file level in a manner similar to that used at the chunk level. File-level deduplication involves calculating a signature for a target data file based on the content of the data file, detecting duplicate copies of the target data file by comparing the signature of the target data file with the signatures of other files, and deduplicating the duplicate copies.
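The chunk-level procedure described above can be illustrated with a minimal sketch. It assumes SHA-256 as the signature function and an in-memory dictionary as the storage medium; neither is specified in the text, and the function names `deduplicate_chunks` and `restore` are hypothetical.

```python
import hashlib


def deduplicate_chunks(chunks):
    """Keep one copy of each distinct chunk and a reference per input chunk.

    Assumption: SHA-256 stands in for the signature function.
    """
    store = {}  # signature -> the single retained copy of the chunk
    refs = []   # one reference (signature) per input chunk, in order
    for chunk in chunks:
        sig = hashlib.sha256(chunk).hexdigest()
        store.setdefault(sig, chunk)  # only the first copy is stored
        refs.append(sig)              # duplicates become references
    return store, refs


def restore(store, refs):
    """Rebuild the original byte sequence by following the references."""
    return b"".join(store[sig] for sig in refs)


chunks = [b"alpha", b"beta", b"alpha", b"alpha"]
store, refs = deduplicate_chunks(chunks)
```

Here four input chunks reduce to two stored chunks, while the list of references still allows the original sequence to be rebuilt.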
In a distributed storage system, a data file may be uploaded from a client machine to a storage server. Typically, the client machine first breaks a large data file down into multiple data chunks and then uploads the data chunks individually to the storage server over a network connection. Once the individual data chunks are received by the storage server, they have to be reassembled to reconstruct the data file. In a rudimentary scenario, the file is reconstructed from the multiple data chunks in a first phase, and then a signature is calculated for the entire file in a second phase. A more efficient option is to calculate the signatures for the multiple data chunks separately as the data chunks are received by the server and then calculate a signature for the file based on the collective signatures of the multiple data chunks.
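The more efficient option can be sketched as follows. This is one possible arrangement under stated assumptions: SHA-256 is used for both the chunk signatures and the file signature, the file signature is taken to be the hash of the concatenated chunk signatures, and the names `IncrementalFileSignature` and `upload` are hypothetical.

```python
import hashlib


class IncrementalFileSignature:
    """Derives a file signature from chunk signatures as chunks arrive.

    Assumption: the file signature is SHA-256 over the concatenated
    SHA-256 chunk signatures; the text does not fix a particular scheme.
    """

    def __init__(self):
        self._chunk_sigs = []

    def add_chunk(self, chunk: bytes) -> str:
        # Sign each chunk on receipt; the chunk itself need not be kept
        # in memory for the purpose of computing the file signature.
        sig = hashlib.sha256(chunk).digest()
        self._chunk_sigs.append(sig)
        return sig.hex()

    def file_signature(self) -> str:
        # Combine the collected chunk signatures into one file signature.
        return hashlib.sha256(b"".join(self._chunk_sigs)).hexdigest()


def upload(data: bytes, chunk_size: int) -> str:
    """Simulate a client uploading `data` in fixed-size chunks."""
    signer = IncrementalFileSignature()
    for start in range(0, len(data), chunk_size):
        signer.add_chunk(data[start:start + chunk_size])
    return signer.file_signature()


first = upload(b"example file contents", 8)
second = upload(b"example file contents", 8)
other = upload(b"different file contents", 8)
```

With the same chunk size, identical content yields identical file signatures (`first == second`), and different content yields a different signature, without a separate pass over the reassembled file.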
The latter option works well in a data storage system in which deduplication is performed across files uploaded by a single client machine that uses the same data chunk size for every upload. However, in a data storage network where different client machines use different data chunk sizes to upload files, the signature values calculated for two identical files may not be the same. This is because the chunk size determines which bytes fall into each data chunk and therefore affects the signature value calculated for that chunk. As such, the signature values for identical files, when calculated from the collective signatures of data chunks of different sizes, will not be the same. In such a scenario, deduplication cannot be performed correctly, because a signature match will not be detected between two identical files that are uploaded in different-size data chunks.
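The mismatch can be reproduced with a short sketch. It assumes SHA-256 chunk signatures combined by hashing their concatenation; the function name `composite_signature` and the chunk sizes 4096 and 1024 are illustrative choices, not values taken from the text.

```python
import hashlib


def composite_signature(data: bytes, chunk_size: int) -> str:
    """File signature built from per-chunk signatures.

    Assumption: SHA-256 per chunk, then SHA-256 over the concatenated
    chunk signatures.
    """
    chunk_sigs = [
        hashlib.sha256(data[start:start + chunk_size]).digest()
        for start in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(chunk_sigs)).hexdigest()


data = bytes(range(256)) * 64  # one 16 KiB file, uploaded twice

sig_client_a = composite_signature(data, 4096)  # client A: 4 KiB chunks
sig_client_b = composite_signature(data, 1024)  # client B: 1 KiB chunks

# Same file, same chunk size: the signatures agree.
same_size_match = composite_signature(data, 4096) == sig_client_a
# Same file, different chunk sizes: the signatures do not agree.
cross_size_match = sig_client_a == sig_client_b
```

In this sketch `same_size_match` is true and `cross_size_match` is false, so a server comparing only these composite signatures would treat the two uploads as different files.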