1. Field of the Invention
The present invention relates generally to removing redundant data, and in particular to reducing data transmission for server side data de-duplication.
2. Background Information
De-duplication processes partition data objects into smaller parts (named “chunks”) and retain only the unique chunks in a dictionary (repository) of chunks. To be able to reconstruct the object, a list of hashes (indexes or metadata) of the unique chunks is stored in place of original objects. The list of hashes is customarily ignored in the de-duplication compression ratios reported by various de-duplication product vendors. That is, vendors typically only report the unique chunk data size versus original size.
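By way of illustration only, the partitioning and dictionary scheme described above can be sketched as follows. The sketch assumes fixed-size chunks and SHA-1 as the hash function; neither is specified above, and the names `deduplicate`, `reconstruct`, and `repository` are hypothetical.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int, repository: dict) -> list:
    """Partition data into chunks, keep only unique chunks, return the hash list.

    Assumptions (not from the text): fixed-size chunking and SHA-1 hashing.
    """
    hashes = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha1(chunk).digest()
        # Store the chunk only if this hash has not been seen before.
        repository.setdefault(digest, chunk)
        # The ordered hash list is what is stored in place of the object.
        hashes.append(digest)
    return hashes


def reconstruct(hashes: list, repository: dict) -> bytes:
    """Rebuild the original object by looking up each hash in the repository."""
    return b"".join(repository[h] for h in hashes)


repository = {}
original = b"abcd" * 4 + b"wxyz"
hash_list = deduplicate(original, chunk_size=4, repository=repository)
restored = reconstruct(hash_list, repository)
```

In this example the object is split into five chunks, of which only two are unique, so the repository holds two chunks while the hash list still holds five entries, one per chunk.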
The list of hashes is relatively larger when smaller chunks are employed. Smaller chunks are more likely to match and can be used to achieve higher compression ratios. Known de-duplication systems try to diminish the significance of index metadata by using large chunk sizes, and therefore accept lower overall compression ratios. Also, standard compression methods (LZ, Gzip, Compress, Bzip2, etc.) perform poorly when applied to the list of hashes.
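The dependence of the hash list size on the chunk size can be illustrated with a short calculation. The sketch below assumes one fixed-length hash per chunk and a 20-byte hash (the length of a SHA-1 digest); the function name `hash_list_overhead` and the example sizes are hypothetical and chosen only for illustration.

```python
def hash_list_overhead(object_size: int, chunk_size: int, hash_size: int = 20) -> float:
    """Return the hash list size as a fraction of the original object size.

    Assumptions (not from the text): one hash per chunk, 20-byte hashes.
    """
    num_chunks = -(-object_size // chunk_size)  # ceiling division
    return num_chunks * hash_size / object_size


one_gib = 2 ** 30
large_chunk_overhead = hash_list_overhead(one_gib, chunk_size=8192)
small_chunk_overhead = hash_list_overhead(one_gib, chunk_size=512)
```

Under these assumptions, the hash list for a 1 GiB object is about 0.24% of the object size with 8 KiB chunks and about 3.9% with 512-byte chunks, a sixteen-fold increase in index metadata for a sixteen-fold reduction in chunk size.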
In order to reduce bandwidth requirements from client to server, (hash-based) data de-duplication has to be performed at the client. Client side data de-duplication has the following drawbacks: 1) it is difficult to deploy, as client side data de-duplication requires tighter integration into existing applications and systems; 2) it is difficult to perform a direct comparison when using hashing methods in client side data de-duplication, and delta differencing requires a large local cache, which might not be available in a resource-limited client.
When client side data de-duplication is not possible, the alternative is to perform data de-duplication at the server. In server side data de-duplication, data is transmitted over the link from the client to the server before de-duplication, and therefore redundant data still consumes bandwidth on that link.