Deduplication can reduce the size of data objects such as backup copies of file systems. Deduplication reduces backup copy size by eliminating duplicated data, so that only one copy of multiple instances of the same data is actually retained on backup storage. For example, a typical file system might contain 100 instances of the same one-megabyte (MB) file. If the file system is copied to backup storage without deduplication, all 100 instances are saved. With deduplication, only one instance of the file is copied to backup storage; each subsequent instance is stored as a reference back to the one saved copy. Thus, a 100 MB backup storage demand could be reduced to one MB. Deduplication can also reduce the throughput requirements of a data communication link coupled to backup storage; less bandwidth is needed to transmit one MB to or from backup storage than to transmit 100 MB.
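The 100-instance example above can be sketched in a few lines of Python. This is only an illustrative sketch, not a description of any particular embodiment: the function name `dedup_files`, the file paths, and the use of SHA-256 digests to recognize identical content are all assumptions introduced here.

```python
import hashlib

MB = 1024 * 1024


def dedup_files(files):
    """Hypothetical file-level deduplication.

    files: mapping of path -> file content (bytes).
    Returns (store, index): store keeps one copy per unique content,
    index maps every path to the digest of its stored copy.
    """
    store = {}  # digest -> content, one entry per unique file
    index = {}  # path -> digest, a reference back to the saved copy
    for path, content in files.items():
        # Assumption: a SHA-256 digest identifies identical content.
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)  # keep only the first instance
        index[path] = digest
    return store, index


# 100 instances of the same 1 MB file (paths are made up for illustration).
payload = bytes(MB)
files = {f"/data/copy_{i}.bin": payload for i in range(100)}

store, index = dedup_files(files)
raw_bytes = sum(len(content) for content in files.values())
stored_bytes = sum(len(content) for content in store.values())
print(raw_bytes // MB, "MB before,", stored_bytes // MB, "MB after")
```

Every path still resolves to its content through `index`, but only one copy of the payload occupies backup storage and only that copy would need to cross the communication link.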
Deduplication generally operates at the file or block level. File deduplication eliminates duplicate files, but it is not a very efficient means of deduplication, because a change to any part of a file causes the entire file to be saved again. Block deduplication looks within a file and saves unique instances of each block, i.e., each portion of a predetermined size (e.g., 512 bytes). If a file is subsequently updated, only the changed data is saved during the next backup operation. That is, if only a few bits of a file are changed, then only the changed block or blocks are saved. This behavior makes block level deduplication far more efficient. Various embodiments of deduplication will be described herein with reference to block level deduplication of file system backup copies. However, the deduplication described herein may be employed with respect to data objects other than file system backup copies.
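The block level behavior described above can be illustrated with a minimal Python sketch. It assumes fixed-size 512-byte blocks and, as an assumption not taken from the text, SHA-256 digests as block fingerprints; the names `backup_blocks`, `restore`, and `BLOCK_SIZE` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 512  # predetermined block size in bytes


def backup_blocks(data, store):
    """Hypothetical block-level backup of one data object.

    Splits data into fixed-size blocks, adds only previously unseen
    blocks to store (digest -> block), and returns (recipe, new_blocks),
    where recipe is the ordered list of digests needed to rebuild data.
    """
    recipe = []
    new_blocks = 0
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        # Assumption: a SHA-256 digest serves as the block fingerprint.
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = block  # save only blocks not seen before
            new_blocks += 1
        recipe.append(digest)
    return recipe, new_blocks


def restore(recipe, store):
    """Rebuild a data object from its ordered block digests."""
    return b"".join(store[digest] for digest in recipe)


store = {}

# First backup: a file made of 8 distinct 512-byte blocks.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(8))
recipe_v1, new_v1 = backup_blocks(original, store)

# Update the file by flipping a single bit inside the second block.
updated = bytearray(original)
updated[1000] ^= 0x01
recipe_v2, new_v2 = backup_blocks(bytes(updated), store)

print("blocks saved by first backup:", new_v1)
print("blocks saved by second backup:", new_v2)
```

In this sketch the second backup adds a single block to `store`, while both versions of the file remain restorable from their respective recipes.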
Cloud storage services are becoming popular with businesses seeking to reduce their storage costs. A cloud storage service might implement deduplication, but this is only a convenience for the cloud storage service provider; it does not address critical problems such as the throughput to the cloud storage service over a wide area network (WAN) as experienced by a data protection application such as backup/restore.