Deduplication can reduce the size of backup copies of data objects such as file systems, thereby making more efficient use of network resources. Deduplication reduces backup copy size by eliminating duplicated data. Only one unique copy of multiple instances of the same data is actually retained on a backup storage device, such as disk or tape storage. For example, a typical file system might contain 100 instances of the same one megabyte (MB) file. If the file system is copied to backup storage without deduplication, all 100 instances are saved. With deduplication, only one instance of the file is copied to backup storage; each subsequent instance is stored as a reference back to the one saved copy. In this example, a 100 MB backup storage demand could be reduced to one MB. Deduplication can also reduce the throughput requirements of a data communication link (e.g., a wide area network or WAN) coupled to backup storage; less bandwidth is needed for the WAN to transmit one MB to or from backup storage than to transmit 100 MB.
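The 100-instance example can be illustrated with a minimal sketch of file level deduplication. It assumes that duplicate files are detected by comparing SHA-256 digests of their contents and that one MB is 1024 * 1024 bytes; the function name `dedupe_files` and the paths are hypothetical and serve only to illustrate the idea.

```python
import hashlib

MB = 1024 * 1024  # assumption: one MB taken as 1024 * 1024 bytes


def dedupe_files(files):
    """Keep one copy per unique file content (hypothetical helper).

    files: dict mapping path -> bytes.
    Returns (store, refs): store maps digest -> content,
    refs maps each path -> digest of the single saved copy.
    """
    store = {}
    refs = {}
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)  # saved only the first time it is seen
        refs[path] = digest                # later instances are references only
    return store, refs


# 100 instances of the same one MB file, as in the example above.
payload = b"x" * MB
files = {f"/fs/copy_{i}.bin": payload for i in range(100)}

store, refs = dedupe_files(files)
raw_bytes = sum(len(content) for content in files.values())
stored_bytes = sum(len(content) for content in store.values())

print(raw_bytes // MB, "MB before deduplication")
print(stored_bytes // MB, "MB after deduplication")
```

In this sketch the backup store holds a single one MB copy, while all 100 paths remain recoverable through their references.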
Deduplication generally operates at the file or block level. File deduplication eliminates duplicate files, but this is not a very efficient means of deduplication, since a file that differs from another by even a few bits is treated as an entirely new file. Block deduplication looks within a file and saves unique instances of each block, i.e., each portion of the file of a predetermined size (e.g., 512 bits). If a file is subsequently updated, only the changed data is saved during the next backup operation. That is, if only a few bits of a file are changed, then only the changed blocks are saved; the changes do not constitute an entirely new file. This behavior makes block level deduplication far more efficient. Various embodiments of deduplication will be described herein with reference to block level deduplication of file system backup copies. However, the deduplication described herein may be employed with respect to data objects other than file system backup copies.
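Block level behavior can be sketched in the same way. The following is a minimal illustration, not the method of any particular embodiment: it assumes fixed-size blocks of 512 bits (64 bytes), identifies blocks by SHA-256 digest, and uses hypothetical names (`backup`, `restore`, `store`) throughout.

```python
import hashlib

BLOCK_SIZE = 64  # bytes, i.e. 512 bits; an assumed fixed block size


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut data into consecutive fixed-size blocks (last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def backup(data, store, block_size=BLOCK_SIZE):
    """Save only blocks not already in store.

    Returns (recipe, new_blocks): recipe is the ordered list of block
    digests needed to rebuild data; new_blocks counts blocks actually saved.
    """
    recipe = []
    new_blocks = 0
    for block in split_blocks(data, block_size):
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = block
            new_blocks += 1
        recipe.append(digest)
    return recipe, new_blocks


def restore(recipe, store):
    """Rebuild the original bytes from a recipe of block digests."""
    return b"".join(store[digest] for digest in recipe)


store = {}

# A 1024-byte file made of 16 distinct 64-byte blocks.
original = b"".join(i.to_bytes(4, "big") for i in range(256))
recipe_v1, saved_v1 = backup(original, store)

# Change a single byte, which falls inside the second block.
updated = bytearray(original)
updated[100] ^= 0xFF
recipe_v2, saved_v2 = backup(bytes(updated), store)

print("blocks saved by first backup:", saved_v1)
print("blocks saved by second backup:", saved_v2)
```

Under these assumptions the first backup saves every block, while the second backup saves only the one block containing the changed byte; both versions of the file can still be rebuilt from their recipes.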
Cloud storage services are becoming popular. In the context of enterprise backup, file systems are typically much too large to permit reasonable backup durations, and some form of deduplication may be needed. A cloud storage service might implement deduplication, but this is only a convenience for the cloud storage service provider and does not address critical problems such as throughput to the cloud storage service over a WAN as experienced by a data protection application such as backup/restore.
Data deduplication could be implemented inline between the backup/restore application and the cloud storage service. In such an embodiment, however, all data passes through the inline deduplication on its way to and from the cloud storage service. Inline deduplication may be limiting in several ways. In particular, all backup and restore operations must have access to the device (e.g., an appliance) that implements deduplication, and thus a file system cannot be restored absent the inline data deduplication appliance. This can be problematic, especially in the context of restoration after data corruption, given that successful restoration is entirely dependent on proper functioning and data integrity of the inline deduplication appliance.