Replication is a critical feature for disaster recovery appliances. There are numerous configurations in which data is transmitted across a network for disaster recovery purposes: pairs of offices protecting each other, satellite offices transmitting to headquarters, and satellite offices transmitting to relay stations that consolidate the data and then transmit it to one or more national data centers. Communication may occur over low-bandwidth links because customers are located in inhospitable locations, such as offshore or in forests. The goal for disaster recovery purposes is to improve data compression during replication so that more data can be protected within a backup window.
The challenge is to transfer all of the logical data (e.g., all files within the retention period) while reducing the amount of data transmitted as much as possible. Storage appliances achieve high compression by transferring metadata that can reconstruct all of the files based on strong fingerprints of data chunks, followed by the unique data chunks. Since there is often a large amount of redundancy within backup data sets, even within modified files, 10× or greater compression can be achieved by sending only the unique data chunks. A data chunk, or simply a chunk, is a partition of data used in the deduplication process. Prior to storing a file in a storage system, the file is segmented into multiple chunks using a chunking algorithm, and only the non-duplicate chunks are stored. A fingerprint of a chunk is used to represent or identify the chunk. The fingerprint is generated by hashing the content of the chunk using a hash function such as SHA-1 or MD5.
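By way of illustration only, the chunking and fingerprinting described above may be sketched as follows. The sketch assumes fixed-size chunking and SHA-1 fingerprints; the chunk size and the names `chunk_data`, `fingerprint`, `store_file`, and `restore_file` are hypothetical and are not taken from any particular storage appliance. A practical system may use a different chunking algorithm, such as variable-size chunking.

```python
import hashlib

# Hypothetical chunk size, kept tiny so the example is easy to follow.
CHUNK_SIZE = 8


def chunk_data(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Segment data into fixed-size chunks (an assumed chunking algorithm)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(chunk: bytes) -> str:
    """Generate a fingerprint by hashing the chunk content with SHA-1."""
    return hashlib.sha1(chunk).hexdigest()


def store_file(data: bytes, chunk_store: dict[str, bytes]) -> list[str]:
    """Store only non-duplicate chunks; return the ordered fingerprint list."""
    recipe = []
    for chunk in chunk_data(data):
        fp = fingerprint(chunk)
        if fp not in chunk_store:
            # Only a chunk that has not been seen before is stored.
            chunk_store[fp] = chunk
        recipe.append(fp)
    return recipe


def restore_file(recipe: list[str], chunk_store: dict[str, bytes]) -> bytes:
    """Reconstruct the file from its fingerprint list and the chunk store."""
    return b"".join(chunk_store[fp] for fp in recipe)
```

In this sketch, the ordered fingerprint list plays the role of the metadata from which a file can be reconstructed, while the chunk store holds each unique chunk exactly once.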
FIG. 1 is a block diagram illustrating a conventional method of data replication over a network. Referring to FIG. 1, typically, a source storage system transmits to a target storage system a list of fingerprints representing data chunks for replication. A fingerprint may be generated by hashing at least a portion of the content of a data chunk. The target storage system then determines, based on the fingerprints, which of the data chunks have already been stored locally. The target storage system then replies with a list of one or more fingerprints representing one or more data chunks that are not stored locally. The source storage system then transmits the missing data chunks to the target storage system accordingly. As the deduplication rate improves, the number of data chunks transmitted over the network can be reduced. However, a significant number of fingerprints (also referred to as metadata) may still be transmitted over the network. As a result, fingerprints become a larger percentage of the data transmitted over the network.
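For illustration only, the conventional exchange described above may be sketched as follows. The sketch assumes 20-byte SHA-1 fingerprints and models the target storage system as an in-memory dictionary; the names `target_find_missing` and `replicate` are hypothetical, and the byte counts ignore any protocol framing overhead.

```python
import hashlib


def fingerprint(chunk: bytes) -> bytes:
    """Fingerprint a chunk with SHA-1 (20-byte digest)."""
    return hashlib.sha1(chunk).digest()


def target_find_missing(fingerprints: list[bytes],
                        target_store: dict[bytes, bytes]) -> list[bytes]:
    """Target side: return fingerprints of chunks not stored locally."""
    missing, seen = [], set()
    for fp in fingerprints:
        if fp not in target_store and fp not in seen:
            seen.add(fp)
            missing.append(fp)
    return missing


def replicate(chunks: list[bytes],
              target_store: dict[bytes, bytes]) -> tuple[int, int]:
    """Source side: run one replication pass.

    Returns (metadata_bytes, data_bytes) sent over the simulated network.
    """
    # Step 1: the source sends a fingerprint for every chunk to replicate.
    fps = [fingerprint(c) for c in chunks]
    by_fp = dict(zip(fps, chunks))

    # Step 2: the target replies with fingerprints of the missing chunks.
    missing = target_find_missing(fps, target_store)

    # Step 3: the source sends only the missing chunks.
    data_bytes = 0
    for fp in missing:
        target_store[fp] = by_fp[fp]
        data_bytes += len(by_fp[fp])

    # Metadata counts the fingerprint list plus the target's reply.
    metadata_bytes = sum(len(fp) for fp in fps) + sum(len(fp) for fp in missing)
    return metadata_bytes, data_bytes
```

Under these assumptions, when every chunk is already present on the target, no chunk data is sent, yet the full fingerprint list is still transmitted, so metadata accounts for all of the traffic in that pass.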