A replica system stores the same data, or a portion of the same data, as an originating system. The replica system can be used to recover data when data in the originating system is corrupted or lost. For storage efficiency, both the replica and the originating system may be deduplicating systems, in which incoming data is broken up into segments; if a segment is already stored on the system, a reference to the already-stored segment is stored instead of storing the segment again. Deduplication typically results in a substantial (e.g., 10×) reduction in the amount of space required to store data for the system.
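As an illustration only, segment-level deduplication of the kind described above can be sketched as follows. The class name `DedupStore`, the fixed segment size, and the use of SHA-256 fingerprints as references are assumptions made for this sketch; they are not details given in the text, and real systems may segment and fingerprint data differently.

```python
import hashlib


class DedupStore:
    """Toy deduplicating store (hypothetical; for illustration only)."""

    def __init__(self, segment_size=4):
        self.segment_size = segment_size  # assumed fixed-size segmentation
        self.segments = {}  # fingerprint -> segment bytes, each stored once
        self.files = {}     # file name -> ordered list of references

    def write(self, name, data):
        """Break data into segments and store only segments not yet present."""
        refs = []
        for start in range(0, len(data), self.segment_size):
            segment = data[start:start + self.segment_size]
            fp = hashlib.sha256(segment).hexdigest()
            # Store the segment only if it is new; otherwise keep the old copy.
            self.segments.setdefault(fp, segment)
            refs.append(fp)
        self.files[name] = refs

    def read(self, name):
        """Reconstruct a file by following its references in order."""
        return b"".join(self.segments[fp] for fp in self.files[name])
```

In this sketch, writing `b"ABCDABCDABCD"` with a segment size of 4 records three references but stores a single segment, which is the source of the space savings.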
When first starting replication from one system to another, if the replica is to store all of the same data as the originating system, then the task is clear: transfer all the data over. This is efficient for a deduplicating system, since only the deduplicated segments and the references that enable file reconstruction need to be sent. However, if the replica is to store only a portion of the data on the originating system, then it is not obvious which of the stored segments need to be sent to the replica. One simple solution is to run through the list of references to segments for the portion of the data to be stored on the replica and, for each reference, ask the replica system whether the referred-to segment has already been stored. The segment is then transmitted only if it is not already on the replica system. However, this requires back-and-forth traffic for each reference in the list, as well as a check by the replica system for each reference to a segment. With deduplication, there may be many times more such references than there are actual data segments. It would be beneficial to be able to seed replication for a portion of data stored on a deduplicated system without generating traffic and checking for each reference to a segment.
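The cost of the simple per-reference approach can be made concrete with a small sketch. All names here (`Replica`, `has_segment`, `store_segment`, `naive_seed`) are hypothetical and chosen for illustration, and the network round trip is modeled as a plain method call with a counter.

```python
import hashlib


def fingerprint(segment):
    """Reference to a segment; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(segment).hexdigest()


class Replica:
    """Hypothetical replica that counts the queries it has to answer."""

    def __init__(self):
        self.segments = {}
        self.queries = 0

    def has_segment(self, fp):
        self.queries += 1  # one round trip and one lookup per reference
        return fp in self.segments

    def store_segment(self, fp, segment):
        self.segments[fp] = segment


def naive_seed(references, origin_segments, replica):
    """Ask the replica about every reference; send a segment only if missing."""
    sent = 0
    for fp in references:
        if not replica.has_segment(fp):
            replica.store_segment(fp, origin_segments[fp])
            sent += 1
    return sent


# Two distinct segments referenced eight times in total.
origin_segments = {fingerprint(s): s for s in (b"AAAA", b"BBBB")}
references = [fingerprint(b"AAAA")] * 5 + [fingerprint(b"BBBB")] * 3

replica = Replica()
sent = naive_seed(references, origin_segments, replica)
print(sent, replica.queries)  # prints "2 8"
```

Only two segments are transmitted, yet the replica answers eight queries, one per reference. With a higher deduplication ratio the gap between references checked and segments sent grows accordingly.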
An analogous situation exists when copying a portion of the data stored on one deduplicated system to a second deduplicated system on a one-time basis. All of the segments referenced by the portion of the data being copied need to be present on the second system. However, as above, checking each reference to see whether the corresponding segment needs to be sent to the second system can create substantial traffic between the two systems, since every reference must be checked. It would be beneficial to be able to copy a portion of data stored on a deduplicated system without generating traffic and checking for each reference to a segment.