Remote replication in storage systems replicates logical volumes from a primary (also called ‘local’) site to a secondary (also called ‘remote’) site. A site can be a storage system or a part of a storage system. The storage system can be a mass storage system capable of storing multiple terabytes of information.
In asynchronous remote replication, batches of updates are periodically sent to the remote storage site. Each batch of updates is sent in a cycle, referred to as a replication cycle.
The content transmitted to the remote storage site at each replication cycle includes the differences that occurred in the logical volume to be replicated since the previous replication cycle. The term “difference” refers to data that was changed (updated, added or deleted) since the previous replication cycle, together with the respective range of addresses within the logical volume where the changed data is stored.
Each replication cycle is associated with a point in time. The content of a replication cycle can be calculated by comparing (a) a snapshot of the logical volume at the point in time that is associated with the replication cycle, to (b) a snapshot of the logical volume at the point in time that is associated with the last replication cycle that preceded the replication cycle.
However, the content of the replication cycle may be determined by using other techniques as well.
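The snapshot comparison described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a snapshot is modeled as a dictionary mapping a block address to the data stored at that address. The names `compute_cycle_content` and `Difference` are hypothetical and do not refer to any particular storage system.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model: a difference is (block address, new data),
# where new data is None if the block was deleted.
Difference = Tuple[int, Optional[bytes]]


def compute_cycle_content(
    prev_snapshot: Dict[int, bytes],
    curr_snapshot: Dict[int, bytes],
) -> List[Difference]:
    """Return the blocks that were updated, added or deleted between
    the snapshot of the preceding cycle and the snapshot of this cycle."""
    differences: List[Difference] = []
    # Visit every address that appears in either snapshot.
    for address in sorted(set(prev_snapshot) | set(curr_snapshot)):
        old = prev_snapshot.get(address)
        new = curr_snapshot.get(address)
        if old != new:
            differences.append((address, new))
    return differences


# Example: one block updated, one deleted, one added.
prev = {0: b"aaaa", 4: b"bbbb", 8: b"cccc"}
curr = {0: b"aaaa", 4: b"BBBB", 12: b"dddd"}
cycle_content = compute_cycle_content(prev, curr)
```

In this sketch the unchanged block at address 0 is not part of the cycle content, so only the changed address ranges would be transmitted to the remote storage site.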
The local storage system transmits all the content of the replication cycle (the differences) to the remote storage site. Upon successful completion of the replication cycle, after the content of the replication cycle is stored in the replicated volume, a snapshot of the replicated logical volume may also be taken at the remote storage site. This snapshot reflects a valid replica of the replicated volume and can be used to restore a consistent state of the replicated volume when replication is resumed after a failure and the consistency of the current version of the replicated volume is unknown.
Generally, when the local storage site gets disconnected from the remote storage site in the middle of a replication cycle, either due to a communication failure or due to a failure of either of the sites, there is a need to perform a recovery process.
During the recovery process, the remote storage site should first revert to a consistent state reflected by a snapshot of a previous replication cycle, and the local storage site must then transmit to the remote storage site the entire content of the interrupted replication cycle, because it is not known which of the differences were received and stored by the remote storage site.
Reverting to a consistent state at the remote storage site generally involves restoring the last snapshot that was taken before the failure, so that it becomes the working version of the replicated volume. A restore operation requires the involvement of a storage system administrator and is also a time-consuming operation that suspends the recovery process until the snapshot is restored and the remote storage site is ready to receive the recent differences.
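The conventional recovery flow described above can be sketched as follows. This is a simplified illustration only, assuming an in-memory model in which the replicated volume and its last snapshot are dictionaries; the class `RemoteSite` and the function `recover` are hypothetical names, and a real restore operation would be far more costly than the dictionary copy shown here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model: (block address, new data or None for a deletion).
Difference = Tuple[int, Optional[bytes]]


class RemoteSite:
    """Toy model of the remote storage site (illustrative only)."""

    def __init__(self) -> None:
        self.volume: Dict[int, bytes] = {}         # working replicated volume
        self.last_snapshot: Dict[int, bytes] = {}  # last consistent state

    def apply(self, diff: Difference) -> None:
        address, data = diff
        if data is None:
            self.volume.pop(address, None)
        else:
            self.volume[address] = data

    def complete_cycle(self) -> None:
        # Take a snapshot once the whole cycle content is stored.
        self.last_snapshot = dict(self.volume)

    def revert_to_last_snapshot(self) -> None:
        # Restore the last snapshot as the working version of the volume.
        self.volume = dict(self.last_snapshot)


def recover(remote: RemoteSite, interrupted_cycle: List[Difference]) -> None:
    """Revert the remote site, then resend the entire interrupted cycle."""
    remote.revert_to_last_snapshot()
    for diff in interrupted_cycle:  # everything is resent, not only the rest
        remote.apply(diff)
    remote.complete_cycle()


# Example: a failure occurs after only the first difference was stored.
remote = RemoteSite()
remote.apply((0, b"old"))
remote.complete_cycle()
cycle = [(0, b"new"), (4, b"more"), (8, b"data")]
remote.apply(cycle[0])  # partial transfer before the disconnection
recover(remote, cycle)
```

The sketch makes the cost of this approach visible: the first difference is transmitted twice, and no new differences can be accepted until the revert step has finished.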