Disaster recovery (“DR”) is one of the most pressing issues faced by the storage industry today. DR generally refers to solutions for recovering critical data and/or resuming operation of storage systems and other technology infrastructure. Various factors may be considered when architecting a DR solution. Examples of these factors may include satisfying Service Level Agreements (“SLA”), meeting a tolerable Recovery Point Objective (“RPO”), and/or meeting a tolerable Recovery Time Objective (“RTO”). Other factors may include the affordability, ease, robustness, reliability, and manageability of each particular solution.
A conventional solution for recovering lost data in the event of a disaster is storage replication, in which data is written to multiple storage devices across a computer network. Storage replication may include synchronous replication and asynchronous replication. In synchronous replication, each I/O operation by an application server to a primary storage device is replicated on a secondary storage device before the primary storage device acknowledges the application server. This acknowledgement is made after the I/O operations on both the primary storage device and the secondary storage device are completed. In this way, the primary storage device and the secondary storage device are always “synchronized.” In asynchronous replication, the primary storage device acknowledges the application server upon completing each I/O operation without waiting for the secondary storage device to replicate the I/O operation. The application server can then continue performing additional I/O operations on the primary storage device. The I/O operations completed on the primary storage device may be replicated on the secondary storage device according to a specified replication rate.
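The difference between the two acknowledgement orderings described above can be illustrated with a minimal sketch. The class names (`Device`, `SyncReplicator`, `AsyncReplicator`), the in-memory block map, and the `max_ops` parameter standing in for a replication rate are all hypothetical and chosen only for illustration; a real storage stack would involve network transport and error handling that are omitted here.

```python
from collections import deque


class Device:
    """Hypothetical in-memory stand-in for a storage device."""

    def __init__(self):
        self.blocks = {}

    def write(self, lba, data):
        self.blocks[lba] = data


class SyncReplicator:
    """Acknowledges only after both devices have completed the write."""

    def __init__(self, primary, secondary):
        self.primary, self.secondary = primary, secondary

    def write(self, lba, data):
        self.primary.write(lba, data)
        self.secondary.write(lba, data)
        return "ack"  # returned only once both writes are complete


class AsyncReplicator:
    """Acknowledges after the primary write; the secondary catches up later."""

    def __init__(self, primary, secondary):
        self.primary, self.secondary = primary, secondary
        self.pending = deque()  # writes not yet replicated

    def write(self, lba, data):
        self.primary.write(lba, data)
        self.pending.append((lba, data))
        return "ack"  # returned without waiting for the secondary

    def replicate(self, max_ops):
        """Send up to max_ops queued writes to the secondary; return the count sent."""
        sent = 0
        while self.pending and sent < max_ops:
            lba, data = self.pending.popleft()
            self.secondary.write(lba, data)
            sent += 1
        return sent
```

In this sketch, the secondary device under `AsyncReplicator` lags the primary by whatever remains in `pending`, which is the data that could be lost if the primary fails before the next call to `replicate`.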
While synchronous replication keeps the data on the primary and secondary storage devices consistent during normal operation, synchronization can be severed when one or both of the storage devices fail or when the network connecting the storage devices fails. In such instances, even if the two storage devices were synchronized at the moment of failure, they can fall out of sync because of I/O traffic still arriving from the application server. If either storage device continues to receive I/O operations from the application server, the difference between the two storage devices keeps increasing. This difference, known as a “tab,” may be maintained in the memory of the active storage device so that the other storage device can be synchronized when it becomes available again. The difference may also be persisted on a non-volatile medium, such as disk, so that the tab information is not lost across power failures. The difference stored on disk, known as the “gate,” is persisted through a write-intent logging mechanism that records the intention to perform a write I/O operation to the disk before performing it.
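One way to picture the relationship between the tab and the gate is the following minimal sketch. It assumes that the difference is tracked per region number, that the gate is a small JSON file, and that devices are plain dictionaries; the class name `WriteIntentTracker` and its methods are hypothetical and are not taken from any particular product.

```python
import json
import os


class WriteIntentTracker:
    """Tracks regions that may differ between two devices (illustrative only)."""

    def __init__(self, gate_path):
        self.gate_path = gate_path
        self.tab = set()  # in-memory difference (the "tab")
        if os.path.exists(gate_path):
            # Recover the difference from the on-disk "gate" after a restart.
            with open(gate_path) as f:
                self.tab = set(json.load(f))

    def _persist_gate(self):
        # Write to a temporary file, flush it to disk, then rename atomically.
        tmp = self.gate_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(sorted(self.tab), f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.gate_path)

    def write(self, device, region, data):
        if region not in self.tab:
            self.tab.add(region)
            self._persist_gate()  # record the intent before performing the write
        device[region] = data

    def resync(self, source, target):
        # Copy only the regions recorded in the tab, then clear tab and gate.
        for region in sorted(self.tab):
            target[region] = source[region]
        self.tab.clear()
        self._persist_gate()
```

Because the intent is persisted before the data write, a power failure between the two steps leaves a region marked as possibly different even if the write never happened, which errs on the side of resynchronizing too much rather than too little.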
The difference information, i.e., the tab and the gate, often includes much more than the differences created after the communication failure between the two storage devices. For example, the difference information may also include a record of all the I/O operations that happened prior to the failure and might have been held on volatile cache memory of either storage device. Since this information, which was previously synchronized but not persisted to the non-volatile media, could be lost due to a power failure in the storage devices, this information may also be tabbed and gated. Thus, the operation of tabbing and gating, while necessary, may often result in excess data traffic during the re-synchronization of the storage devices, thereby wasting bandwidth and processing cycles.
Some implementations of asynchronous replication utilize snapshots, which are point-in-time images of a given storage volume. Snapshots may be taken at a specified snapshot rate on a primary storage device and replicated on a secondary storage device across a network at a specified replication rate. In some cases, the primary storage device and the secondary storage device may have different retention rates, which specify the amount of time that snapshots are stored on the respective storage devices. For example, the secondary storage device may store fewer snapshots than the primary storage device.
During a DR scenario, the primary storage device may revert to a snapshot taken prior to the failure. In order to synchronize the primary storage device and the secondary storage device, the secondary storage device may also need to revert to the same snapshot. However, if the primary storage device and the secondary storage device have different retention rates, then the secondary storage device may have already deleted that snapshot. In conventional implementations, the secondary storage device reverts to the earliest stored snapshot that corresponds to a matching snapshot in the primary storage device. Replication is then repeated from this snapshot forward. In the worst case, the secondary storage device reverts to a base blank volume, and replication is repeated entirely from the beginning. As such, these conventional implementations can be wasteful in terms of bandwidth, time, cost, and the like.
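The conventional selection described above can be sketched as follows. The sketch assumes that snapshots are identified by integers that increase with creation time and that `None` stands for the base blank volume; the function names `pick_resync_point` and `snapshots_to_replay` are hypothetical.

```python
def pick_resync_point(primary_snapshots, secondary_snapshots):
    """Return the snapshot the secondary reverts to, or None for the base blank volume.

    Mirrors the conventional behavior described in the text: the earliest
    snapshot stored on the secondary that also exists on the primary.
    """
    common = set(primary_snapshots) & set(secondary_snapshots)
    if not common:
        return None  # worst case: no matching snapshot remains
    return min(common)


def snapshots_to_replay(primary_snapshots, resync_point):
    """List the primary snapshots that must be replicated again, oldest first."""
    if resync_point is None:
        return sorted(primary_snapshots)  # replication repeats from the beginning
    return sorted(s for s in primary_snapshots if s > resync_point)
```

For example, with primary snapshots `[1, 2, 3, 4, 5]` and a secondary that retains only `[4, 5]`, the sketch selects snapshot 4 and replays snapshot 5; if the two devices share no snapshot, every primary snapshot is replayed.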
It is with respect to these and other considerations that the disclosure made herein is presented.