Disaster recovery (“DR”) is one of the most pressing issues faced by the storage industry today. DR generally refers to solutions for recovering critical data and/or resuming operation of storage systems and other technology infrastructure. Various factors may be considered when architecting a DR solution. Examples of these factors may include satisfying Service Level Agreements (“SLA”), meeting a tolerable Recovery Point Objective (“RPO”), and/or meeting a tolerable Recovery Time Objective (“RTO”). Other factors may include the affordability, ease of use, robustness, reliability, and manageability of each particular solution.
RPO generally refers to an acceptable amount of data loss, measured in time relative to when a disaster occurs. More particularly, RPO may represent the point in time from which an entity should be able to recover stored data. For example, if an entity establishes the RPO as four hours, the entity should be able to recover any stored data that existed at least four hours prior to the disaster. In other words, the entity has established that the loss of data less than four hours old is acceptable.
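The four-hour example can be expressed as simple date arithmetic. The following is a minimal sketch only; the function names `recovery_cutoff` and `must_be_recoverable`, as well as the sample timestamps, are hypothetical and chosen purely for illustration.

```python
from datetime import datetime, timedelta


def recovery_cutoff(disaster_time: datetime, rpo: timedelta) -> datetime:
    """Return the latest point in time whose data must still be recoverable."""
    return disaster_time - rpo


def must_be_recoverable(write_time: datetime, disaster_time: datetime,
                        rpo: timedelta) -> bool:
    """True if data written at write_time is old enough that the RPO covers it."""
    return write_time <= recovery_cutoff(disaster_time, rpo)


# Illustrative values (assumptions): a disaster at noon with a four-hour RPO.
disaster = datetime(2024, 1, 1, 12, 0)
rpo = timedelta(hours=4)
cutoff = recovery_cutoff(disaster, rpo)
```

Under these assumed values, data written at 07:00 falls before the cutoff and should be recoverable, while the loss of data written at 10:00 would be considered acceptable.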
A conventional solution for recovering lost data in the event of a disaster is storage replication, in which data is written to multiple storage devices across a computer network. Storage replication may be performed at a desired replication rate (i.e., the frequency at which data is replicated), and the replication rate may be configured and adjusted in order to satisfy the established RPO. For example, a higher replication rate may correspond to a lower RPO, while a lower replication rate may correspond to a higher RPO. Further, a higher replication rate may result in a higher number of recovery points from which an entity can recover lost data, while a lower replication rate may result in a lower number of recovery points.
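The relationship between replication rate, RPO, and the number of recovery points may be sketched as follows. This is a simplified model under stated assumptions, not a definitive implementation: replication is assumed to occur at a fixed interval, each replication is assumed to complete instantaneously, and the function names are hypothetical.

```python
from datetime import timedelta


def worst_case_loss(interval: timedelta) -> timedelta:
    """Worst-case age of unreplicated data when a disaster occurs.

    Assumption: replication completes instantaneously, so at most one
    full interval of writes has not yet been replicated.
    """
    return interval


def satisfies_rpo(interval: timedelta, rpo: timedelta) -> bool:
    """True if replicating once per interval keeps data loss within the RPO."""
    return worst_case_loss(interval) <= rpo


def recovery_points(window: timedelta, interval: timedelta) -> int:
    """Number of recovery points created during a retention window."""
    return window // interval


rpo = timedelta(hours=4)
hourly_ok = satisfies_rpo(timedelta(hours=1), rpo)       # higher rate
six_hourly_ok = satisfies_rpo(timedelta(hours=6), rpo)   # lower rate
```

In this model, replicating every hour satisfies a four-hour RPO and yields 24 recovery points per day, whereas replicating every six hours does not satisfy the RPO and yields only four.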
Storage replication may include synchronous replication and asynchronous replication. In synchronous replication, when a primary storage device finishes writing a first chunk of data, a secondary storage device must finish writing the first chunk of data before the primary storage device can begin writing a second chunk of data. A drawback of synchronous replication is the latency incurred while the primary storage device copies the first chunk of data, transfers the first chunk of data across the computer network, and waits for the secondary storage device to finish writing the first chunk of data.
In asynchronous replication, after the primary storage device finishes writing the first chunk of data, the primary storage device can begin writing the second chunk of data without waiting for the secondary storage device to finish writing the first chunk of data. While asynchronous replication does not experience the latency of synchronous replication, a drawback of asynchronous replication is potential data loss caused when the primary storage device fails before the secondary storage device completes writing the data. This potential data loss can be particularly troublesome if the secondary storage device suffers any lag caused by high input/output (“I/O”) load on the primary storage device, reduced network link speed, network link failures, and the like. In particular, as a result of the lag, the secondary storage device may not be able to maintain the desired replication rate, and thereby may not be able to satisfy the established RPO.
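The trade-off between the two modes may be illustrated with a small deterministic timing model. This sketch rests on several assumptions that do not come from the description above: all durations are fixed integers in milliseconds, the secondary storage device handles one chunk at a time, and the function and parameter names are hypothetical.

```python
def synchronous_write_time(chunks: int, local_ms: int, network_ms: int,
                           remote_ms: int) -> int:
    """Total time when each chunk must be written remotely before the next begins."""
    return chunks * (local_ms + network_ms + remote_ms)


def asynchronous_write_time(chunks: int, local_ms: int) -> int:
    """Total time seen by the primary, which never waits on the secondary."""
    return chunks * local_ms


def unreplicated_chunks_at_failure(failure_ms: int, local_ms: int,
                                   replica_ms: int) -> int:
    """Chunks written by the primary but not yet on the secondary at failure.

    replica_ms is the assumed time to transfer and write one chunk remotely.
    """
    written = failure_ms // local_ms
    replicated = 0
    secondary_free_at = 0
    for k in range(1, written + 1):
        # Chunk k is available once the primary finishes it at k * local_ms,
        # and the secondary can start only after its previous chunk is done.
        start = max(k * local_ms, secondary_free_at)
        finish = start + replica_ms
        if finish > failure_ms:
            break
        secondary_free_at = finish
        replicated += 1
    return written - replicated
```

With the assumed values of ten chunks, 10 ms local writes, 5 ms of network transfer, and 10 ms remote writes, the synchronous path takes 250 ms while the asynchronous path takes 100 ms. If the primary fails at 100 ms, one chunk is unreplicated when the secondary keeps pace (10 ms per chunk), but seven chunks are unreplicated when lag raises the secondary's per-chunk time to 25 ms.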
It is with respect to these and other considerations that the disclosure made herein is presented.