Replication is typically employed as part of a data backup and recovery storage strategy and, as such, denotes the movement of data from a source storage space of a source domain to a target storage space of a target domain via a communications network (e.g., a computer network) in a way that enables recovery of applications from the target storage space. As used herein, recovery denotes loading of the applications on possibly different hosts (e.g., computers) where they can access the target storage space, instead of the source storage space, so that the applications are loaded to a valid state. Also, storage space denotes any storage medium having addresses that enable data to be accessed in a stable way and, as such, may apply to file system access, block access and any other storage access means.
The source domain contains at least the source storage space, but may also contain the hosts, a switching fabric and any source replication components situated outside of those components. In this context, a component may be a physical entity (e.g., a special replication appliance) and/or a software entity (e.g., a device driver). In remote disaster recovery, for example, the source domain includes an entire geographical site, but may likewise span multiple geographical sites. The target domain includes all of the remaining components relevant for replication services, including the target storage space. In addition, a replication facility includes components that may be located in both the source and target domains.
The replication facility typically has at least one component, i.e., a write interception component, which intercepts storage requests (e.g., write operations or “writes”) issued by a host to the source storage space, prior to sending the intercepted writes to the target storage space. The write interception component is typically embedded within a computing unit configured as a source replication node. When issuing a write, an application executing on the host specifies an address on the storage space, as well as the contents (i.e., write data) with which the storage space address is to be set. The write interception component may be implemented in various locations in the source domain depending on the actual replication service; such implementations may include, e.g., a device driver in the host, logic in the switching fabric, and a component within the source domain, e.g., a source storage system. The write interception component is typically located “in-band”, e.g., between the host and the source storage system, although there are environments in which the component may be located “out-of-band”, where a separate physical component, such as an appliance server, in the source domain receives duplicate writes by utilizing, e.g., an in-band splitter.
Synchronous replication is a replication service wherein a write is not acknowledged until the write data associated with the write is processed by the source storage space, propagated to the target domain and persistently stored on the target storage space. An advantage of synchronous replication is the currency of the target domain data; that is, at any point in time, the writes stored on the target domain are identical to those stored on the source domain. However, a disadvantage of this replication service is the latency or propagation delay associated with communicating the writes to the target domain, which limits the synchronous replication service in terms of distance, performance and scalability.
An asynchronous replication service reduces such latency by requiring that the write only be processed by the source storage space without having to wait for persistent storage of the write on the target storage space. In other words, the write is acknowledged once its associated write data is processed by the source storage space; afterwards, the write (and write data) are propagated to the target domain. Thus, this replication service is not limited by distance, performance or scalability and, therefore, is often preferred over synchronous replication services. A disadvantage of the asynchronous replication service, though, is the possibility of incurring data loss should the source storage space fail before the write data has been propagated and stored on the target storage space.
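By way of illustration only, the difference in acknowledgement timing between the two services may be sketched as follows. The sketch is a toy model and not an actual replication implementation: the class `Replicator`, its dictionary-based storage spaces and the `propagate` method are hypothetical names introduced solely for this example.

```python
from collections import deque


class Replicator:
    """Toy model (hypothetical): storage spaces are dicts of address -> data."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.source = {}
        self.target = {}
        self.pending = deque()  # writes acknowledged but not yet on the target

    def write(self, address, data):
        """Process a write and return once it may be acknowledged."""
        self.source[address] = data
        if self.synchronous:
            # Synchronous: the target stores the write before the acknowledgement.
            self.target[address] = data
        else:
            # Asynchronous: acknowledge now, propagate afterwards.
            self.pending.append((address, data))
        return "ack"

    def propagate(self):
        """Drain pending writes to the target in arrival order."""
        while self.pending:
            address, data = self.pending.popleft()
            self.target[address] = data


sync = Replicator(synchronous=True)
sync.write(0, b"a")

async_ = Replicator(synchronous=False)
async_.write(0, b"a")
# Writes that would be lost if the source failed at this instant.
lost_if_source_fails_now = len(async_.pending)
```

In this model the synchronous target is current as soon as the write is acknowledged, whereas the asynchronous target lags until `propagate` runs, which corresponds to the window of possible data loss described above.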
Prior asynchronous replication services may be classified into a plurality of techniques or styles, one of which is group stamping. According to this replication style, the write interception component intercepts all writes (e.g., synchronously before an acknowledgement is returned to the application) and buffers the intercepted writes. Instead of attempting to establish a relative order among all the writes, the group stamping style service establishes an interval, e.g., either in time or by trigger, and all writes that are intercepted by the write interception component within the interval are recorded to a current group of writes. Notably, the current group is defined by buffering writes during the established interval and associating metadata with the entire group without the need to associate the metadata with each write. The metadata may be an actual timestamp or, more likely, a timeless ordering mechanism (e.g., a sequence number).
Thereafter, according to a predetermined policy or other conditions, the write interception component declares the current group completed and records all subsequent writes to a newly established group. The current group of writes is propagated to the target domain and persistently buffered therein prior to being applied to the target storage space. The group stamping style is typically employed by asynchronous replication services because of its lack of concern with the actual order of writes within an interval; group stamping is generally only concerned with the fact that the writes belong to a same interval.
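The group stamping style described above may be sketched, purely as an illustrative assumption, as follows. The class name `GroupStampingInterceptor` and its methods are hypothetical; the sketch uses a sequence number as the timeless ordering metadata and attaches it to the group as a whole rather than to each write.

```python
class GroupStampingInterceptor:
    """Hypothetical write interception component using group stamping."""

    def __init__(self):
        self.sequence = 0    # metadata of the current group (sequence number)
        self.current = []    # writes intercepted during the current interval
        self.completed = []  # (sequence, writes) groups awaiting propagation

    def intercept(self, address, data):
        # Writes are recorded to the current group; no per-write metadata.
        self.current.append((address, data))

    def close_interval(self):
        """Declare the current group completed and establish a new group."""
        self.completed.append((self.sequence, self.current))
        self.sequence += 1
        self.current = []


icpt = GroupStampingInterceptor()
icpt.intercept(5, b"a")
icpt.intercept(9, b"b")
icpt.close_interval()      # interval ends by time or by trigger
icpt.intercept(5, b"c")    # recorded to the newly established group
```

What triggers `close_interval` (a timer, a size threshold or another policy) is deliberately left outside the sketch, since the text leaves the policy open.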
The replication services may be further adapted to planned recovery or unplanned recovery. Planned recovery is defined herein as an act of recovery where components, e.g., hardware and software, of the source domain are fully operational, whereas unplanned recovery is defined as recovery that takes place when the source components are fully and/or partially non-operational. As used herein, the source domain describes all of the components whose failure/unavailability should not impair the ability to do unplanned recovery.
For unplanned recovery services utilizing the group stamping style, an entire group of writes is propagated to the target domain for storage on the target storage space in a manner that ensures consistency in light of an intervening disaster. For example, the writes are propagated to an intermediate staging area on the target domain to ensure that the target storage space can be “rolled back” to a consistent state if a disaster occurs. The replication services may utilize various intermediate staging areas (such as a persistent log or non-volatile memory) to buffer the writes in a safe and reliable manner on the target domain. In some cases, the intermediate staging area is the target storage space itself and consistent snapshots of, e.g., target volumes of the storage space are generated. In the event of a disaster, a snapshot of the target volume(s) is used rather than the “current” content of the target volume(s).
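One way to picture an intermediate staging area is the following minimal sketch, which assumes a simple in-memory log in place of a persistent log or non-volatile memory. The class `StagedTarget` and the `seal` signal (by which the source would indicate that a group has been sent in full) are hypothetical constructs for this example only.

```python
class StagedTarget:
    """Hypothetical target domain with a staging log in front of the volume."""

    def __init__(self):
        self.volume = {}     # target storage space: address -> data
        self.staging = {}    # group sequence -> writes received so far
        self.sealed = set()  # sequences of groups that arrived completely

    def receive(self, sequence, address, data):
        self.staging.setdefault(sequence, []).append((address, data))

    def seal(self, sequence):
        """Assumed signal that every write of the group has been received."""
        self.sealed.add(sequence)

    def apply_sealed(self):
        """Apply complete groups in order; stop at the first incomplete one."""
        applied = []
        for sequence in sorted(self.staging):
            if sequence not in self.sealed:
                break
            for address, data in self.staging.pop(sequence):
                self.volume[address] = data
            self.sealed.discard(sequence)
            applied.append(sequence)
        return applied


target = StagedTarget()
target.receive(0, 10, b"x")
target.seal(0)
target.receive(1, 10, b"y")  # group 1 is still in flight when disaster strikes
applied = target.apply_sealed()
```

Because a partially received group never reaches the volume in this sketch, discarding the staged remainder after a disaster leaves the volume in the state of the last complete group.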
Assume a group stamping replication service utilizes one write interception component. A first interval is started and a first group of writes is intercepted and logged by the write interception component until the first interval completes. A second interval is then started and a second group of writes is intercepted and logged by the interception component. Meanwhile, the component propagates the first group of writes to a target storage system of a target domain. Where there are two or more writes directed to the same block (address) within the same interval, the write interception component may remove the duplication and send only the most up-to-date write to the target domain (in accordance with a data reduction replication technique). However, if a replication service is implemented that does not reduce such duplication, the write interception component propagates the writes to the target domain in the respective order using, for example, an in-order log or journal on the source domain.
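The removal of duplicate writes within an interval may be illustrated by the following minimal sketch. The function name `reduce_group` and the representation of a write as an `(address, data)` pair are assumptions made for this example.

```python
def reduce_group(writes):
    """Keep only the most up-to-date write for each address in a group.

    `writes` is a list of (address, data) pairs in interception order.
    """
    latest = {}
    for address, data in writes:
        latest[address] = data  # a later write to the same address wins
    return list(latest.items())


group = [(7, b"old"), (3, b"x"), (7, b"new")]
reduced = reduce_group(group)
```

Since each address appears at most once in the reduced group, the relative order of the remaining writes within the interval no longer affects the resulting image.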
A disadvantage of group stamping is that the achievable Recovery Point Objective (RPO) in the case of disaster may never approach zero because of the delay incurred by the writes at the interception component as a result of the interval. As used herein, RPO is defined as the difference (in time) between the time of a disaster and the time at which the source storage space contained a crash image established at the target storage space. For example, assume the smallest interval of a group stamping style replication service is 10 minutes. If a disaster occurs, the target domain is, on average, 5 minutes behind because the disaster does not necessarily occur exactly before the interval completes. Note that it may be impractical to develop a group stamping replication solution with very small intervals.
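The 5-minute figure in the example above can be reproduced under two simplifying assumptions that are not stated in the text: the disaster strikes at a time uniformly distributed within an interval of length T, and every earlier group has already been stored at the target (i.e., propagation delay is ignored). The lag t behind the last completed interval is then uniform on [0, T), so that:

```latex
\[
  \mathbb{E}[t] \;=\; \frac{1}{T}\int_{0}^{T} t \, dt \;=\; \frac{T}{2},
  \qquad
  T = 10\ \text{min} \;\Rightarrow\; \mathbb{E}[t] = 5\ \text{min}.
\]
```

Any propagation delay would add to this lag, so T/2 is a lower bound on the average under these assumptions.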
Often, a source domain configuration having multiple hosts and/or multiple source storage systems may include only one source replication node (i.e., one write interception component) configured to intercept all writes associated with a consistency group. As used herein, a consistency group comprises storage space that requires consistent replication at a target domain. Such a configuration introduces a scalability issue because there is a limit to the processing bandwidth that the interception component can sustain, thereby resulting in potentially substantial adverse impact on performance of the entire configuration. Thus, this configuration may preclude use of a single write interception component.
For example, assume that a large data center is configured with many source storage systems configured to serve many hosts, wherein the source storage systems cooperate to maintain a consistency group. If all write traffic is directed to the single write interception component, a substantial scalability issue arises because the interception component will not practically be able to sustain the entire traffic. Now assume that a consistency group is configured to span multiple geographical site locations such as, e.g., among several small data centers geographically dispersed throughout a country or a plurality of countries. Here, the main reason for not using a single write interception component is not necessarily the scalability issue as much as the substantial latency introduced by such a configuration. This may necessitate either use of smaller consistency groups, which facilitates reliable and consistent group recovery on the target domain, or acceptance of large latencies and performance impact, which is undesirable. Therefore, such configurations may dictate the use of multiple write interception components.
A prior solution provides consistent replication services using group stamping across multiple write interception components through coordination among all write interception components. Here, a coordinator is provided that sends a predetermined message (e.g., a freeze message) to all write interception components when it is desired to complete a previous interval N. Note that the components accumulate writes in a journal, and process (and acknowledge) those writes beginning at the start of the previous interval N. Upon receiving the freeze message, a write interception component “quiesces” all new write activity by, e.g., buffering any new incoming writes without processing or acknowledging those writes. The coordinator then waits until all write interception components respond with freeze acknowledgements. Once the freeze acknowledgments are received from all the write interception components, the coordinator sends a thaw message to each component to thereby start a new interval N+1. In response, the new, buffered incoming writes are processed by the write interception components as part of the new interval.
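The freeze-thaw exchange described above may be sketched as follows. This is an illustrative, single-process model only: the names `Interceptor`, `freeze_all` and `thaw_all` are hypothetical, messages are modelled as direct method calls, and a real coordinator would wait on network acknowledgements.

```python
class Interceptor:
    """Hypothetical write interception component with a per-interval journal."""

    def __init__(self):
        self.interval = 0
        self.frozen = False
        self.journal = {}  # interval number -> processed writes
        self.held = []     # writes buffered while quiesced

    def write(self, address, data):
        if self.frozen:
            # Quiesced: buffer without processing or acknowledging.
            self.held.append((address, data))
            return None
        self.journal.setdefault(self.interval, []).append((address, data))
        return "ack"

    def freeze(self):
        self.frozen = True
        return "freeze-ack"

    def thaw(self):
        # Start interval N+1 and process the writes held during the freeze.
        self.interval += 1
        self.frozen = False
        held, self.held = self.held, []
        for address, data in held:
            self.write(address, data)  # a real system would now ack the host


def freeze_all(interceptors):
    """Send freeze to every component; True once all have acknowledged."""
    acks = [i.freeze() for i in interceptors]
    return all(ack == "freeze-ack" for ack in acks)


def thaw_all(interceptors):
    for i in interceptors:
        i.thaw()


a, b = Interceptor(), Interceptor()
a.write(1, b"p")
b.write(2, b"q")
if freeze_all([a, b]):
    held_ack = a.write(3, b"r")  # arrives mid-freeze: buffered, no ack
    thaw_all([a, b])
```

In this model the write arriving during the freeze is journaled under interval 1 rather than interval 0, which is the property the protocol is intended to guarantee across all components.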
The writes of interval N are then propagated from each write interception component to the target domain. Depending on the actual implementation, the writes of interval N may be differentiated among the components such as, e.g., interval N1 from write interception component 1, interval N2 from write interception component 2, etc. Only after all of the writes of interval N are propagated from all of the write interception components to the target domain is the target domain allowed to start applying them to the target storage space. In order to perform consistent group stamping, the write interception components are typically architected in “shared-nothing” relationships (i.e., between write interception components and storage) to obviate crossing of writes received at different write interception components.
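The rule that the target domain applies interval N only after all components have delivered their portions may be illustrated by the following minimal sketch. The class `TargetAssembler` and the component labels are hypothetical, and the sketch assumes at least one component and a shared-nothing arrangement, so that portions from different components never touch the same address.

```python
class TargetAssembler:
    """Hypothetical target-side gate for multi-component group stamping."""

    def __init__(self, components):
        self.components = set(components)  # assumed non-empty
        self.parts = {}          # interval -> {component: writes}
        self.volume = {}         # target storage space: address -> data
        self.next_interval = 0   # lowest interval not yet applied

    def receive(self, interval, component, writes):
        self.parts.setdefault(interval, {})[component] = writes
        return self.apply_ready()

    def apply_ready(self):
        """Apply intervals in order while every component has delivered."""
        applied = []
        while set(self.parts.get(self.next_interval, {})) == self.components:
            portions = self.parts.pop(self.next_interval)
            # Shared-nothing assumption: portion order does not matter.
            for writes in portions.values():
                for address, data in writes:
                    self.volume[address] = data
            applied.append(self.next_interval)
            self.next_interval += 1
        return applied


t = TargetAssembler({"c1", "c2"})
first = t.receive(0, "c1", [(1, b"a")])   # interval 0 still incomplete
second = t.receive(0, "c2", [(2, b"b")])  # interval 0 now complete
```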
A disadvantage of group stamping across multiple write interception components is that the quiescence penalty is substantial in terms of performance. In particular, the freeze-thaw protocol exchange between a coordinator and a plurality of write interception components is not scalable; i.e., the weakest/slowest point in the coordinator-component interchange sequence dominates. This is because the coordinator has to wait to receive every acknowledgement from every write interception component before the previous interval can conclude and a new interval can start, thereby imposing a potential global penalty for all writes issued by the hosts to the source storage systems. This disadvantage may require placing of restrictions on the locations of the write interception components. For example, the group stamping style approach may be reasonable if the write interception components are placed inside of the source storage systems because (i) there are fewer of these systems than hosts, (ii) the source storage systems are typically not located far from each other and (iii) such an arrangement enables more control over the behavior of the components.
Yet another problem that limits scalability of group stamping across multiple write interception components arises when a write interception component does not respond to the freeze-thaw protocol. In such a situation, the coordinator is stalled and cannot progress until all acknowledgments are received from all components. In addition, implementation of recovery procedures associated with such a situation (such as timeouts, etc.) may be complex.