In the information age, the importance of keeping data on-line at all times is becoming paramount. The need for Business Continuance (BC) and fast data recovery is acute and well-acknowledged. One solution to this problem is remote data replication (or remote mirroring). Remote mirroring can avoid or reduce data loss during site-wide disasters. It is also possible to guarantee continuous data access in the presence of site-wide failures by providing hot stand-by hosts and applications at the remote site and directing clients to the remote site when the primary site encounters a failure. Remote data replication comes in two flavors: synchronous and asynchronous. Only synchronous remote mirroring can avoid data loss during site-wide disasters, since a write from a calling application is not considered complete until the data is written successfully to both the local site and the remote site. However, this imposes a performance penalty on the applications. In asynchronous remote mirroring, a write is considered complete as soon as it is written to the local site. The updates are subsequently sent to the remote site. Thus, in a site-wide disaster, any updates still waiting to be sent to the remote site are lost. An appliance-based architecture for remote mirroring is gaining popularity because it offers the performance of asynchronous mirroring and almost the protection of synchronous mirroring.
In such an architecture, data is stored and accessed by applications running in the primary site. Primary hosts are defined to be the collection of hosts in the site that collectively serve all the I/O requests of the applications. On each one of these primary hosts an intercept agent is installed. These intercept agents collect and report all updates to the staging agent running in an appliance as shown in FIG. 1. The appliance could be connected to the local network (e.g. LAN) or could even be a few miles away (e.g. MAN). It is the duty of the staging agent to receive all updates from all the intercept agents in the primary site, keep them temporarily as local persistent logs which are then sent periodically to the backup agent. The backup agent runs on a remote site. It maintains a copy of the primary site's data by keeping it up-to-date with the updates as and when they are sent by the staging agent. The backup system components in such an architecture refer to the appliance and the remote backup hosts.
Because the appliance is close to the primary site, replication between the primary site and the appliance can be done synchronously without adding significant performance penalty to applications. The replication between the appliance and the remote backup site can be done asynchronously. The staging agent first logs the request received from the intercept agent in a persistent log. The application request can return as long as the request is done successfully on the primary host and is logged in the persistent log in the staging agent. In the background, the staging agent processes the persistent log, and batches multiple updates into large messages before sending them to the backup agent at the remote site. This significantly improves network utilization and hence reduces the overall replication cost. The overall architecture combines the benefits from both synchronous and asynchronous mirroring without adding significant drawbacks. Several vendors have built systems with such an architecture [5, 3]. Under this architecture, the persistent log at the staging agent and the secondary data copy at the remote backup site form a complete replica of the primary site's data copy.
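The two paths through the staging agent described above, synchronous logging followed by asynchronous batched shipping, can be illustrated with a minimal sketch. It is not the implementation of any vendor's system: the names StagingAgent, handle_update, drain and send_batch are hypothetical, an in-memory list stands in for the persistent log, and an update is assumed to be a (block number, data) pair.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Assumed update format: (block number, data). Not taken from the source.
Update = Tuple[int, bytes]


@dataclass
class StagingAgent:
    """Toy model of a staging agent: log synchronously, ship asynchronously."""

    # Hypothetical link to the backup agent at the remote site.
    send_batch: Callable[[List[Update]], None]
    batch_size: int = 3
    # An in-memory list stands in for the persistent log.
    log: List[Update] = field(default_factory=list)

    def handle_update(self, update: Update) -> bool:
        # Synchronous path: the application's write may return once the
        # update has been appended to the log.
        self.log.append(update)
        return True

    def drain(self) -> int:
        # Background path: ship logged updates in batches, oldest first,
        # and return the number of batches sent.
        batches = 0
        while self.log:
            batch = self.log[:self.batch_size]
            self.send_batch(batch)
            # Entries are removed only after the send returns without error.
            del self.log[:len(batch)]
            batches += 1
        return batches
```

In this sketch, seven logged updates with a batch size of three leave the agent in three messages rather than seven, which is the network utilization benefit the batching is meant to provide. Whatever remains in the log at any moment is exactly the data the remote site has not yet received.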
Note that this replication solution does not lose data if the primary site disaster does not affect the appliance. This would be the most probable case if the appliance was on a MAN a few miles away. However, if both the primary site and the appliance face a disaster at the same time, then some data can be lost as the remote site is only updated asynchronously and might be missing some updates. In the worst case, the amount of missing updates is equal to the amount of updates in the persistent log in the appliance. This makes the guarantee of the architecture weaker than the traditional synchronous mirroring guarantee. However, this architecture covers a wide variety of failure cases while offering significant cost and performance advantages over synchronous mirroring. A recent field study shows that only 3% of the failure cases that incurred data loss and system downtime were caused by site-wide disasters [2]. Thus, the appliance-based architecture works well for 97% of the failure cases, and even for some fraction of the site disasters that do not involve the given appliance. Hence, it is an attractive alternative to support efficient remote mirroring.
Given this architecture, recovery from primary site failures is straightforward. One can simply switch to the remote site. The remote site must wait for all pending log requests in the appliance to be processed before serving any new requests. However, in the face of an appliance failure, the persistent log may no longer be available, hence some portion of the secondary data copy is lost. Unless some special techniques are used, recovery from the appliance failure can be extremely expensive. In the worst case, the entire data stored at the primary site may have to be copied to the remote site. In certain cases, it may be less costly to compare the primary site's data with the remote site's data and resynchronize only the data that differ between the two than to do a complete data copy. However, the comparison itself requires reading the entire data set at both sites. If checksums of data blocks are used to identify differences, both sites must compute checksums as well. Clearly, not only does this have a significant cost in terms of network bandwidth during data resynchronization, but it also potentially degrades the primary host application's performance for a long time. Furthermore, it may place the overall system in an unprotected mode for a long time if the primary site is not taken offline. Similarly, if the remote backup site encounters a failure, and is recovered from a tertiary backup (probably a tape library), then again the worst case would be to compare the entire primary site's data with the backup site's data, and resynchronize the differences. Assume that there is always a potentially out-of-date backup copy available even after backup system component failures. This is true for the appliance failures, since the data copy at the backup site is an out-of-date backup copy. For the remote backup site failures, assume that there is always a tape backup which can be used to restore the backup site to a certain point of time.
To bring this secondary data copy, also called the backup data copy, to a state equivalent to that of the primary data, all updates made at the primary site since the point of time of the backup must be resynchronized. What is needed is a solution which addresses the problem of minimizing this resynchronization time after failures in the appliance and/or the remote backup site.
The potentially long resynchronization time after backup system component failures, such as appliance and backup site failures, is problematic. The long resynchronization time is due to the fact that the primary system does not keep track of what data must be resynchronized when one or more backup system components fail. The simplest and often slowest way of resynchronizing is to compare the entire data sets in both the primary and backup sites exhaustively and apply the differences to the backup. If the amount of data in the primary site is large, this process can be extremely slow. If the primary host knows what data must be resynchronized after failures, then only those data sets need to be recovered from the primary copy to the backup copy. For instance, in the case of the appliance failure, the only data that needs to be recovered is the data that was in the persistent log in the staging agent. Similarly, if the remote site fails, the remote host recovery process can first restore the remote site to the last tape backup. After that, only the updates that have been done since that tape backup must be recovered from the primary site. If such differences can be easily identified, recovery of the backup system components will not be very expensive. In general, the difference between the two versions of data copies can range from a few seconds' worth of updates to many hours' or even days' worth of updates, depending on the deployment scenario.
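To make the cost of the exhaustive approach concrete, the following minimal sketch compares two data copies block by block using checksums. The function names, the in-memory byte strings standing in for volumes, the tiny block size, and the choice of SHA-256 are all assumptions made for illustration. Note that every block of both copies is read and hashed, regardless of how few blocks actually differ.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4  # deliberately tiny, for illustration only


def block_checksums(volume: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    # Reads and hashes every block of the volume: this full scan is the
    # expensive step, and each site has to perform it on its own copy.
    return [hashlib.sha256(volume[i:i + block_size]).digest()
            for i in range(0, len(volume), block_size)]


def differing_blocks(primary: bytes, backup: bytes,
                     block_size: int = BLOCK_SIZE) -> List[int]:
    # Returns the indices of blocks that must be resynchronized.
    p = block_checksums(primary, block_size)
    b = block_checksums(backup, block_size)
    n = max(len(p), len(b))
    # A block present on only one side counts as different.
    return [i for i in range(n)
            if i >= len(p) or i >= len(b) or p[i] != b[i]]
```

Only the checksums need to cross the network in this scheme, but the scan time still grows with the total size of the data set rather than with the amount of changed data, which is why tracking the changed data at the primary site is attractive.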
One way to track such differences between the primary data copy and the data on the remote site and the appliance is by using the point-in-time snapshot capability on the primary and the remote sites. The idea is to let the primary hosts take periodic snapshots. The remote site also keeps snapshots but lags behind in the updates. When the appliance fails, the remote hosts can find out the latest snapshot for which they have received all updates. All changes made since that snapshot form a superset of the changes that were in the appliance's persistent log when it failed. As long as the primary host has an efficient way of identifying the changes made since the snapshot indicated by the remote site, it suffices to send only those changes to ensure a complete replica at the remote site. Similarly, if the remote site fails, it is first restored to the last tape backup. Assuming that the last tape backup corresponds to some snapshot N, the data that needs to be recovered is the set of changes made since snapshot N. The primary host can use snapshot information to identify the changed data, and hence recover only a subset of the data instead of entire volumes or file systems.
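The snapshot-based idea can be sketched as follows, under the assumption that the primary host records which blocks were written between consecutive snapshots. The class SnapshotChangeTracker and its methods are hypothetical and do not correspond to any particular snapshot implementation; a real system would derive this information from snapshot metadata.

```python
from typing import Dict, Set


class SnapshotChangeTracker:
    """Toy model: remember which blocks changed after each snapshot."""

    def __init__(self) -> None:
        # Snapshot 0 is the assumed initial state of the primary copy.
        self.current = 0
        # snapshot id -> blocks written after that snapshot and before the next
        self.changed: Dict[int, Set[int]] = {0: set()}

    def write(self, block: int) -> None:
        self.changed[self.current].add(block)

    def take_snapshot(self) -> int:
        self.current += 1
        self.changed[self.current] = set()
        return self.current

    def changes_since(self, snapshot_id: int) -> Set[int]:
        # Blocks to resend when the backup is known to be complete up to
        # snapshot_id (for example, snapshot N of the last tape backup).
        if snapshot_id not in self.changed:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        result: Set[int] = set()
        for sid in range(snapshot_id, self.current + 1):
            result |= self.changed[sid]
        return result
```

If the remote site reports that snapshot 1 is the latest one for which it holds all updates, the primary resends changes_since(1). That set may include blocks the remote site had already received after snapshot 1, which is the sense in which it is a superset of the lost log contents.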
Although the above approach significantly reduces the data resynchronization times, it requires the primary host to have appropriate snapshot capabilities, thus creating a system software dependency. The remote backup also needs to be aware of snapshots and capable of utilizing that feature. Further, it imposes a requirement on the snapshot scheduling on the primary site to facilitate quick resynchronization of the backup site or the appliance when a backup component fails. Even with snapshots, the primary site must be able to quickly identify the set of changes made since a given snapshot that the backup is up-to-date with. To facilitate this, the software should avoid a complete snapshot metadata scan as it can be very expensive and performance degrading to host applications. However, schemes that avoid snapshot metadata scan often introduce performance penalties for the primary host data processing. Depending on the complexity of the software, in some cases, the host applications may have to be stopped for the scan to complete. Network Appliance's filers [8] use snapshots for such failure cases. However, such architectures depend on uniformity of software or appliances being used in the primary and the backup sites.
Several methods have been deployed to alleviate the above problem by keeping track of updates in logs at the primary site in a way other than snapshots. The records in such logs indicate what data has been changed so that only such data needs to be resynchronized when backup system components are recovered. However, even these solutions have a problem because they do not work well with bounded resources. Once the space allocated for the logs is full, these algorithms either force all applications to stop generating further updates (thus causing downtime), or stop accumulating the log, thus exposing the system to data loss if the primary site encounters a failure during the long and painful process of comparing the primary and backup versions exhaustively.
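The two overflow behaviors described above can be illustrated with a minimal sketch of a fixed-capacity change log. All names here (BoundedChangeLog, OverflowPolicy, LogFull) are hypothetical, and the log simply records block numbers; the sketch illustrates the limitation, not a remedy for it.

```python
from enum import Enum
from typing import List


class OverflowPolicy(Enum):
    BLOCK_WRITES = "block"    # refuse further updates (application downtime)
    STOP_TRACKING = "stop"    # give up logging (exhaustive comparison later)


class LogFull(Exception):
    """Raised when the log is full and the policy is to block writers."""


class BoundedChangeLog:
    def __init__(self, capacity: int, policy: OverflowPolicy) -> None:
        self.capacity = capacity
        self.policy = policy
        self.records: List[int] = []
        self.tracking = True

    def record(self, block: int) -> None:
        if not self.tracking:
            return  # the change goes untracked; the log is now incomplete
        if len(self.records) >= self.capacity:
            if self.policy is OverflowPolicy.BLOCK_WRITES:
                raise LogFull("no log space left; update rejected")
            self.tracking = False
            return
        self.records.append(block)

    def needs_full_comparison(self) -> bool:
        # Once tracking has stopped, the log can no longer bound the set of
        # changed data, so recovery falls back to exhaustive comparison.
        return not self.tracking
```

With a capacity of two records, the third update either raises LogFull or silently switches the log into a state where needs_full_comparison() returns True, mirroring the two undesirable outcomes.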
Therefore, there remains a need for an efficient resynchronization method to deal with a wide variety of backup system component failures.