1. Field of the Invention
The present invention relates, in general, to data storage and back-up solutions that allow recovery of replicated data, and, more particularly, to software, hardware, and computer systems for providing improved write order control in a data storage system that implements asynchronous replication to provide data protection for a host or primary site with data targets located at a remote or secondary site.
2. Relevant Background
In the data storage industry, the need for effective and reliable backup and archiving of data or information is well known and is becoming increasingly important. The term “backup” generally means that a copy of data written to a host or to an application's data volume (e.g., writes or changes to previously stored data) is sent to a remote or secondary site to allow recovery of the data if the data stored at the host or primary site fails. This backup operation usually involves a transfer of data over a digital communications network to disk storage, such as a redundant array of inexpensive disks (RAID) system, and/or to magnetic tape. If the storage resource is thereafter lost or becomes unavailable on account of equipment failure or for any other reason (e.g., a virus strikes or a personnel error causes a crash), the backup data can then be used to reconstruct the state of the information in host or primary storage.
More specifically, enterprise applications such as file system and database applications often handle large quantities of data, and it is important that the storage of this data be handled and managed such that it can be reliably recovered if a failure causes a crash of the application or a primary data storage device. Data recovery programs or applications are often provided that make the applications and the data storage system crash tolerant. To support such data recovery applications or algorithms, changes to the application data need to be written in a well-defined order, which may be controlled by the enterprise or host application. If replication is deployed to a remote or secondary site, any changes to the replica or target data or volume need to be applied in the same order as they were applied at the host or primary site. Then, the application can reliably recover the data from the replica or copy. If order is not preserved, data integrity may be compromised, and problems or data corruption may occur during attempts to use the inconsistent data. Further, it should be understood that application writes have well-defined sequences. For instance, a file system often completes metadata updates before the data is written. A database writes the data and then issues a write to commit. This serialization and parallelism of writes must be maintained to ensure that the resulting data pattern on the secondary or remote site is proper. Synchronous replication ensures that the write order set by the requesting application is maintained because each write is completed at both the primary and secondary sites as it is requested. The application cannot proceed to the next write until it has received a write acknowledgement from both sites. However, synchronous replication often results in delays and application inefficiencies because each write must wait for the transfer of data over the network to the remote, secondary site.
Asynchronous replication was introduced to meet the need for longer-distance replication to secondary sites in a manner that addresses network latency and allows the application or host to operate at the speed of local data storage. Asynchronous replication decouples the write to the remote site from the write to the primary site (e.g., to a volume(s) in a local data storage device or to local disk devices), which allows the enterprise or primary application to function at local write speeds. Changes are then picked up from a local queue and asynchronously written to a target at a remote site. In this configuration, the application is able to perform as if replication were not in use, but the state of the target or replica of the application data lags behind by one or more writes depending upon the write volume, network latency, and other variables. Decoupling the source and backup writes, therefore, creates the problem that, without further control or replication management, the write ordering at the remote site is not automatically preserved, which makes recovery problematic.
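The decoupling described above can be illustrated with a minimal sketch. It is not taken from any particular product; the class name AsyncReplicatedVolume, its methods, and the in-memory dictionaries standing in for the primary volume and the remote target are all hypothetical. The sketch shows only that a host write completes after local work and that the replica trails the primary by the number of queued writes.

```python
from collections import deque


class AsyncReplicatedVolume:
    """Hypothetical primary volume with an asynchronous replication queue."""

    def __init__(self):
        self.primary = {}     # block address -> data at the primary site
        self.replica = {}     # block address -> data at the remote site
        self.queue = deque()  # local queue of changes not yet replicated

    def write(self, block, data):
        # The host write completes at local speed: update the primary
        # volume and queue the change. The remote site is not contacted.
        self.primary[block] = data
        self.queue.append((block, data))

    def drain(self, max_writes):
        # Stand-in for the background replicator: ship up to max_writes
        # queued changes, oldest first, and apply them to the replica.
        shipped = 0
        while self.queue and shipped < max_writes:
            block, data = self.queue.popleft()
            self.replica[block] = data
            shipped += 1
        return shipped

    def lag(self):
        # Number of writes by which the replica trails the primary.
        return len(self.queue)


vol = AsyncReplicatedVolume()
vol.write(0, "metadata-v1")
vol.write(1, "data-v1")
vol.write(0, "metadata-v2")
vol.drain(max_writes=1)
print(vol.lag())     # 2
print(vol.replica)   # {0: 'metadata-v1'}
```

In this sketch the queue is drained oldest first, so order happens to be preserved; a real system that ships queued writes over several network paths would need additional control to keep that order at the remote site.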
Typically, write order is preserved in data storage systems that implement asynchronous replication by using serialization at the remote site. For example, in asynchronous replication mode, host or application writes at a primary or host site are issued locally, concurrently or in parallel, both to a data volume in local data storage devices and to an asynchronous (“async”) log or local queue. A host write is acknowledged to the host only after both of these local writes are completed. Serialization then attempts to preserve the write order on the remote site by requiring that the content of the async log be written at the remote site in the order in which it was logged. Serialization systems have tried to address latency issues, and in some systems, network or link latency is addressed by sending the logged or queued writes to the remote site via the network or link in parallel rather than in order or serially. Then, to guarantee proper write ordering at the remote site, the received and remotely queued writes are issued serially to target volumes or to replicas of the host or enterprise application data. In other words, to be certain about this ordering, only one write can be issued at the remote site at a time. As would be expected, when writes are issued serially, performance at the target or remote site may be significantly limited. This is especially true if the remote volume is complex, such as one made up of multiple arrays, or if there is contention in the underlying storage array. Serialization at the remote site controls write ordering but results in ongoing lags or delays in updating the replicas. The replica or copy of the application data therefore has a differing state (i.e., it does not reflect all changes to the primary data), and the resulting backlog may cause excessive data to build up in the async log at the primary site, possibly exceeding the log's capacity unnecessarily.
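The serialization approach described above can be sketched as follows. This is an illustration under stated assumptions rather than a description of any existing system: the class SerializingTarget, the use of integer sequence numbers to record log order, and the block names are all hypothetical. Logged writes arrive at the remote site in an arbitrary order, as they might when sent in parallel, and the target buffers them and issues them one at a time in log order.

```python
import random


class SerializingTarget:
    """Hypothetical remote site that issues received writes in log order."""

    def __init__(self):
        self.volume = {}    # block -> data on the target volume
        self.pending = {}   # sequence number -> (block, data), not yet issued
        self.next_seq = 0   # sequence number of the next write to issue
        self.applied = []   # sequence numbers in the order they were issued

    def receive(self, seq, block, data):
        # Buffer the arriving write, then issue every write that is now
        # contiguous with those already issued, strictly one at a time.
        self.pending[seq] = (block, data)
        while self.next_seq in self.pending:
            blk, dat = self.pending.pop(self.next_seq)
            self.volume[blk] = dat
            self.applied.append(self.next_seq)
            self.next_seq += 1


# Async log at the primary site: (sequence number, block, data).
async_log = [
    (0, "meta", "m1"),
    (1, "data", "d1"),
    (2, "commit", "c1"),
    (3, "meta", "m2"),
]

# Assumed stand-in for parallel transmission: arrival order is arbitrary.
in_flight = list(async_log)
random.Random(7).shuffle(in_flight)

target = SerializingTarget()
for seq, block, data in in_flight:
    target.receive(seq, block, data)

print(target.applied)  # [0, 1, 2, 3]
```

Because a write cannot be issued until every earlier write has been issued, a single delayed write holds back all later ones, which is the performance limitation noted above.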
As a result, existing methods of controlling write order in data storage systems that implement asynchronous replication do not meet the needs of enterprises and others. There is a need for improved techniques of managing writes at primary and secondary sites (or host and remote sites) to control write ordering while better controlling latency between writes at a primary site and a secondary site. Preferably, such techniques would have lower latencies or delays than are presently provided by storage methods that use serialization at the remote site.