The present invention pertains to the field of Disaster Recovery. Disaster Recovery is usually implemented as a protected service operating on a store of data. An external entity (referred to as the Disaster Recovery system) usually transfers the service and the protected data to a place where they can continue operation in the event that a failure occurs in the original service exporting entity. An implementation common in the art protects the data using replication. The service protection is then achieved separately by another system; however, the exact mechanism by which this protection operates is not relevant to the present invention.
FIG. 1 illustrates a process for replicating data such as is well known in the art. A primary 501, with its associated data set on permanent storage 502, is connected to a network 503. Network 503 is routed onto the Internet 504, which ultimately connects to a different network 505. The Internet 504 is depicted in the figure as a Wide Area Network (WAN Cloud) but may be any network type; the primary and replica data sets may even be located within a single computer system and thus not require an external network connection. The replica 506 also has a storage device 507 for receiving a replica of the data set and is connected to network 505. Thus, a write to the data set on the primary storage 502 may be encapsulated into a network datagram and sent over networks 503, 504 and 505, where it is received at the replica, unencapsulated and sent down to the replica data set on storage 507. This operation is functionally equivalent to direct replication 508 from the primary data set on 502 to the replica data set on 507.
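By way of illustration only, the encapsulation of a write and its application at the replica might be sketched as follows. The wire format (an 8-byte offset and a 4-byte length followed by the payload) and the names `encapsulate`, `unencapsulate` and `apply_write` are assumptions made for this sketch and are not taken from FIG. 1; the network transport itself is omitted, and in-memory byte arrays stand in for storage 502 and 507.

```python
import struct
from typing import Tuple

# Hypothetical wire format (an assumption for this sketch):
# 8-byte offset, 4-byte payload length, both in network byte order.
HEADER = struct.Struct("!QI")


def encapsulate(offset: int, data: bytes) -> bytes:
    """Wrap a write (location plus contents) into a single datagram."""
    return HEADER.pack(offset, len(data)) + data


def unencapsulate(datagram: bytes) -> Tuple[int, bytes]:
    """Recover the location and contents of a write from a datagram."""
    offset, length = HEADER.unpack_from(datagram)
    payload = datagram[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated datagram")
    return offset, payload


def apply_write(storage: bytearray, offset: int, data: bytes) -> None:
    """Apply a write to a data set held in memory."""
    storage[offset:offset + len(data)] = data


# In-memory stand-ins for the primary and replica data sets.
primary = bytearray(64)
replica = bytearray(64)

# A write lands on the primary, is encapsulated, and is replayed on the replica.
apply_write(primary, 16, b"hello")
datagram = encapsulate(16, b"hello")
offset, payload = unencapsulate(datagram)
apply_write(replica, offset, payload)
```

After the write has been replayed, the two byte arrays hold identical contents, which corresponds to the functional equivalence with direct replication noted above.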
Such a replication mechanism may form the nucleus of a Disaster Recovery system where a process (of which many exist in the art) transfers the roles of primary and replica when a disaster strikes the primary.
Since the replica is tracking the updates to the primary, there are many times during the operation of replication where data has changed on the primary, but this change has not yet been committed by the replica. These changes are recorded in a log on the primary. If something happens to the primary, the resulting tear down of the replication system may mean that some of the changes never make it to the replica.
Such replication tear downs may simply be transient: the result of interrupted communications (a not uncommon occurrence even with modern networks), or may be fatal: caused by some type of disaster at the primary. In the event of a transient tear down followed by a subsequent restoration of the connection, the primary and replica need some way to transmit just the subset of the data that has changed in the interim (since transmission of the full data set is usually prohibitively expensive).
The method of tracking only the subset of changes most commonly used in the art is that of logging. There are two well known logging types: transaction and intent. In a transaction log, the data that has changed, along with its location information, is recorded in a time ordered log. When the replica acknowledges that the data is safely stored, its corresponding log entry is erased. In an intent log, only the location of the data, not its contents, is recorded (the record again being erased when acknowledgement from the replica is received). Intent logs tend to be allocated as fixed size entities (with one dirty indicator for a given unit of data, called a chunk), and the records they maintain are not time ordered. In either case, following the restoration of a previously torn down replication session, the log may be consulted to determine the set of data that needs to be sent to the replica to bring it up to date. Each logging type has benefits and disadvantages; however, this is not pertinent to the present invention and will not be discussed further.
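The difference between the two logging types may be easier to see in a minimal sketch. The class names, method names and chunk size below are hypothetical and chosen only for illustration. The intent log here clears a dirty indicator on a single acknowledgement, which is a simplification that ignores several writes being in flight to the same chunk at once.

```python
from collections import OrderedDict


class TransactionLog:
    """Time ordered record of (location, data) pairs awaiting acknowledgement."""

    def __init__(self):
        self._entries = OrderedDict()  # sequence number -> (offset, data)
        self._next_seq = 0

    def record(self, offset, data):
        seq = self._next_seq
        self._next_seq += 1
        self._entries[seq] = (offset, bytes(data))
        return seq

    def acknowledge(self, seq):
        # The replica has stored this write, so its entry is erased.
        self._entries.pop(seq, None)

    def pending(self):
        # Writes still to be sent, oldest first.
        return list(self._entries.values())


class IntentLog:
    """Fixed size set of dirty indicators, one per chunk; no data, no ordering."""

    def __init__(self, data_size, chunk_size):
        self.chunk_size = chunk_size
        chunk_count = (data_size + chunk_size - 1) // chunk_size
        self._dirty = [False] * chunk_count

    def record(self, offset, length):
        first = offset // self.chunk_size
        last = (offset + length - 1) // self.chunk_size
        for chunk in range(first, last + 1):
            self._dirty[chunk] = True

    def acknowledge(self, chunk):
        self._dirty[chunk] = False

    def pending(self):
        # Chunks whose current contents must be re-read and sent.
        return [i for i, dirty in enumerate(self._dirty) if dirty]


tlog = TransactionLog()
first_write = tlog.record(0, b"abc")
second_write = tlog.record(4096, b"xyz")
tlog.acknowledge(first_write)

ilog = IntentLog(data_size=16384, chunk_size=4096)
ilog.record(4000, 200)   # spans the boundary between chunks 0 and 1
ilog.record(12288, 10)   # falls entirely within chunk 3
ilog.acknowledge(0)
```

In this sketch the transaction log retains the contents of the unacknowledged write, whereas the intent log retains only the indices of the chunks that remain dirty; on reconnection the latter requires the chunk contents to be read again from storage before they can be sent.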
In the event of a fatal tear down of replication, an external disaster recovery mechanism may pick up operation of the service on the replica by reversing the roles of primary and replica. During such operation, the replica becomes the new primary (because it must now alter the data that it merely tracked before in order to maintain the operation of the external service which relies on the said data). Ultimately, the original primary may be restored to service; however, because the service is now being exported from the new primary (the original replica), the data must be synchronised between the new and original primaries before operation of the service may be transferred back to the original primary.
The current state of the art for transferring service back to the original primary (reversing the roles once more) is to require a complete transfer of data from the new primary to the original primary. The network utilisation may be reduced by comparing cryptographic checksums of the pieces of the data on the primary and the replica to see if they agree, rather than blindly transmitting all the data.
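The checksum comparison mentioned above might be sketched as follows. The choice of SHA-256, the chunk size and the function names are assumptions made for this illustration and do not describe any particular product. Note that every piece of the data must still be read and hashed on both sides; only the amount of data transmitted is reduced.

```python
import hashlib


def chunk_digests(data, chunk_size):
    """Return one cryptographic checksum per fixed size piece of the data."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]


def chunks_to_send(source, target_digests, chunk_size):
    """Indices of the pieces whose checksums disagree (or are absent) on the target."""
    source_digests = chunk_digests(source, chunk_size)
    return [
        i
        for i, digest in enumerate(source_digests)
        if i >= len(target_digests) or digest != target_digests[i]
    ]


# In-memory stand-ins: the new primary has diverged from the original primary
# in a single byte while the original primary was out of service.
original_primary = bytes(b"A" * 16)
new_primary = bytearray(original_primary)
new_primary[5] = ord("B")

stale = chunks_to_send(bytes(new_primary), chunk_digests(original_primary, 4), 4)
```

Here only the piece containing the altered byte is identified for transmission, although all four pieces on each side had to be checksummed to establish this.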