The present invention relates generally to the art of synchronising copies of data in real time between dispersed computer systems. The technique is applicable whether the computers are dispersed over large distances or replicating over short local networks, and applies even to a single computer utilising a RAID-1 subsystem.
In the field of reliable computing, one particular need is to keep a duplicate or backup copy (called a replica) of a particular data set in such a way that the replica is always an exact (or just slightly out of date) copy of the primary data set. In a synchronous replication environment, the replica is always exact; in an asynchronous one, the replica may be out of date with respect to the primary by at most a pre-determined amount.
FIG. 1 illustrates a replication setup having a primary 101, with its associated data set on permanent storage 102, which is connected to a network 103. Network 103 is routed onto the internet 104, which ultimately connects to a different network 105. The replica 106 also has a storage device 107 for receiving a replica of the data set and is connected to network 105. Thus, a write to the data set on the primary storage 102 may be encapsulated into a network datagram and sent over networks 103, 104 and 105, where it is received at the replica, unencapsulated and sent down to the replica data set on storage 107. This operation is functionally equivalent to direct replication 108 from the primary data set on 102 to the secondary data set on 107.
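The encapsulation step described for FIG. 1 can be sketched as follows. This is a minimal illustration only: the wire format (an 8-byte offset, a 4-byte length, then the payload), the function names and the in-memory `bytearray` standing in for storage 102 and 107 are all assumptions made for the example and are not taken from the text.

```python
import struct

# Hypothetical wire format (an assumption for illustration):
# 8-byte offset and 4-byte payload length in network byte order, then the payload.
HEADER = struct.Struct("!QI")

def encapsulate(offset: int, data: bytes) -> bytes:
    """Wrap one write (offset plus data) into a datagram payload."""
    return HEADER.pack(offset, len(data)) + data

def unencapsulate(datagram: bytes):
    """Recover the offset and data of a write from a datagram payload."""
    offset, length = HEADER.unpack_from(datagram)
    data = datagram[HEADER.size:HEADER.size + length]
    if len(data) != length:
        raise ValueError("truncated datagram")
    return offset, data

def apply_write(storage: bytearray, offset: int, data: bytes) -> None:
    """Apply a write to a data set modelled as an in-memory bytearray."""
    storage[offset:offset + len(data)] = data

primary_storage = bytearray(16)  # stands in for storage 102
replica_storage = bytearray(16)  # stands in for storage 107

# The write is applied on the primary and also sent to the replica.
apply_write(primary_storage, 4, b"abcd")
datagram = encapsulate(4, b"abcd")  # would travel over networks 103, 104 and 105
apply_write(replica_storage, *unencapsulate(datagram))
```

After the last line the two byte arrays hold the same contents, which is the sense in which the networked path is functionally equivalent to direct replication 108.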
When a communications failure occurs between the primary and its replica, the primary continues processing the data set but the replica is frozen until it can re-establish communications with the primary. When communications are re-established, the replica must be brought up to date again (preferably without too much impact on the primary). Since the data set being processed may be much larger than the data actually changed while the primary and replica were out of contact, it is advantageous to transmit to the replica only the changes necessary to bring it up to date with respect to the primary. Two methods are usually used for keeping track of the changes between a primary and its replica:
i. Transaction Logging
Every write made to the primary data set is recorded separately in an ordered log, called the transaction log, whose entries alone are sufficient to recreate the original writes. The same write is also sent (if possible) to the replica. When the write completes on the replica, a signal is sent back to the primary, and the primary then removes the entry for that write from its transaction log (this removal does not have to be done instantaneously). When contact with the replica is lost, the transaction log fills up because no completion signals are received. As soon as contact is restored, the transaction log can be replayed to the replica in order, from oldest write to newest; new transactions may still be accepted while the replay is going on. The great advantage of a transaction log is that, because the replay sends every write in the correct order, the replica is always an exact (but out of date) copy of the primary while the replay is in progress. Thus, even if there is a failure during the log replay, the replica is still usable as an (out of date) copy of the primary. The great disadvantage of a transaction log is that it must be of finite size, whereas the amount of data it is asked to hold is not bounded. Since it records every transaction, three writes of a kilobyte each occupy over three kilobytes in the transaction log, even when they overwrite the same data, because they must all be recorded in the correct order. The log therefore keeps growing for as long as the primary is processing data but is out of contact with the replica. When a transaction log runs out of space, a condition called log overflow, the primary has no choice but to send its entire data set to the replica when contact is re-established. This may take a considerable amount of time and, further, the replica is a corrupt copy of the primary until the resynchronisation is completed.
Because the operator of the replication system must set the transaction log to a maximum finite size, sizing the log so that overflow is avoided in most situations becomes an issue.
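The behaviour of a transaction log, including log overflow, can be illustrated with a short sketch. The class and method names are hypothetical, the data sets are modelled as in-memory byte arrays, and two simplifications are assumed: completion signals are treated as arriving in order, and the per-entry bookkeeping overhead (the reason three one-kilobyte writes occupy "over" three kilobytes) is ignored.

```python
from collections import deque

class LogOverflow(Exception):
    """Raised when the log cannot hold another write."""

class TransactionLog:
    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used = 0            # payload bytes held (entry overhead ignored)
        self.entries = deque()   # (sequence number, offset, data), oldest first
        self.next_seq = 0

    def record(self, offset: int, data: bytes) -> int:
        """Append a write to the log, or signal log overflow."""
        if self.used + len(data) > self.max_bytes:
            raise LogOverflow("full resynchronisation required")
        seq = self.next_seq
        self.next_seq += 1
        self.entries.append((seq, offset, data))
        self.used += len(data)
        return seq

    def acknowledge(self, seq: int) -> None:
        """Drop entries up to and including seq (assumes in-order completion)."""
        while self.entries and self.entries[0][0] <= seq:
            _, _, data = self.entries.popleft()
            self.used -= len(data)

    def replay(self, send) -> None:
        """Send every logged write, oldest first, acknowledging each in turn."""
        for seq, offset, data in list(self.entries):
            send(offset, data)
            self.acknowledge(seq)

log = TransactionLog(max_bytes=4096)
primary = bytearray(2048)
replica = bytearray(2048)

def primary_write(offset: int, data: bytes) -> None:
    primary[offset:offset + len(data)] = data
    log.record(offset, data)

def send_to_replica(offset: int, data: bytes) -> None:
    replica[offset:offset + len(data)] = data

# Contact is lost: three 1 KiB writes to the same region are all logged.
for fill in (b"a", b"b", b"c"):
    primary_write(0, fill * 1024)
used_while_disconnected = log.used

# Contact is restored: replaying in order brings the replica up to date.
log.replay(send_to_replica)
```

In this sketch the three overlapping writes occupy 3072 bytes of log space even though only 1024 bytes of the data set differ, which is why a log of any fixed size can eventually overflow during a long outage.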
ii. Intent Logging
The concept of an intent log is predicated on the assumption that the data set can be segmented into chunks (called clusters), which the replication system must also use as its basic unit of transmission. Often, the clusters correspond to indivisible quantities in the data set such as the file system block size or the underlying disc sector size. When a write occurs on the primary, the clusters it covers are identified and marked as dirty in a bitmap of clusters covering the entire data set (the intent log). When the write completion signal is sent back from the replica, the corresponding dirty bits are cleared from the intent log (this bit clearing does not have to be done instantaneously). If contact is lost with the replica, the intent log continues to keep a record of all the clusters dirtied. When contact is restored, the log is replayed, sending only the dirty clusters to the replica to update it. However, since the intent log contains no concept of ordering, the replica is corrupt until the replay is complete. The great advantage of an intent log is that it is of a finite and known size and can never overflow. This property makes it particularly effective for geographically dispersed systems, where communications failures can be prolonged because of the distances and the sheer number of intermediate networks that must be traversed to copy data between the sites.
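An intent log of the kind described can be sketched as a bitmap with one bit per cluster. The class name, the method names, the 512-byte cluster size and the in-memory byte arrays are assumptions chosen for illustration; writes are assumed to have a length greater than zero.

```python
class IntentLog:
    def __init__(self, data_set_size: int, cluster_size: int):
        self.cluster_size = cluster_size
        # Number of clusters, rounded up, and one bit per cluster.
        self.n_clusters = -(-data_set_size // cluster_size)
        self.bitmap = bytearray((self.n_clusters + 7) // 8)

    def mark_write(self, offset: int, length: int) -> None:
        """Mark every cluster touched by a write as dirty."""
        first = offset // self.cluster_size
        last = (offset + length - 1) // self.cluster_size
        for c in range(first, last + 1):
            self.bitmap[c >> 3] |= 1 << (c & 7)

    def clear(self, cluster: int) -> None:
        """Clear the dirty bit once the replica has confirmed the cluster."""
        self.bitmap[cluster >> 3] &= ~(1 << (cluster & 7)) & 0xFF

    def is_dirty(self, cluster: int) -> bool:
        return bool(self.bitmap[cluster >> 3] & (1 << (cluster & 7)))

    def dirty_clusters(self) -> list:
        return [c for c in range(self.n_clusters) if self.is_dirty(c)]

    def replay(self, primary: bytearray, replica: bytearray) -> None:
        """Copy each dirty cluster once, in bitmap order (not write order)."""
        cs = self.cluster_size
        for c in self.dirty_clusters():
            replica[c * cs:(c + 1) * cs] = primary[c * cs:(c + 1) * cs]
            self.clear(c)

intent = IntentLog(data_set_size=4096, cluster_size=512)
primary = bytearray(4096)
replica = bytearray(4096)

def primary_write(offset: int, data: bytes) -> None:
    primary[offset:offset + len(data)] = data
    intent.mark_write(offset, len(data))

# Contact is lost: writes only set bits; the log itself never grows.
primary_write(500, b"x" * 100)   # straddles clusters 0 and 1
primary_write(0, b"y" * 10)      # cluster 0 again, still a single bit
primary_write(4000, b"z" * 96)   # cluster 7
dirty_before_replay = intent.dirty_clusters()
bitmap_size = len(intent.bitmap)

# Contact is restored: only the dirty clusters are copied.
intent.replay(primary, replica)
```

The bitmap here is a single byte for the whole 4096-byte data set regardless of how many writes occur, which illustrates why the log has a known size and cannot overflow. Because the replay copies clusters in bitmap order rather than write order, the replica in this sketch is only guaranteed to match the primary once the replay has finished.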
Note also that, for a large outage, an intent log replay is far more efficient than a transaction log replay: an intent log replays each dirty cluster only once, whereas a transaction log may have to rewrite the same cluster of data many times because it mirrors exactly how the data changed in the primary volume.
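The difference in replay volume can be made concrete with a small calculation. The function name and the workload (1000 writes of one kilobyte to the same cluster) are hypothetical figures chosen for illustration, and log entry overhead is ignored.

```python
def replay_costs(writes, cluster_size):
    """Return (transaction log bytes, intent log bytes) sent during replay.

    writes is a list of (offset, length) pairs made while out of contact.
    """
    # A transaction log replays every write in full.
    txn_bytes = sum(length for _, length in writes)
    # An intent log replays each dirty cluster exactly once.
    dirty = set()
    for offset, length in writes:
        first = offset // cluster_size
        last = (offset + length - 1) // cluster_size
        dirty.update(range(first, last + 1))
    return txn_bytes, len(dirty) * cluster_size

# Assumed workload: the same 1 KiB cluster is overwritten 1000 times.
writes = [(0, 1024)] * 1000
txn_bytes, intent_bytes = replay_costs(writes, cluster_size=1024)
```

Under these assumed figures the transaction log replay sends 1,024,000 bytes while the intent log replay sends 1,024 bytes.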