The invention relates generally to methods for synchronizing distributed data sets. Consider a scenario where a large data set, e.g., a database or compiled programming routines, stored in a first memory (or storage device) is duplicated in a second memory (or storage device). This scenario would occur, for example, in a file backup operation wherein a data set is copied onto a magnetic tape or in database replication wherein all or a portion of a database is copied onto a different machine. For discussion purposes, the data set in the first memory will be referred to as the “original data set,” and the data set in the second memory will be referred to as the “remote copy.” At some point in time, either the original data set or the remote copy or both may be modified. Typically, the amount of data changed is relatively small in comparison to the total size of the original data set. The task then becomes how to synchronize the original data set and the remote copy in an efficient manner.
There are various prior art techniques for synchronizing distributed data sets. One data synchronization technique uses “time stamps” to identify the areas of differences in the data sets prior to transferring data between the data sets. In this technique, a memory space is allocated to hold the time of the last update for every block of data in the data set. Every executable routine that operates on the content of the data block logs the time stamp of the update. During data synchronization, the time stamp is used to determine if the corresponding data block has changed since the last synchronization. If the corresponding data block has changed, data is then transferred between the original data set and the remote copy. In general, the more precisely the areas of differences in the data sets can be identified, the smaller the amount of data transfer required between the original data set and the remote copy. Time stamps allow immediate identification of different data blocks that need to be synchronized, so no additional processing beyond time stamp comparison is needed to perform the data synchronization. Time stamps are typically used in file backups, code recompilation routines, and database replication.
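The time stamp comparison described above can be sketched as follows. This is a minimal illustration only; the names Block, updated_at, and blocks_to_transfer are hypothetical and do not come from any particular prior art system.

```python
from dataclasses import dataclass


@dataclass
class Block:
    data: bytes
    updated_at: float  # time of the last update, logged by every writer (assumed)


def blocks_to_transfer(blocks, last_sync_time):
    """Return indices of blocks whose time stamp is newer than the last sync."""
    return [i for i, b in enumerate(blocks) if b.updated_at > last_sync_time]


blocks = [Block(b"alpha", 100.0), Block(b"beta", 250.0), Block(b"gamma", 90.0)]
changed = blocks_to_transfer(blocks, last_sync_time=200.0)
```

Note that the comparison reports a block as changed whenever it was written after the last synchronization, regardless of whether its content actually differs.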
There are several issues to take into consideration when using time stamps to synchronize data. For example, time stamps allow immediate identification of data blocks that have been updated, but do not indicate whether the content of the data blocks actually changed. Thus, unnecessary data transfer between the original data set and the remote copy may be initiated. Time stamps also do not typically provide sufficient granularity for minimal data transfer. For example, in most file backup services, if a one-byte change occurs in a file, the entire file is transferred. Memory overhead can also be a concern when fine granularity time stamping is required. For example, if it is desirable to keep the granularity of data transfer to a field level for a database table, then the number of fields in the table must be doubled to accommodate time stamping. Furthermore, for proper granular time stamping, upfront design of the data set layout and data access routines is required. This means, for example, that unless adequate space is allocated upfront to hold the time stamps and data access routines are programmed to log the time stamps upon the updates, synchronization at a later time may not be possible.
Another technique for synchronizing data uses “dirty flags” to identify modified blocks of data in a data set. Dirty flags are similar to time stamps, except that they usually hold a Boolean value instead of time. For every block of data, a bit of memory space is allocated to hold the value of the dirty flag. The dirty flag reflects whether the data block has been changed since the last synchronization. Dirty flags are used in database replication and transactional processing. Like time stamps, dirty flags allow immediate identification of different data blocks that need to be synchronized. Synchronization techniques using dirty flags also face many of the challenges discussed above for time stamps. In addition, dirty flags are not applicable to situations where the last time of synchronization is ambiguous. This may occur, for example, if more than one remote copy of the original data set exists and each remote copy is synchronized at different points in time.
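The ambiguity with multiple remote copies can be illustrated with a small sketch. It assumes a single Boolean flag per block that is cleared after any synchronization; the class FlaggedStore and its methods are hypothetical names chosen for this example.

```python
class FlaggedStore:
    """Data set with one dirty flag per block (illustrative only)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.dirty = [False] * len(self.blocks)

    def write(self, index, data):
        self.blocks[index] = data
        self.dirty[index] = True  # every writer must remember to set the flag

    def sync_to(self, replica):
        """Copy dirty blocks into replica, clear all flags, return sent indices."""
        sent = []
        for i, flag in enumerate(self.dirty):
            if flag:
                replica[i] = self.blocks[i]
                sent.append(i)
        self.dirty = [False] * len(self.blocks)
        return sent


store = FlaggedStore([b"a", b"b", b"c"])
replica_1 = list(store.blocks)
replica_2 = list(store.blocks)
store.write(1, b"B")
sent_1 = store.sync_to(replica_1)  # the change reaches the first replica
sent_2 = store.sync_to(replica_2)  # flags were already cleared, nothing is sent
```

After both calls, the second remote copy is stale even though the flags report no pending changes, because a single flag cannot record which copy was synchronized last.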
Another technique for identifying modified data blocks in a data set is “version numbers.” Version numbers are also similar to time stamps. For every block of data, a memory space is allocated to hold the version number of that data block. The version number, usually a string of characters of some fixed length, is changed whenever the data block changes. The version numbers are then used by the synchronization algorithm to determine if the corresponding data block has changed since the last synchronization. Like time stamps, version numbers allow immediate identification of different data blocks that need to be synchronized. Synchronization techniques using version numbers also face many of the challenges discussed above for time stamps, most notably insufficient granularity for minimal data transfer. Version numbers work well with coarse-grained blocks, especially when the updates are infrequent and/or bandwidth is not an issue. Version numbers are typically used in synchronization operations involving software distribution and source control.
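A version number comparison might look like the sketch below. The block identifiers, the fixed-length version strings, and the function stale_blocks are assumptions made for illustration.

```python
def stale_blocks(original_versions, remote_versions):
    """Return ids of blocks whose version differs from, or is absent in, the remote copy."""
    return sorted(
        block_id
        for block_id, version in original_versions.items()
        if remote_versions.get(block_id) != version
    )


original_versions = {"kernel": "v0003", "shell": "v0007", "docs": "v0001"}
remote_versions = {"kernel": "v0003", "shell": "v0006"}
outdated = stale_blocks(original_versions, remote_versions)
```

Each identifier here stands for a coarse-grained block, so a one-byte change in "shell" would still cause the whole block to be transferred.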
In one aspect, the invention is a method for synchronizing two data sets. In some embodiments, the method for synchronizing two data sets comprises computing a signature for a first data set in a first address space and a signature for a second data set in a second address space using a one-way hash function. The method further includes comparing the signatures for the first and second data sets to determine whether they are identical. If the signatures are not identical, the method further includes identifying an area of difference between the first data set and the second data set and transferring data corresponding to the area of difference between the first data set and the second data set from the first data set to the second data set.
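As a rough sketch of this aspect, the example below compares signatures of two data sets and, when they differ, narrows down the area of difference before copying it. Several choices are assumptions not stated in this passage: SHA-256 stands in for the one-way hash function, the area of difference is located by recursively halving the range and comparing signatures of the halves, and both data sets are assumed to have equal length.

```python
import hashlib


def signature(data) -> bytes:
    """One-way hash of a byte range (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).digest()


def find_differences(first, second, lo, hi, min_size=4):
    """Return (start, end) ranges where the two data sets differ."""
    if signature(first[lo:hi]) == signature(second[lo:hi]):
        return []  # identical signatures: treat the range as identical
    if hi - lo <= min_size:
        return [(lo, hi)]  # small enough to transfer as a whole
    mid = (lo + hi) // 2
    return (find_differences(first, second, lo, mid, min_size)
            + find_differences(first, second, mid, hi, min_size))


def synchronize(first: bytes, second: bytearray):
    """Copy only the differing ranges from first into second."""
    ranges = find_differences(first, second, 0, len(first))
    for start, end in ranges:
        second[start:end] = first[start:end]
    return ranges


original = b"0123456789abcdef"
remote = bytearray(b"0123456789abXdef")
transferred = synchronize(original, remote)
```

In this sketch only signatures would need to cross between address spaces until a differing range is found, at which point the data for that range is transferred.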
In some embodiments, the method for synchronizing two data sets comprises subdividing a first data set in a first address space and a second data set in a second address space into their respective elementary data blocks. The method further includes computing a signature for each elementary data block using a one-way hash function and storing the signatures of the elementary data blocks in the first data set in a first array and the signatures of the elementary data blocks in the second data set in a second array. The method further includes comparing each signature in the first array to a corresponding signature in the second array to determine whether they are identical and, if they are not identical, transferring the corresponding data block from the first data set to the second data set.
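The block-wise comparison in this embodiment can be sketched as follows. The block size, the use of SHA-256 as the one-way hash function, and the function names are assumptions for illustration; the sketch also assumes both data sets subdivide into the same number of elementary data blocks.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical elementary block size in bytes


def split_blocks(data, size=BLOCK_SIZE):
    """Subdivide a data set into elementary data blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def signature_array(blocks):
    """Compute one signature per elementary data block."""
    return [hashlib.sha256(b).digest() for b in blocks]


def sync_blocks(first_blocks, second_blocks):
    """Copy every block whose signature differs; return the copied indices."""
    first_sigs = signature_array(first_blocks)
    second_sigs = signature_array(second_blocks)
    transferred = []
    for i, (a, b) in enumerate(zip(first_sigs, second_sigs)):
        if a != b:
            second_blocks[i] = first_blocks[i]
            transferred.append(i)
    return transferred


first_blocks = split_blocks(b"AAAABBBBCCCCDDDD")
second_blocks = split_blocks(b"AAAABxBBCCCCDDDz")
moved = sync_blocks(first_blocks, second_blocks)
```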
In some embodiments, the method for synchronizing two data sets comprises subdividing a first data set in a first address space and a second data set in a second address space into their respective elementary data blocks. The method further includes computing a signature for each elementary data block using a first one-way hash function and storing the signatures of the elementary data blocks in the first data set in a first array and the signatures of the elementary data blocks in the second data set in a second array. The method further includes computing a signature for the first array and a signature for the second array using a second one-way hash function and comparing the signatures for the first and second arrays to determine whether they are identical. If the signatures for the first and second arrays are not identical, the method further includes identifying the unique signatures in the first and second arrays and transferring the elementary data blocks corresponding to the unique signatures from the first data set to the second data set.
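A minimal sketch of this two-level scheme is given below. SHA-256 and BLAKE2b are assumed stand-ins for the first and second one-way hash functions, and "unique signatures" is interpreted here as signatures in the first array that do not appear anywhere in the second array; both choices are assumptions of this example.

```python
import hashlib


def block_signatures(blocks):
    """First one-way hash: one signature per elementary data block."""
    return [hashlib.sha256(b).digest() for b in blocks]


def array_signature(sigs):
    """Second one-way hash: a single signature for a whole signature array."""
    return hashlib.blake2b(b"".join(sigs)).digest()


def blocks_to_send(first_blocks, second_blocks):
    """Return indices of first-set blocks whose signatures the second set lacks."""
    first_sigs = block_signatures(first_blocks)
    second_sigs = block_signatures(second_blocks)
    if array_signature(first_sigs) == array_signature(second_sigs):
        return []  # arrays match, so no per-block comparison is needed
    known = set(second_sigs)
    return [i for i, s in enumerate(first_sigs) if s not in known]


first_blocks = [b"row-1", b"row-2", b"row-3-new", b"row-4"]
second_blocks = [b"row-1", b"row-2", b"row-3", b"row-4"]
to_send = blocks_to_send(first_blocks, second_blocks)
unchanged = blocks_to_send(first_blocks, list(first_blocks))
```

When the data sets are already identical, a single array signature comparison is enough to confirm it, and no per-block signatures need to be exchanged.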
In another aspect, the invention is a data synchronization system which comprises a first agent having access to a first data set in a first address space, a second agent having access to a second data set in a second address space, and an engine which communicates with the first agent and the second agent when activated. In some embodiments, the engine is configured to send a request to the first agent to compute a signature for the first data set in the first address space and a request to the second agent to compute a signature for the second data set in the second address space using a one-way hash function. The engine is also configured to transfer the signature for the first data set from the first address space to the second address space and send a request to the second agent to determine whether the signature for the first data set is identical to the signature for the second data set. The engine is also configured to identify an area of difference between the first data set and the second data set in collaboration with the first and second agents if the signatures of the data sets are not identical and, upon identifying the area of difference between the data sets, transfer data corresponding to the area of difference between the data sets from the first address space to the second address space and copy the data into the second data set.
In some embodiments, the engine is configured to send a request to the first agent to subdivide the first data set into elementary data blocks, compute a signature for each elementary data block using a one-way hash function, and store the signatures of the elementary data blocks in a first array. The engine is also configured to send a request to the second agent to subdivide the second data set into elementary data blocks, compute a signature for each elementary block using the one-way hash function, and store the signatures of the elementary data blocks in a second array. The engine is also configured to transfer the first array from the first address space to the second address space and send a request to the second agent to compare each signature in the first array to a corresponding signature in the second array to determine whether they are identical and, if they are not identical, transfer the corresponding data block from the first data set to the second data set.
In some embodiments, the engine is configured to send a request to the first agent to subdivide the first data set into elementary data blocks, compute a signature for each elementary data block using a first one-way hash function, store the signatures of the elementary data blocks in a first array, and compute a signature for the first array using a second one-way hash function. The engine is also configured to send a request to the second agent to subdivide the second data set into elementary data blocks, compute a signature for each elementary block using the first one-way hash function, store the signatures of the elementary data blocks in a second array, and compute a signature for the second array using the second one-way hash function. The engine is also configured to transfer the signature for the first array from the first address space to the second address space and send a request to the second agent to determine whether the signature for the first array is identical to the signature for the second array. The engine is also configured to identify an area of difference between the first array and the second array in collaboration with the first and second agents if the signatures of the arrays are not identical and, upon identifying the area of difference between the arrays, transfer data corresponding to the area of difference between the arrays from the first address space to the second address space and copy the data into the second data set.
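The division of work between the engine and the two agents might be sketched as below. The Agent and Engine classes, their method names, and the choice of SHA-256 and BLAKE2b as the two one-way hash functions are all hypothetical; in this sketch the area of difference is found by comparing block signatures position by position, which is only one possible approach, and both data sets are assumed to hold the same number of blocks.

```python
import hashlib


class Agent:
    """Has access to one data set in its own address space (hypothetical API)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def block_signatures(self):
        return [hashlib.sha256(b).digest() for b in self.blocks]

    def array_signature(self):
        return hashlib.blake2b(b"".join(self.block_signatures())).digest()

    def matches(self, other_array_signature):
        return self.array_signature() == other_array_signature

    def read_block(self, index):
        return self.blocks[index]

    def write_block(self, index, data):
        self.blocks[index] = data


class Engine:
    """Coordinates the agents; moves signatures first, data only when needed."""

    def __init__(self, first_agent, second_agent):
        self.first = first_agent
        self.second = second_agent

    def synchronize(self):
        # Transfer the array signature and let the second agent compare it.
        if self.second.matches(self.first.array_signature()):
            return []
        first_sigs = self.first.block_signatures()
        second_sigs = self.second.block_signatures()
        differing = [i for i, (a, b) in enumerate(zip(first_sigs, second_sigs))
                     if a != b]
        for i in differing:
            self.second.write_block(i, self.first.read_block(i))
        return differing


first_agent = Agent([b"north", b"south", b"east", b"west"])
second_agent = Agent([b"north", b"SOUTH", b"east", b"west"])
engine = Engine(first_agent, second_agent)
first_pass = engine.synchronize()
second_pass = engine.synchronize()
```

The second call returns immediately after the array signature comparison, since the first pass has already brought the second data set into agreement with the first.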