In computing systems and networks, data files are frequently replicated on multiple computers and storage devices, for various purposes. For example, for a given file in a primary storage device, it is often desirable to create a backup of the file and to store the backup file in a separate secondary storage device. The original copy of the file can then be easily recovered in the event the primary storage device becomes inoperable, or if the original copy becomes corrupt or deleted. Accordingly, even in the event of failure, important data can be recovered without significant file reconstruction efforts. Various storage management utilities and services can be utilized for such backup procedures.
In computer networks, replication of files and data can also take place for the purposes of synchronization. A synchronized file is one that exists in two different locations, such as on two different servers. By maintaining multiple synchronized copies at multiple locations, not only are alternative copies available in the event of a failure or loss of data, but system efficiency can also be improved. For example, each individual user of the network can access the closest replica of the data, thereby providing quicker access to the data and reducing network traffic.
However, while providing significant advantages, replication of files for such backup or synchronization purposes can require significant bandwidth. Moreover, copying a file from one location to another can require significant processing time and storage space. Accordingly, incremental replication procedures have been utilized where only those files that have been changed since the last backup are replicated. By replicating only the modified files and not the unmodified files, the replication process becomes more efficient.
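The selection step of such an incremental procedure can be illustrated with a minimal Python sketch. It assumes that each file carries a last-modified timestamp and that the time of the previous backup is known; the `FileRecord` type, the `select_for_incremental` function, and the sample file names are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical catalog entry: a path and its last-modified time in seconds."""
    path: str
    modified_at: float


def select_for_incremental(records, last_backup_at):
    """Return the paths of files modified after the previous backup."""
    return [r.path for r in records if r.modified_at > last_backup_at]


# Illustrative catalog; the names and timestamps are made up.
catalog = [
    FileRecord("report.doc", modified_at=100.0),
    FileRecord("budget.xls", modified_at=250.0),
    FileRecord("notes.txt", modified_at=300.0),
]

# Only files changed after time 200.0 are chosen for replication.
to_replicate = select_for_incremental(catalog, last_backup_at=200.0)
print(to_replicate)
```

In this sketch the unmodified file is skipped entirely, while each modified file would still be copied in full.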
While incremental replication of modified files can reduce network bandwidth consumption as compared to complete replication of all files, such procedures can still suffer from inefficiency. This is especially the case when only small portions of files have actually been modified, but a copy of the entire modified file is transmitted during the incremental replication. Accordingly, it can be desirable to utilize replication procedures that include differencing mechanisms, which identify the differences between the backup (base) version of the original file and the revised version of the original file. The differences can be stored in a delta file, which, in conjunction with the base version, can be utilized to reconstruct the revised version. Thus, only the delta file needs to be transmitted to the replica location during the replication, rather than the entire file. Because the delta file is typically much smaller than the revised file, the transmission of the delta file to the location of the base file can become much more efficient.
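The relationship between a base version, a delta, and a revised version can be sketched as follows. This is only an illustration under stated assumptions: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in differencing mechanism, and the delta layout (a list of `copy` and `data` operations) and the names `make_delta` and `apply_delta` are hypothetical rather than any particular product's format.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, revised: bytes):
    """Describe `revised` as copies from `base` plus literal new bytes."""
    delta = []
    matcher = SequenceMatcher(None, base, revised, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged range: reference it by position instead of resending it.
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Changed or new range: the literal bytes must travel in the delta.
            delta.append(("data", revised[j1:j2]))
        # "delete" ranges need no entry; they are simply never copied.
    return delta


def apply_delta(base: bytes, delta) -> bytes:
    """Reconstruct the revised version from the base version and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, start, end = op
            out += base[start:end]
        else:
            out += op[1]
    return bytes(out)


base = b"The quick brown fox jumps over the lazy dog. " * 20
revised = base.replace(b"lazy", b"sleepy", 1)

delta = make_delta(base, revised)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
assert apply_delta(base, delta) == revised
print(len(revised), literal_bytes)
```

Only the literal bytes and a few position pairs would need to be transmitted, which is why a small edit to a large file can yield a delta far smaller than the file itself.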
Some methods of identifying the differences between the base version of a file and the revised version involve the generation of a base signature file as a function of the data in the base version, as well as the generation of a revised signature file as a function of the data in the revised version. The two signature files and the revised version can then be utilized to generate the delta file reflecting the differences between the base version and the revised version. A delta file can be created in this manner for each subsequent revision to a file. Because each delta file represents the differences between one version and the next, it can be used in either a forward direction, where it is applied to the base version to reconstruct the revised version, or in a backward direction, where it is applied in an opposite manner to the revised version to reconstruct the base version.
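One way a single delta could serve both directions is to record, for every changed region, both the old bytes and the new bytes. The Python sketch below assumes that layout; the operation names and the functions `make_reversible_delta`, `apply_forward`, and `apply_backward` are hypothetical, and `difflib.SequenceMatcher` is again used only as a convenient stand-in for the differencing step rather than the signature-based approach described above.

```python
from difflib import SequenceMatcher


def make_reversible_delta(base: bytes, revised: bytes):
    """Record unchanged run lengths and, for each change, both old and new bytes."""
    delta = []
    matcher = SequenceMatcher(None, base, revised, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("equal", i2 - i1))
        else:
            # Covers replace, insert (old is empty), and delete (new is empty).
            delta.append(("change", base[i1:i2], revised[j1:j2]))
    return delta


def apply_forward(base: bytes, delta) -> bytes:
    """Base version + delta -> revised version."""
    out, pos = bytearray(), 0
    for op in delta:
        if op[0] == "equal":
            out += base[pos:pos + op[1]]
            pos += op[1]
        else:
            _, old, new = op
            out += new
            pos += len(old)  # skip the bytes being replaced in the base
    return bytes(out)


def apply_backward(revised: bytes, delta) -> bytes:
    """Revised version + the same delta -> base version."""
    out, pos = bytearray(), 0
    for op in delta:
        if op[0] == "equal":
            out += revised[pos:pos + op[1]]
            pos += op[1]
        else:
            _, old, new = op
            out += old
            pos += len(new)  # skip the bytes that replaced them in the revision
    return bytes(out)


v1 = b"alpha beta gamma delta"
v2 = b"alpha BETA gamma epsilon delta"
d = make_reversible_delta(v1, v2)
assert apply_forward(v1, d) == v2
assert apply_backward(v2, d) == v1
```

Keeping the old bytes makes the delta larger than a forward-only delta, which is the cost of being able to step back to an earlier version.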
The creation of such signature files for the base version and for the revised version can utilize signature algorithms, which operate on the data in a file and produce values that represent that data. The signature values, rather than the entire file, can then be processed and handled for the creation of the delta file. Because these signature values are shorter than the data in the entire file, they are easier and faster to transmit and process.
In some such methods utilizing signatures, the data in the base version is divided into blocks, and the signature algorithm operates on all of the data in each block to determine the signature value for the block. Likewise, all of the data in the revised version is consecutively processed by a similar signature algorithm to obtain signature values for the revised version. The signature values from the two versions are then compared to identify the similarities and differences between the two versions and to thereby create a delta file identifying the differences between the two. Rather than transmitting the revised version, this delta file is transmitted to the location of the base file to allow for a replication of the revised version, thereby reducing bandwidth requirements.
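The block-wise comparison described above can be sketched in Python as follows. The sketch rests on several simplifying assumptions that are not taken from the text: blocks have a fixed size and are aligned identically in both versions, SHA-256 from the standard `hashlib` module stands in for the signature algorithm, and the names `split_blocks`, `signatures`, `make_delta`, and `apply_delta` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 8  # deliberately tiny so the example is easy to follow


def split_blocks(data: bytes, size: int = BLOCK_SIZE):
    """Divide the data into consecutive fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def signatures(data: bytes, size: int = BLOCK_SIZE):
    """Compute one signature value per block."""
    return [hashlib.sha256(b).hexdigest() for b in split_blocks(data, size)]


def make_delta(base_sigs, revised: bytes, size: int = BLOCK_SIZE):
    """Compare revised-block signatures against the base signatures."""
    index = {sig: n for n, sig in enumerate(base_sigs)}
    delta = []
    for block in split_blocks(revised, size):
        sig = hashlib.sha256(block).hexdigest()
        if sig in index:
            delta.append(("copy", index[sig]))  # block already exists in the base
        else:
            delta.append(("data", block))       # block must be sent literally
    return delta


def apply_delta(base: bytes, delta, size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the revised version at the location holding the base version."""
    blocks = split_blocks(base, size)
    return b"".join(blocks[op[1]] if op[0] == "copy" else op[1] for op in delta)


base = b"AAAAAAAABBBBBBBBCCCCCCCC"
revised = b"AAAAAAAAXXXXXXXXCCCCCCCC"

delta = make_delta(signatures(base), revised)
assert delta == [("copy", 0), ("data", b"XXXXXXXX"), ("copy", 2)]
assert apply_delta(base, delta) == revised
```

Note that every block of both versions is hashed even though only one block changed, and that with fixed alignment a single inserted byte would shift all later blocks so that none of them match. The sketch therefore also illustrates why such methods can be processing-intensive.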
Accordingly, the use of such signature algorithms to identify differences between files can result in the creation of very accurate delta files which are transmitted to the desired location across the data connection. Such algorithms can also allow for precise reconstruction of the corresponding version of the file without requiring the transmission of an entire file, thus providing a reduction in the amount of data transmitted. However, the use of at least some such signature and differencing algorithms can be computationally intensive, as they can require sequential processing of the data in the file, even for data that has not changed. Therefore, such processes can be time-consuming and have high processing requirements. Moreover, the delta files created by such methods can still require significant bandwidth for transmission and significant memory space for storage, particularly if the differences between the two files are significant.
Accordingly, improved methods and systems are desired for identifying the differences between two versions of a file, and improved methods and systems are desired for replicating a revised version of a file.