The invention generally relates to the field of data compression. More specifically, the invention relates to techniques, applicable to data which occurs in different versions, for finding differences between the versions.
DIFFERENCING ALGORITHMS AND DELTA FILES
Differencing algorithms compress data by taking advantage of statistical correlations between different versions of the same data sets. Strictly speaking, differencing algorithms achieve compression by finding common sequences between two versions of the same data that can be encoded using a copy reference.
The term “file” will be used to indicate a linear data set to be addressed by a differencing algorithm. Typically, a file is modified one or more times, each modification producing a successive “version” of the file.
While this terminology is conventional, differencing applies more generally to any versioned data and need not be limited to files.
A differencing algorithm is defined as an algorithm that finds and outputs the changes made between two versions of the same file by locating common sequences to be copied, and by finding unique sequences to be added explicitly.
A delta file (Δ) is the encoding of the output of a differencing algorithm. An algorithm that creates a delta file takes as input two versions of a file, a base file and a version file to be encoded, and outputs a delta file representing the incremental changes made between versions.
F_base + F_version → Δ(base, version)
Reconstruction, the inverse operation, requires the base file and a delta file to rebuild a version.
F_base + Δ(base, version) → F_version
FIG. 1 is an illustration of the process of creating a delta file from a base file and a version file. A base file 2 and a version file 4 are shown schematically, in a linear “memory map” format. They are lined up parallel to each other for illustrative purposes.
Different versions of a file may be characterized as having sequences of data or content. Some of the sequences are unchanged between the versions, and may be paired up with each other. See, for instance, unchanged sequences 6 and 8. By contrast, a sequence of one version (e.g., a sequence 10 in the base file) may have been changed to a different sequence in the version file (e.g., 12).
One possible encoding of a delta file, shown as 14, consists of a linear array of editing directives. These directives include copy commands, such as 16, which are references to a location in the base file 2 where the same data as that in the version file 4 exists; and further include add commands, such as 18, which are instructions to add data into the version file 4, the add data instruction 18 being followed by the data (e.g., 20) to be added.
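The reconstruction step implied by such an array of directives can be sketched as follows. This is a minimal illustration only: the tuple forms ("COPY", offset, length) and ("ADD", data), and the function name apply_delta, are assumptions made for the example and are not the encoding shown in FIG. 1.

```python
from typing import List, Tuple, Union

# Hypothetical in-memory directive forms (assumed for illustration):
#   ("COPY", base_offset, length)  copy bytes from the base file
#   ("ADD", data)                  append literal bytes to the version
Directive = Union[Tuple[str, int, int], Tuple[str, bytes]]


def apply_delta(base: bytes, delta: List[Directive]) -> bytes:
    """Rebuild a version file from a base file and a list of directives."""
    out = bytearray()
    for directive in delta:
        if directive[0] == "COPY":
            _, offset, length = directive
            if offset < 0 or offset + length > len(base):
                raise ValueError("COPY refers to data outside the base file")
            out += base[offset:offset + length]
        elif directive[0] == "ADD":
            out += directive[1]
        else:
            raise ValueError("unknown directive: %r" % (directive[0],))
    return bytes(out)


base = b"the quick brown fox"
delta = [("COPY", 0, 4), ("ADD", b"slow "), ("COPY", 10, 9)]
version = apply_delta(base, delta)  # b"the slow brown fox"
```

Only the five added bytes travel in the delta as literal data; the remaining thirteen bytes of the version are expressed as two references into the base file.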
Whatever representation scheme is used, a differencing algorithm must first have found the copies and adds to be encoded. Encoding techniques other than the one shown are therefore compatible with the methods to be presented in accordance with the invention.
DIFFERENTIAL ALGORITHMS APPLIED
Several potential applications of version differencing motivate the need for a compact and efficient differencing algorithm. Such an algorithm can be used to distribute software over a low bandwidth network such as a point-to-point modem link or the Internet. Upon releasing a new version of software, the version is differenced with respect to the previous version. With compact versions, a low bandwidth channel can effectively distribute a new release of dynamically self-updating software in the form of a binary patch. This technology has the potential to greatly reduce time to market on a new version, and to ease the distribution of software customizations. For replication in distributed file systems, differencing can reduce by a large factor the amount of information that needs to be updated by transmitting deltas for all of the modified files in the replicated file set.
In distributed file system backup and restore, differential compression would reduce the time to perform file system backup, decrease network traffic during backup and restore, and lessen the storage to maintain a backup image. See U.S. Pat. No. 5,574,906, issued to Robert Morris, titled “System and Method for Reducing Storage Requirement in Backup Subsystems Utilizing Segmented Compression and Differencing”.
The '906 patent describes that backup and restore can be limited by both bandwidth on the network, often 10 MB/s, and poor throughput to secondary and tertiary storage devices, often 500 KB/s to tape storage. Since resource limitations frequently make backing up just the changes to a file system infeasible over a single night or even weekend, differential file compression has great potential to alleviate bandwidth problems by using available processor cycles to reduce the amount of data transferred. This technology can be used to provide backup and restore services on a subscription basis over any network including the Internet.
PREVIOUS WORK IN DIFFERENCING
Differencing has its origins in longest common subsequence (LCS) algorithms, and in the string-to-string correction problem. For examples of the former, see A. Apostolico, S. Browne, and C. Guerra, “Fast linear-space computations of longest common subsequences”, Theoretical Computer Science, 92(1):3-17, 1992 and Claus Rick, “A new flexible algorithm for the longest common subsequence problem”, Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching, Espoo, Finland, Jul. 5-7, 1995. For an example of the latter, see R. A. Wagner and M. J. Fischer, “The string-to-string correction problem”, Journal of the ACM, 21(1):168-173, January 1973.
Some of the first applications of differencing updated the screens of slow terminals by sending a set of edits to be applied locally rather than retransmitting a screen full of data. Another early application was the UNIX “diff” utility, which used the LCS method to find and output the changes to a text file. diff was useful for source code development and for primitive document control.
LCS algorithms find the longest common sequence between two strings by optimally removing symbols in both files leaving identical and sequential symbols. (A string/substring contains all consecutive symbols between and including its first and last symbol, whereas a sequence/subsequence may omit symbols with respect to the corresponding string.)
While the LCS indicates the sequential commonality between strings, it does not necessarily detect the minimum set of changes. More generally, it has been asserted that string metrics that examine symbols sequentially fail to emphasize the global similarity of two strings. See A. Ehrenfeucht and D. Haussler, “A new distance metric on strings computable in linear time”, Discrete Applied Mathematics, 20:191-203, 1988.
In Webb Miller and Eugene W. Myers, “A file comparison program”, Software: Practice and Experience, 15(11):1025-1040, November 1985, the limitations of LCS are established, with regard to a new file compare program that executes at four times the speed of the diff program while producing significantly smaller deltas.
In Walter F. Tichy, “The string-to-string correction problem with block move”, ACM Transactions on Computer Systems, 2(4), November 1984, the edit distance is shown to be a better metric for the difference of files, and techniques based on this method enhanced the utility and speed of file differencing. The edit distance assigns a cost to edit operations such as “delete a symbol”, “insert a symbol”, and “copy a symbol”. For example, one longest common subsequence between strings xyz and xzy is xy, which neglects the common symbol z. Using the edit distance metric, z may be copied between the two strings producing a smaller change cost than LCS.
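The xyz and xzy example can be checked with a textbook dynamic-programming LCS. The sketch below is that standard method and is not taken from any of the cited papers; the function name lcs and its tie-breaking rule are choices made for this illustration.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (textbook DP)."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to recover one optimal subsequence.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] > table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


print(lcs("xyz", "xzy"))  # prints "xy"
```

Whichever subsequence of length two is chosen, one of the three symbols shared by both strings is left out of it, which is the shortcoming that a copy operation addresses.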
In the string-to-string correction problem given in Wagner et al. (supra), an algorithm minimizes the edit distance to minimize the cost of a given string transformation.
In Tichy (supra), the string-to-string correction problem is adapted to file differencing using the concept of block move. Block move allows an algorithm to copy a string of symbols, rather than an individual symbol. The algorithm is then applied to a source code revision control package, to create RCS. See Walter F. Tichy, “RCS - A system for version control”, Software: Practice and Experience, 15(7):637-654, July 1985.
RCS detects the modified lines in a file, and encodes a delta file by adding these lines and indicating lines to be copied from the base version. This is referred to as “differencing at line granularity.” The delta file is a line-by-line edit script applied to a base file to convert it to the new version. Although the SCCS version control system (Marc J. Rochkind, “The source code control system”, IEEE Transactions on Software Engineering, SE-1(4):364-370, December 1975) precedes RCS, RCS generates minimal line granularity delta files, and is the definitive previous work in version control.
Source code control has been the major application for differencing. These packages allow authors to store and recall file versions. Software releases may be restored exactly, and changes are recoverable. Version control has also been integrated into a line editor, so that on every change a minimal delta is retained. See Christopher W. Fraser and Eugene W. Myers, “An editor for revision control”, ACM Transactions on Programming Languages and Systems, 9(2):277-295, April 1987. This allows for an unlimited undo facility without excessive storage.
THE GREEDY ALGORITHM
A well-known class of differencing algorithms may be termed “greedy” algorithms. Greedy algorithms often provide simple solutions to optimization problems by making what appears to be the best decision, i.e., the “greedy” decision, at each step. For differencing files, the greedy algorithm takes the longest match it can find at a given offset on the assumption that this match provides the best compression. It makes a locally optimal decision with the hope that this decision is part of the optimal solution over the input.
A greedy algorithm for file differencing is given by Christoph Reichenberger, “Delta storage for arbitrary non-text files”, Proceedings of the 3rd International Workshop on Software Configuration Management, Trondheim, Norway, Jun. 12-14, 1991, pages 144-152. ACM, June 1991.
For file differencing, the greedy algorithm provides an optimal encoding of a delta file, but it requires time proportional to the product of the sizes of the input files. We present an algorithm which approximates the greedy algorithm in linear time and constant space by finding the match that appears to be the longest without performing exhaustive search for all matching strings.
DELTA COMPRESSION WITH GREEDY TECHNIQUES
Given a base file and another version of the same file, the greedy algorithm for constructing differential files finds and encodes the longest copy in the base file corresponding to the first offset in the version file. After advancing the offset in the version file past the encoded copy, it looks for the longest copy starting at the current offset. If at a given offset, it cannot find a copy, the symbol at this offset is marked to be added and the algorithm advances to the following offset.
Referring now to FIG. 3, the first task the algorithm performs is to construct a hash table and a link list out of the base version of the input files. The hash table allows an algorithm to store or identify the offset of a string with a given footprint. The link list stores the offsets of the footprints, beyond the initial footprint, that hash to the same value. In this example, the strings at offsets A1, A2, A3, and A4 all have a footprint with value A. The link list effectively performs as a re-hash function for this data structure.
These data structures are assembled, for instance by the function BuildHashTable in FIG. 4.
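One way such a hash table and link list might be assembled is sketched below. The footprint length, the table size, the polynomial hash, and the helper names are all assumptions made for illustration; FIG. 4 itself is not reproduced here.

```python
from typing import List, Tuple

FOOTPRINT_LEN = 4     # assumed prefix length; the text does not fix one
TABLE_SIZE = 1 << 16  # assumed number of hash table entries
NONE = -1             # marks an empty slot or the end of a chain


def footprint(data: bytes, offset: int) -> int:
    """Illustrative polynomial hash of the prefix starting at offset."""
    h = 0
    for byte in data[offset:offset + FOOTPRINT_LEN]:
        h = (h * 257 + byte) % TABLE_SIZE
    return h


def build_hash_table(base: bytes) -> Tuple[List[int], List[int]]:
    """hashtable[fp] is the first base offset with footprint fp;
    linktable[offset] is the next base offset with the same footprint."""
    hashtable = [NONE] * TABLE_SIZE
    linktable = [NONE] * len(base)
    # Scan backwards so that each chain lists offsets in increasing order.
    for offset in range(len(base) - FOOTPRINT_LEN, -1, -1):
        fp = footprint(base, offset)
        linktable[offset] = hashtable[fp]
        hashtable[fp] = offset
    return hashtable, linktable


def chain(hashtable: List[int], linktable: List[int], fp: int) -> List[int]:
    """Every base offset whose footprint equals fp, following the links."""
    offsets = []
    pos = hashtable[fp]
    while pos != NONE:
        offsets.append(pos)
        pos = linktable[pos]
    return offsets


base = b"abcdXXabcdYYabcd"
hashtable, linktable = build_hash_table(base)
offsets = chain(hashtable, linktable, footprint(base, 0))
```

For this base file the chain for the footprint of "abcd" contains offsets 0, 6, and 12, which play the role of A1, A2, and A3 in the description above. The link list needs one entry per base offset, so memory grows with the size of the base file.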
The algorithm then finds the matching strings in the file. The FindBestMatch function in FIG. 4 hashes the string at the current offset and returns the longest match that contains the string identified by the footprint. The function exhaustively searches through all strings that have matching footprints by fully traversing the link list for the matched hash entry. If the current offset in the version file verFile has footprint A, the function looks up the A-th element in the hash table to find a string with footprint A in the base file. In hashtable[A], we store the offset of the string with a matching footprint. The string at the current offset in the version file is compared with the string at hashtable[A] in the base file. The length of the matching string at these offsets is recorded. The function then moves to linktable[hashtable[A]] to find the next matching string. Each successive string in the link table is compared in turn. The longest matching string with offset copy_start and length copy_length is returned by the function FindBestMatch.
Alternatively, if FindBestMatch finds no matching string, the current offset in the version file (ver_pos) is incremented and the process is repeated. This indicates that the current offset could not be matched in the base version (baseFile) and will therefore be encoded as an add at a later time.
Once the algorithm finds a match for the current offset, the unmatched symbols previous to this match are encoded and output to the delta file, using the EmitAdd function, and the matching strings are output using the EmitCopy function. When all input from verFile has been processed, the algorithm terminates by outputting the end code to the delta file with the EmitEnd function.
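The overall loop can be sketched compactly as below. This is a simplified illustration and not the code of FIG. 4: a dictionary keyed by the prefix bytes themselves stands in for the hash table and link list, the footprint length is assumed, and the directive tuples are hypothetical stand-ins for the output of EmitAdd, EmitCopy, and EmitEnd.

```python
from collections import defaultdict
from typing import Dict, List

FOOTPRINT_LEN = 4  # assumed prefix length


def greedy_delta(base: bytes, version: bytes) -> List[tuple]:
    """Greedy differencing: take the longest match at each version offset."""
    # Index every footprint-length prefix of the base file by its offset.
    index: Dict[bytes, List[int]] = defaultdict(list)
    for off in range(len(base) - FOOTPRINT_LEN + 1):
        index[base[off:off + FOOTPRINT_LEN]].append(off)

    delta: List[tuple] = []
    ver_pos = 0    # current offset in the version file
    add_start = 0  # start of the pending run of unmatched symbols
    while ver_pos < len(version):
        copy_start, copy_length = -1, 0
        key = version[ver_pos:ver_pos + FOOTPRINT_LEN]
        # Exhaustively try every base offset with a matching footprint.
        for cand in index.get(key, ()):
            length = 0
            while (cand + length < len(base)
                   and ver_pos + length < len(version)
                   and base[cand + length] == version[ver_pos + length]):
                length += 1
            if length > copy_length:
                copy_start, copy_length = cand, length
        if copy_length == 0:
            ver_pos += 1  # no match here; this symbol will be added later
            continue
        if add_start < ver_pos:
            delta.append(("ADD", version[add_start:ver_pos]))
        delta.append(("COPY", copy_start, copy_length))
        ver_pos += copy_length
        add_start = ver_pos
    if add_start < len(version):
        delta.append(("ADD", version[add_start:]))
    delta.append(("END",))
    return delta


result = greedy_delta(b"the quick brown fox", b"the slow brown fox")
# [("COPY", 0, 4), ("ADD", b"slow"), ("COPY", 9, 10), ("END",)]
```

The inner loop over every candidate offset is what makes the method exhaustive: each candidate is extended symbol by symbol before the longest one is kept.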
ANALYSIS OF GREEDY METHODS
Common strings may be quickly identified by common footprints, the value of a hash function over a fixed length prefix of a string. The greedy algorithm must examine all matching footprints and extend the matches in order to find the longest matching string. The number of matching footprints between the base and version file can grow with respect to the product of the sizes of the input files, i.e. O(M×N) for files of size M and N, and the algorithm uses time proportional to the number of matching footprints.
In practice, many files elicit this worst case behavior. In both database files and executable files, binary zeros are stuffed into the file for alignment. This “zero stuffing” creates frequently occurring common footprints (discussed in detail below) which must all be examined by the algorithm. When a greedy algorithm finds a footprint in a version file, the greedy algorithm compares this footprint to all matching footprints in the base file. This requires it to maintain a canonical listing of all footprints in one file, generally kept by computing and storing a footprint at all string prefix offsets. See, for instance, Reichenberger (supra). Consequently, the algorithm uses memory proportional to the size of the input, O(N), for a size N file.
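The quadratic growth in matching footprints can be seen by counting them directly for zero-filled inputs. The sketch below is illustrative only: it uses the prefix bytes themselves as the footprint and assumes a prefix length of four.

```python
from collections import Counter


def matching_footprint_pairs(base: bytes, version: bytes, k: int = 4) -> int:
    """Count (base offset, version offset) pairs whose k-byte prefixes match."""
    base_counts = Counter(base[i:i + k] for i in range(len(base) - k + 1))
    return sum(base_counts[version[j:j + k]]
               for j in range(len(version) - k + 1))


# Two files that consist only of zero stuffing.
print(matching_footprint_pairs(bytes(100), bytes(100)))  # 9409  (97 * 97)
print(matching_footprint_pairs(bytes(200), bytes(200)))  # 38809 (197 * 197)
```

Doubling the length of both zero-filled files roughly quadruples the number of matching pairs, in line with the O(M×N) bound stated above.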
THE UNMET NEED FOR GENERALIZATION
While line granularity may seem appropriate for source code, the concept of revision control needs to be generalized to include binary files. This allows binary data, such as edited multimedia, binary software releases, database files, etc., to be revised with the same version control and recoverability guarantees as text. Whereas revision control is currently a programmer's tool, binary revision control systems will enable the publisher, film maker, and graphic artist to realize the benefits of data versioning. It also enables developers to place image data, resource files, databases and binaries under their revision control system. Some existing version control packages have been modified to handle binary files, but in doing so they impose an arbitrary line structure. This results in delta files that achieve little or no compression as compared to storing the versions uncompressed.
An algorithm for binary differencing exists. See Reichenberger (supra).
While this algorithm handles binary inputs, it often requires time quadratic in the size of the input to execute, i.e., time O(M×N) for files of size M and N. As a consequence, the algorithm cannot be scaled to operate on arbitrarily large files and cannot be applied to a wide variety of computer applications.
It is an object of this invention to devise a method and apparatus for forming a compressed differentially encoded image of a version file utilizing a base file.
It is a related object that such method and apparatus form the compressed encoded image within a time span linearly proportional to the size of the version and base files.
The foregoing objects are believed satisfied by a machine implementable method for forming a differentially encoded compressed image of a version file also utilizing the base file. The image is defined over a set of file building operations (ADD, COPY, END), length descriptors, and address pointers. In a so-called “one and one-half pass” rendition, the base file is recursively scanned using a window of m bytes in length shifted in the same direction by k < m bytes per recursion. Significantly, a hash function signature or fingerprint of the window contents is formed for each base file recursion, with the signatures being written into a scanned addressable buffer or the like. Next, the version file is recursively scanned, also using a window of m bytes in length shifted in the same direction by k < m bytes per recursion. A hash function signature or fingerprint of the window contents is formed and immediately compared with the buffer-stored signatures of the base file. In the event of a comparison match of the signatures and verification of contents, the difference file is encoded ad seriatim with a portion of the version file contents. This portion is copied from the later of either the start of the version file or the last comparison match up to the point of the instant comparison match. The difference file is further encoded with a COPY command, length attribute, and pointer to the base file location of the matching contents. The steps are repeated until the version file scan is exhausted.
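One possible reading of this rendition is sketched below. Several details are assumptions made only for illustration and are not specified in the paragraph above: the fixed table size, the use of CRC-32 as the signature, keeping the first base offset seen for each table slot, extending a verified match forward to obtain the copy length, and the tuple form of the directives.

```python
import zlib
from typing import List


def one_and_a_half_pass(base: bytes, version: bytes, m: int = 4,
                        k: int = 1, table_size: int = 1024) -> List[tuple]:
    """Scan the base once to fill a fixed-size signature table, then scan
    the version, emitting ADD and COPY directives at verified matches."""
    # Pass over the base file: one table slot per signature value.
    table = [-1] * table_size
    for off in range(0, len(base) - m + 1, k):
        slot = zlib.crc32(base[off:off + m]) % table_size
        if table[slot] == -1:  # assumed policy: keep the first offset seen
            table[slot] = off

    # Pass over the version file.
    delta: List[tuple] = []
    pos = 0        # start of the current version window
    add_start = 0  # later of the file start or the end of the last match
    while pos + m <= len(version):
        window = version[pos:pos + m]
        cand = table[zlib.crc32(window) % table_size]
        if cand != -1 and base[cand:cand + m] == window:  # verify contents
            length = m
            while (cand + length < len(base)
                   and pos + length < len(version)
                   and base[cand + length] == version[pos + length]):
                length += 1
            if add_start < pos:
                delta.append(("ADD", version[add_start:pos]))
            delta.append(("COPY", cand, length))
            pos += length
            add_start = pos
        else:
            pos += k
    if add_start < len(version):
        delta.append(("ADD", version[add_start:]))
    delta.append(("END",))
    return delta


base = b"the quick brown fox jumps over the lazy dog"
version = b"the quick red fox jumps over the lazy dogs"
delta = one_and_a_half_pass(base, version)
```

Because the signature table has a fixed number of slots, its size does not depend on the size of the inputs, and each version window is compared against at most one candidate in the base file.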
In a so-called “one pass” rendition of the method of this invention, the base file and version file are recursively scanned, respectively, in time overlap relation and asynchronously, using respective windows of m bytes granularity and k < m bytes alignment, starting from predetermined locations in the respective files. During each recursion, a hash function signature of the respective window contents is formed. The signature of the version file is then compared with the signatures of the base file that have been so far formed. Upon a comparison match and verification of contents, a point of synchronization is established. This permits the compressed image of the version file to be encoded ad seriatim as before.
Both renditions of the method and means of this invention further include two refinements: checkpointing, which reduces the number of signatures and increases the speed of comparison and matching; and backward extensibility of string matching, which undoes and then redoes the comparison matching and encoding process to enhance the compression of the differentially encoded image of the version file.
The invention describes a plurality of methods for binary differencing that can be integrated to form algorithms that efficiently compress versioned data. Several algorithms based on these methods are presented. These algorithms can difference any stream of data without a priori knowledge of the format or contents of the input. The algorithms drawn from the invention can difference data at any granularity, including operating at the level of a byte or even a bit. Furthermore, these algorithms perform this task using linear run time and a constant amount of space. The algorithms accept arbitrarily large input files without a degradation in the rate of compression. Finally, these methods can be used to produce a steady and reliable stream of data for real time applications.
The invention is disclosed in several parts. Techniques useful to algorithms that generate binary differences are presented and these techniques are then integrated into algorithms to difference versioned data. It is understood that a person of ordinary skill in the art could assemble these techniques into one of many possible algorithms. The methods described as the invention then outline a family of algorithms for binary differencing using a combination of methods drawn from this invention.
While the invention is primarily disclosed as a method, it will be understood by a person of ordinary skill in the art that an apparatus, such as a conventional data processor, including a CPU, memory, I/O, program storage, a connecting bus, and other appropriate components, could be programmed or otherwise designed to facilitate the practice of the method of the invention. Such a processor would include appropriate program means for executing the method of the invention.
Also, an article of manufacture, such as a pre-recorded disk or other similar computer program product, for use with a data processing system, could include a storage medium and program means recorded thereon for directing the data processing system to facilitate the practice of the method of the invention. It will be understood that such apparatus and articles of manufacture also fall within the spirit and scope of the invention.