To efficiently update a version of a computer file, it is known to transmit a (smaller) file that describes the differences between the old version and the new version. More specifically, the difference file (or patch file) must describe an algorithm that converts the old file into the new file. Generally, known methods for automatically constructing such patch files have used the old file as a dictionary for encoding the new file, so that there are two parts to the patch file building process: finding substrings in the new file that match some substring of the old file, and then encoding these matches and the mismatched region in some efficient way.
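As a minimal illustration of what such a patch file describes, the following sketch applies a list of "copy" and "add" instructions to an old file to produce the new file. The instruction layout and the name `apply_patch` are assumptions made for illustration only; they are not taken from any of the methods discussed here.

```python
from typing import List, Tuple, Union

# Hypothetical instruction forms (illustrative only):
#   ("copy", offset, length)  copy `length` bytes of the old file starting at `offset`
#   ("add", data)             emit literal bytes that have no match in the old file
Op = Union[Tuple[str, int, int], Tuple[str, bytes]]


def apply_patch(old: bytes, ops: List[Op]) -> bytes:
    """Rebuild the new file from the old file and a list of copy/add instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            if offset < 0 or length < 0 or offset + length > len(old):
                raise ValueError("copy range lies outside the old file")
            out += old[offset:offset + length]
        elif op[0] == "add":
            out += op[1]
        else:
            raise ValueError(f"unknown instruction {op[0]!r}")
    return bytes(out)


old = b"the quick brown fox"
patch = [("copy", 0, 4), ("add", b"slow"), ("copy", 9, 10)]
print(apply_patch(old, patch))  # b'the slow brown fox'
```

The patch here carries only four literal bytes; everything else is expressed as references into the old file, which is why the patch can be much smaller than the new file.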
There are three essential features to be considered in the design of such a method: efficiency, effectiveness, and scalability. In this context, efficiency refers to the method's economical use of time, effectiveness refers to the size of the patch files produced by the method, and scalability refers to the economical use of computational resources, particularly random-access memory, as the sizes of the old and new files grow. Generally, for any method, there are tradeoffs among these three design parameters. We will discuss the prior art from these three viewpoints.
First, the problems of patch file construction are similar to those of file compression. In file compression, one normally uses the earlier portion of the file (head) as a dictionary for encoding the later portion of the file (tail). The difference between file compression and patch file construction is the dictionary used: compression draws on the earlier portion of the same file, whereas patch file construction draws on the old file. Thus, it might seem that utilizing some of the well-known file compression techniques, adjusted slightly, might produce good results for patch file construction. The discussion here applies equally to the methods in Welch (U.S. Pat. No. 4,558,302); Waterworth (U.S. Pat. No. 4,701,745); Nagy (U.S. Pat. No. 5,001,478); Katz (U.S. Pat. No. 5,051,745); Seroussi et al. (U.S. Pat. No. 5,389,922); and Mayers et al. (U.S. Pat. No. 5,532,694), as well as the other standard on-line compression methods discussed in Storer, Data Compression, Methods and Theory (Computer Science Press, 1988). There are three problems with this general approach, no matter which of the above compression methods is used:
1. The need for scalability in the compression methods requires some sort of windowing in which the dictionary is limited to “recent” strings (although, in some cases, strings that have been used frequently get moved to a sort of “Hall of Fame” in which they may continue to be used long after their initial appearance). However, in all cases, a string that isn't used at all within a certain window after its initial appearance will never be used subsequently. This causes significant effectiveness problems when these techniques are used in a patch file construction setting without a sufficiently large window. One particular problem occurs when, for example, a database file is re-sorted, maintaining nearly the same data as before, but in a completely different order. If the window is too small, this will cause the patch file construction method to fail to find all relevant matches. Extremely large files (on the order of hundreds of megabytes) tend not to manifest this modification pattern, but moderately large files (tens of megabytes) certainly do.
2. All these methods utilize exact matches in the string-matching portion of the compression process. While this does not cause any effectiveness problem in the compression case, it does cause problems for patch file construction because it fails to take advantage of the pattern of changes present in many updated files, particularly executable files and databases. These files tend to have large blocks that are the same, with the exception of short (one- to four-character) bursts of mismatches, which arise from code that refers to another part of the file that has moved. Much greater effectiveness (experimentally, in some cases, a factor of 4 in patch file size has been observed) can be achieved by using a “sloppy” match method in which some mismatches are allowed in order to increase the length of the matched strings (and thereby allow fewer matches to be encoded). The mismatches must, of course, be encoded separately. This sort of coding is not used in compression methods because there one doesn't typically have this sort of “burst mismatch” phenomenon occurring.
3. All these methods are what are known in the art as “greedy algorithms”: that is, in this context, matching methods that always use the longest match possible at every stage. Methods that are not “greedy” need an additional pass over the data to make decisions about which match to use at each stage once all the facts are in. There is a good reason for this in the compression case: efficiency. In order to make a compression method sufficiently scalable, one uses the windowing technique mentioned above. One could certainly add this optimization to the compression methods, but it seems not to be commonly done in the art.
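The “sloppy” matching idea in problem (2) can be sketched as follows. This is an illustrative sketch only, not the matching procedure of any cited method: the function names, the `min_resume` threshold, and the representation of mismatches as (offset, byte) pairs are all assumptions. The default burst limit of four follows the one- to four-character bursts mentioned above.

```python
def _burst_length(old, new, i, j, length, limit, max_burst, min_resume):
    """Return the smallest burst size (1..max_burst) after which old and new
    agree again for at least min_resume bytes, or 0 if there is none."""
    for burst in range(1, max_burst + 1):
        start = length + burst
        if start + min_resume > limit:
            return 0
        if old[i + start:i + start + min_resume] == new[j + start:j + start + min_resume]:
            return burst
    return 0


def extend_match(old: bytes, new: bytes, i: int, j: int,
                 max_burst: int = 4, min_resume: int = 8):
    """Extend a match beginning at old[i] / new[j], tolerating short bursts of
    mismatches. Returns (length, mismatches), where mismatches is a list of
    (offset relative to the match start, byte value in the new file)."""
    length = 0
    mismatches = []
    limit = min(len(old) - i, len(new) - j)
    while length < limit:
        if old[i + length] == new[j + length]:
            length += 1
            continue
        burst = _burst_length(old, new, i, j, length, limit, max_burst, min_resume)
        if burst == 0:
            break  # mismatch is not a short burst; the match ends here
        for k in range(burst):
            if old[i + length + k] != new[j + length + k]:
                mismatches.append((length + k, new[j + length + k]))
        length += burst
    return length, mismatches


old = b"call 0x1000; load r1; store r2; ret"
new = b"call 0x2040; load r1; store r2; ret"
exact = extend_match(old, new, 0, 0, max_burst=0)   # exact matching only
sloppy = extend_match(old, new, 0, 0)               # bursts of up to 4 allowed
print(exact[0], sloppy[0], sloppy[1])
```

In this example, exact matching stops at the changed address after seven bytes, whereas the sloppy match covers the whole string and records two mismatched bytes to be encoded separately.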
Consideration of problem (2) above might lead one to consider the methods derived in the field of “approximate string matching”. In this field, one basically derives efficient methods for computing the “edit distance” or “Levenshtein distance” between two strings, that is, the minimal number of character insertions, deletions or changes that is required to convert one string to the other. Other measures of similarity are occasionally used, but this is the primary one in the literature. The following discussion applies to all of the following references: Finding approximate patterns in strings by E. Ukkonen in J. Algorithms, vol. 6 (1985); Algorithms for approximate string matching by E. Ukkonen in Inform. and Control, vol. 64 (1985); Minimum detour methods for string or sequence comparison by F. Hadlock in Congr. Numer., vol. 61 (1988); On locally optimal alignments in genetic sequences by N. Blum in Lecture Notes in Comput. Sci., vol. 577 (1992); String editing under a combination of constraints by S. Petrović and J. Golić in Inform. Sci., vol. 74 (1993); Approximate string-matching over suffix trees by E. Ukkonen in Lecture Notes in Comput. Sci., vol. 684 (1993); On suboptimal alignments of biological sequences by D. Naor and D. Brutlag in Lecture Notes in Comput. Sci., vol. 684 (1993); O(k) parallel algorithms for approximate string matching by Y. Jiang and A. Wright in Neural Parallel Sci. Comput., vol. 1 (1993); Longest common subsequences by M. Paterson and V. Dancik in Lecture Notes in Comput. Sci., vol. 841 (1994); Parametric optimization of sequence alignment by D. Gusfield, K. Balasubramanian and D. Naor in Algorithmica, vol. 12 (1994); Approximate string matching with don't care characters by T. Akutsu in Inform. Process. Lett., vol. 55 (1995); and Fast approximate matching of words against a dictionary by H. Bunke in Computing, vol. 55 (1995).
All of these are quite efficient methods for accomplishing what they set out to do: generally this is to find a portion of a first file (the dictionary) that approximately matches a portion of a second file (the text string). The details of the approximate matching vary from method to method. In some cases, they locate the substring of minimal edit distance; in other cases, they locate the collection of substrings having edit distance within a fixed tolerance of the minimum; in still others, they locate various substrings that are minimal for various choices of weighting in a generalized edit distance. However, they all share two problems relating to their use in patch file building.
In approximate matching, a small edit distance does not directly translate into effective patch file coding. Even using the generalized edit distance approach in which insertions, deletions and substitutions have different costs or weights assigned to them, none of these approaches deals adequately with the sorted database scenario (in which very little of the information has changed, but it has been radically re-ordered) or the very similar problem of an executable file in which the link order was changed. Both of these are quite difficult to deal with from the insertion/deletion/substitution point of view, but the “copy or add” point of view used in a non-windowed compression method deals with these quite handily.
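To make the edit distance concrete, and to show why it is a poor fit for re-ordered data, the following sketch computes the Levenshtein distance with the textbook dynamic program. It is a generic illustration and not the procedure of any reference cited above; the name `edit_distance` and the test strings are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimal number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # delete ca
                           cur[j - 1] + 1,          # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3

# Two ten-character records swapped: no data changed, only the order.
old = "A" * 10 + "B" * 10
new = "B" * 10 + "A" * 10
print(edit_distance(old, new))  # 20, as costly as rewriting every character
```

From the “copy or add” point of view, the same swapped pair needs only two copy instructions (one per record), which is the sense in which a small number of copies, rather than a small edit distance, tracks patch file size in the re-sorted case.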
Further, all of these methods have scalability problems. Many of the methods require random-access memory that grows exponentially with the size of the files. None of them can be accomplished, without major modifications, with a fixed amount of random-access memory. These are all theoretical methods that accomplish their goals beautifully, but their design constraints do not necessarily include practicality and scalability.
There is yet another method that is quite familiar to practitioners of the art, usually referred to as “diff” after the Unix command of that name built on this method. This method is discussed in Squibb (U.S. Pat. No. 5,479,654) and is widely known, since the source code to the Unix embodiment of the method is publicly available. The method, developed for use with text files, passes through both files simultaneously. When a mismatch occurs, both files are scanned forward, line by line, in an attempt to re-align. If realignment occurs, the method proceeds to the next mismatch.
This approach, while completely scalable and quite efficient, has some serious difficulties with effectiveness. First, it does not deal adequately with binary (non-text) files at all. These files are not organized into “lines” and thus the blocking method used by diff fails completely. Second, it does not deal adequately with the burst mismatches so common in executable files. As far as diff is concerned, any mismatch causes the entire block to mismatch (assuming that the blocking problem can be solved somehow). This leads to far too few matches for an effective patch file construction. Third, since it never re-examines previous portions of the old file, it deals inadequately with the sorted database scenario mentioned previously. Fourth, the method is still a “greedy” algorithm.
The first of these problems has been dealt with in a program called “bdiff” (available as part of many newer versions of Unix) which does binary file comparison and patch file construction, but the other three problems have never been adequately resolved in this method: experimentally, bdiff (even when the resulting patch files are compressed) produces much larger patch files than, for example, .RTPatch by Pocket Soft, Inc., shown and described in U.S. Pat. No. 6,526,574 and assigned to the same assignee as the present invention. Its scalability, efficiency, and effectiveness may be assessed by experimental means.
There is another class of patch file construction methods known in the art. These may be collectively referred to as “block-and-hash” methods. In these methods, the old file is divided into (usually fixed-size) blocks and a signature (or hash value) is computed for each block. Then, the same procedure is used on the new file and the hash tables are compared. Any blocks having the same hash value are then examined for matches and the usual “copy or add” encoding is applied to build a patch file. This class of methods includes those in Queen (U.S. Pat. No. 4,807,182); Pyne (U.S. Pat. No. 5,446,888); Squibb (U.S. Pat. No. 5,479,654); and Morris (U.S. Pat. No. 5,574,906). Like “diff,” these methods tend to be completely scalable and quite efficient, but not very effective for many purposes. In some cases, an isolated insertion at the beginning of the new file can cause the method to find no matches at all. In order to overcome this serious difficulty, some methods do away with the blocking on the new file and hash at every position. This, however, degrades the efficiency markedly. Still more problematic, however, is the fact that the burst mismatches so common in executable files cause many matches to go undiscovered by these methods. Finally, they are all “greedy” methods, which, as observed above, inherently lose some effectiveness.
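The block-and-hash idea, and its sensitivity to a single inserted byte, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method of any cited patent: the function name `block_matches`, the four-byte block size, the `new_step` parameter, and the use of CRC-32 as the block signature are all choices made for the example.

```python
import zlib


def block_matches(old: bytes, new: bytes, block: int = 4, new_step=None):
    """Return (new_offset, old_offset) pairs where a block of the new file
    equals a block-aligned block of the old file.

    new_step=None hashes the new file at block boundaries only;
    new_step=1 hashes it at every position (slower, but shift-tolerant)."""
    step = block if new_step is None else new_step
    table = {}
    for off in range(0, len(old) - block + 1, block):
        # Keep the first old offset seen for each signature.
        table.setdefault(zlib.crc32(old[off:off + block]), off)
    matches = []
    for off in range(0, len(new) - block + 1, step):
        chunk = new[off:off + block]
        old_off = table.get(zlib.crc32(chunk))
        # Equal signatures are only candidates; confirm byte for byte.
        if old_off is not None and old[old_off:old_off + block] == chunk:
            matches.append((off, old_off))
    return matches


old = b"abcdefghijklmnop"
shifted = b"X" + old                               # one byte inserted at the start
print(block_matches(old, old))                     # every block matches
print(block_matches(old, shifted))                 # [] : alignment is lost
print(block_matches(old, shifted, new_step=1))     # matches recovered at offsets 1, 5, 9, 13
```

Hashing at every position of the new file recovers the matches, at the price of roughly `block` times as many signature computations in this naive form, which reflects the efficiency penalty noted above.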
We have mentioned that “greedy” methods are inherently somewhat lower in effectiveness than “optimized” methods. There is a known solution to the optimal coding problem that does apply to this situation, namely, Wagner's algorithm (see Storer, Data Compression, Methods and Theory, Computer Science Press, 1988). This method provides a scheme for optimizing the “copy or add” encoding of a file in terms of any fixed dictionary. To apply Wagner's algorithm to this problem (in conjunction with the approximate matching method mentioned above, to help with the burst mismatches), it would be necessary to locate every approximate match between old and new files and keep track of the number of isolated mismatches within each approximate match and their positions relative to the beginning of the approximate match. With all this information in hand, Wagner's algorithm gives a “most effective possible” solution to the encoding problem. However, there are serious scalability problems here (storage of all the approximate matches as well as each mismatch within each approximate match) as well as efficiency problems (in gathering the data, not in applying Wagner's algorithm, which is itself quite efficient).
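The gap between “greedy” and “optimized” encoding can be illustrated with a small dynamic program over the new file. This sketch is not Wagner's algorithm as presented in Storer; it is a generic shortest-path formulation using exact matches only, a brute-force match finder, and an assumed cost model (a fixed cost per copy instruction and a fixed cost per added byte). All names and cost values are hypothetical.

```python
def longest_match(old: bytes, new: bytes, i: int) -> int:
    """Length of the longest prefix of new[i:] occurring anywhere in old
    (brute force, for illustration only)."""
    length = 0
    while i + length < len(new) and new[i:i + length + 1] in old:
        length += 1
    return length


def greedy_cost(old: bytes, new: bytes, copy_cost: int = 3, add_cost: int = 1) -> int:
    """Cost of always taking the longest available match at each position."""
    i = cost = 0
    while i < len(new):
        length = longest_match(old, new, i)
        if length > 0:
            cost += copy_cost
            i += length
        else:
            cost += add_cost
            i += 1
    return cost


def optimal_cost(old: bytes, new: bytes, copy_cost: int = 3, add_cost: int = 1) -> int:
    """Minimal total cost over all copy/add encodings of new against old."""
    n = len(new)
    best = [0] * (n + 1)                    # best[i] = cheapest encoding of new[i:]
    for i in range(n - 1, -1, -1):
        best[i] = add_cost + best[i + 1]    # option: add one literal byte
        for length in range(1, longest_match(old, new, i) + 1):
            # option: copy any prefix of the longest match at i
            best[i] = min(best[i], copy_cost + best[i + length])
    return best[0]


old = b"ab#bcde"
new = b"abcde"
print(greedy_cost(old, new), optimal_cost(old, new))  # 6 4
```

Here the greedy encoder copies "ab" and then "cde" (two copies), while the optimized encoder adds "a" and copies "bcde" in one instruction. The dynamic program needs the set of matches at every position before it can decide, which is the data-gathering burden described above.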
Another known method is that described in Jones (U.S. Pat. No. 6,526,574). This method is efficient, effective, and scalable in terms of mass storage and random-access memory resources, but does not scale well in terms of time. Specifically, its running time is quadratic in the size of the new file, so that it is impractical to use with extremely large files (hundreds of megabytes).
We conclude our discussion of the prior art with experimental results on .RTPatch, ver. 3.20 by Pocket Soft, Inc., mentioned above as being experimentally superior in effectiveness to the combination of bdiff and compression. This commercial product is fairly effective and efficient on small files, but has serious scalability problems. In order to work efficiently, it needs twice as much random-access memory as the file size and in fact fails completely on files above 20–30 million characters in size.
Accordingly, several objects and advantages of our invention are that it provides a method that is scalable (all files are accessed sequentially, so they can be located offline; essentially no temporary mass storage resources are needed; the random-access memory requirement is constant), effective (patch file size is, in empirical tests, significantly smaller than that of the prior art tested: bdiff, version 3.20 of .RTPatch, and version 4.00 of .RTPatch, which is based on Jones (U.S. Pat. No. 6,526,574)), and efficient (running time is linear in the size of the files and substantially shorter than that of existing commercial products of comparable effectiveness).
This method is also tunable in the sense that the operator can perform an effectiveness/efficiency tradeoff. The method is also amenable to embodiments employing parallel processing to further reduce the time required by the method.
Still further objects and advantages will become apparent from a consideration of the ensuing description and accompanying drawings.