Referring now to FIG. 1, an overview of differential compression is shown. Differential compression or delta compression is a way of compressing data with a great deal of similarities. Differential compression produces a delta encoding, a way of representing a version file in terms of an original file plus new information. Thus differential compression algorithms try to efficiently find common data between a reference and a version file to reduce the amount of new information which must be used. By storing the reference file and this delta encoding, the version file can be reconstructed when needed. Alternately, the reference file can be represented in terms of a version file, or each file can be represented as a set of changes of the subsequent version.
Early differential compression algorithms ran in quadratic time or made assumptions about the structure and alignment of the data to improve running time. However, many of the practical applications need differential compression algorithms that are scalable to large inputs and that make no assumptions about the structure or alignment of the data. The main uses of differential compression algorithms are in software distribution systems, web transportation infrastructure, and version control systems, including backups. In systems where there are multiple backup files, a great deal of storage space can be saved by representing each backup file as a set of changes of the original.
The advantages of differential compression are clear in terms of disk space and network transmission. Furthermore, when transmitting information, transmission costs are reduced since only changes need to be sent rather than the entire version file. In client/server backup and restore systems, network traffic is reduced between clients and the server by exchanging delta encodings rather than exchanging whole files[1]. Furthermore, the amount of storage required in a backup server is less than if one were to store the entire file. Software updates can be performed this way as well, which provides the added benefit of providing a level of piracy protection since the reference version is needed to reconstruct the version file. Thus software updates can be performed over the web, since the delta will be of very little use to those who do not have the original.
The differencing problem can be considered in terms of a single version file along with a reference file, or a number of version files along with a single reference file. This application describes the case of a single version file along with a single reference file for simplicity, but the invention can readily be extended to the case of a number of version files along with a single reference file.
Differential compression arose as part of the string to string correction problem[2], which essentially means finding the minimum cost of representing one string in terms of the other. The string to string correction problem goes hand in hand with the longest common subsequence (LCS) problem. Miller and Myers[3] presented a scheme based on dynamic programming and Reichenberger[4] presented a greedy algorithm to optimally solve the string to string correction problem. Both these algorithms ran in quadratic time and/or linear space, which proved to be undesirable for very large files.
The diff utility in UNIX is one of the most widely used differencing tools. However, it is undesirable for most differential compression tasks because it does not find many matching substrings between the reference and version files. Since it does line by line matching, it operates at a very high granularity, and when data becomes misaligned (due to the addition of a few characters as opposed to a few lines), diff fails to find the matching strings following the inserted characters.
There has been much work in the past few years in the area of the copy/insert class of delta algorithms. The present invention falls into this class of techniques, which use a string matching method to find matching substrings between the reference file and version file and then encode these substrings as copy commands, and which encode the new information as an insert command. vcdiff [5], xdelta[6], and the work of Ajtai et al.[7] also fall under the copy/insert class of delta algorithms. The benefit of using this type of method is that if a particular substring in the reference file has been copied to numerous locations in the version file, each copied substring can be encoded as a copy command. Also, if a contiguous part of the reference file substring has been deleted in the version file, the remaining part of the old substring can be represented as two copy commands. There is a benefit to representing the version file as copy commands versus insert commands: copy commands are compact in that they require only the reference file offset and the length of the match. Insert commands require the length as well as the entire new string to be added. Thus for a compact encoding, it is desirable to find the longest matches at every offset of the version file to reduce the number of insert commands and to increase the length of the copy command.
The present invention is compared empirically to those of vcdiff [5], xdelta[6], and the work of Ajtai et al.[7], which seem to be the most competitive differential compression algorithms currently available, in terms of both time and compression. Vcdiff runs in linear-time, while xdelta runs in linear-time and linear-space, although the constant of the linear-space is quite small. Ajtai et al.[7] presented a family of differential compression algorithms where one of their algorithms runs in provably constant-space and (shown empirically) runs in linear-time and offers good compression. However, in some situations, it produces suboptimal compression. They presented three more algorithms that cannot be proven to be linear-time, but produce better compression results.
Ajtai et al. described a greedy algorithm based on the work of Reichenberger[4]. They have used the greedy algorithm for comparison to their schemes and prove that this greedy algorithm provides an optimal encoding in a very simplified model. An optimal solution to the differencing problem depends on the model chosen to quantify the cost of representing the difference file. For a general cost measure, it can be solved in polynomial time by dynamic programming. The greedy algorithm operates by computing hash values on the reference file at every possible offset and storing all such values. Hash values are computed on blocks, where the block size is a predefined parameter. Hash collisions are dealt with by chaining the entries at each value. Next, the version file is continually processed. At each offset of the version file, the greedy algorithm checks all matching hash values in the reference file to see which is the longest match at that location. If there is a match, the longest match is encoded and the version file pointer is updated accordingly. The entire version file is processed in this fashion. The greedy algorithm runs in quadratic-time and linear-space in the worst case. However in highly correlated data, the greedy algorithm shows almost linear behavior[7]. Moreover, it can be shown that the greedy algorithm gives the optimal encoding when the block size is less than or equal to 2. If the block size is greater than 2, the greedy algorithm does not find the optimal delta encoding since it does not find matching substrings of length equal to 2 (or any length less than the block size).
The vcdiff algorithm[5], developed by Tichy, Korn and Vo, combines the string matching technique of the Lempel-Ziv '77[8] and block-move technique of Tichy[9]. The vcdiff algorithm uses the reference file as part of the dictionary to compress the version file. Both Lempel-Ziv algorithm and the block-move algorithm of Tichy run in linear time. vcdiff allows for string matching to be done both within the version file and between the reference and the version file. In vcdiff, a hash table is constructed for fast string matching.
In his masters thesis, Joshua MacDonald describes the xdelta algorithm and the version control system designed to utilize the xdelta algorithms[10, 6]. The xdelta algorithm is a linear-time, linear-space algorithm and was designed to be as fast as possible, despite sub-optimal compressions[10]. It is an approximation of the quadratic-time linear-space greedy algorithms[1]. The xdelta algorithm works by hashing blocks of the reference file and keeping the entire hash table in memory. When there is a hash collision, the existing hash value and location are kept and the current hash value and location are thrown out. The justification for this is to favor earlier, potentially longer matches. Both hash values and locations cannot be kept if the algorithm is to run in linear time, since searching for matches among the duplicate hash values would cause the algorithm to deviate from linear time. At the end of processing the reference file, the fingerprint table is considered populated. The version file is processed by computing the first hash value on a fixed size block. If there is a match in the hash table, the validity of the match is checked by exact string matching and the match is extended as far as possible. Then the version file pointer is updated to the location immediately following the match. If there is no match, then the pointer is incremented by one. The process repeats until the end of the version file is reached. The linear-space of xdelta appears not to be detrimental in practice since the constant is quite small[10]. However, the xdelta algorithm does not find the best possible matches between the reference file and the version file since hash collisions result in new information being lost. Each subsequent hash to the same location is lost and the previous information remains in the hash table.
Ajtai et al.[7] presented four differential compression algorithms, namely the one-pass differencing algorithm, the correcting one-pass algorithm, the correcting 1.5-pass algorithm, and the correcting 1.5-pass algorithm with checkpointing. Their one-pass differencing algorithm was proven to be linear in the worst-case but produces suboptimal compression, since it neither detects transpositions in data nor finds optimal matches at a given location of the version file. Their one-pass algorithm continually processes both the original file and version file sequentially, finding matches by hashing blocks and comparing the blocks. Thus their algorithm will encode the first match it sees and then clear the hash tables. Hence all encoded matches must be in the same sequential order between the reference file and the version file to be detected.
In order to fix these shortcomings, they devised two new methods, corrections and checkpointing. Their method of corrections is a way to improve the match, yet it does not guarantee that they pick the optimal match. It involves implementing a circular queue to store previous hash values of the reference file. Thus it can also cause the existing one pass algorithm to deviate from the worst-case linear-time complexity. The 1.5-pass algorithm works by first hashing footprints in the reference file, but when there is a hash collision it stores only the first offset encountered. Next, the version file is continually processed, by hashing on blocks from the version file. If there is a valid match, then the match is encoded, the version file pointer is incremented and the process continues. The method of checkpointing is used when all possible hash values cannot fit into memory and a subset of these hash values must be selected as an accurate representation of the file. Thus checkpointing implemented along with corrections allows the existing algorithm to improve upon matching substrings already found. This modified algorithm can find longer matches. The work of Ajtai et al. is currently used in IBM's Tivoli Storage Manager product.
There have also been differential compression methods based solely on suffix trees. Weiner[11] proposed a greedy algorithm based on suffix trees that solves the delta encoding problem using linear-time and linear-space. Unlike xdelta, the constant factor of this algorithm is quite large and prevents it from being used on very large files.
More recently, Shapira and Storer presented a differential compression algorithm that works in-place and uses the sliding window method[13]. They showed empirically that the sliding window method effectively tackles the differential compression problem. Their algorithm uses O(max(n, m))+O(1) space where m is the size of the reference file and n is the size of the version file. The O(1) factor comes from the amount of space needed to store the program and a fixed amount of loop variables, etc. They also showed empirically that the limited amount of memory does not hurt the compression performance of their algorithm.