The present invention relates generally to updating files and streams using a patch, and to compression of files and streams.
Software application programs commonly are subject to frequent revision and modification, such as for bug fixes and to add new features. It is inconvenient to users and expensive to software venders, however, to distribute and install complete new copies of software application programs each time there is a revision. A more effective alternative is to generate a patch file that details various specific deletions and insertions that will update a previous version copy of the software application program to produce a new version copy. Such a patch file generally is much smaller in size than the complete copy of the new version software application program, and can be more easily and less expensively distributed to and installed on the user""s computers, such as via a download from the Internet and use of an automated patch application program.
Data compression also is commonly used to facilitate distribution of files where there is no prior version copy available (e.g., distribution of first version software or data files, or an updated version to new users). One broad classification of data compressors are called xe2x80x9cdictionary encoders.xe2x80x9d (See, e.g., Bell et al., Text Compression 206-243 (1990).) Dictionary encoders encode an input data stream (e.g., a file that is input into the dictionary encoder to be compressed) with respect to some reference data (referred to as the xe2x80x9cdictionaryxe2x80x9d), by matching substrings of the input data stream to strings in the dictionary and encoding the input data stream substrings as references to the matching strings in the dictionary. In one elementary implementation of the dictionary encoder, the dictionary is a compilation in table form of the most frequently occurring strings in the input data stream. The encoded references are simply the index of the matching string in the table. Although the dictionary must typically be included with the resulting compressed data to allow decompression, compression is achieved because the encoded index of a dictionary string is much smaller than its raw or plain text representation.
A well known subclass of dictionary encoder, generally termed LZ encoders (from originating work done by Lempel and Ziv), compresses an input data stream with respect to a preceding portion of the input data stream known as the xe2x80x9chistory.xe2x80x9d The data stream is thus effectively its own xe2x80x9cdictionary.xe2x80x9d In general, an exemplary LZ encoder searches the history for a string that matches each next portion of the input data stream. If such a match is found, the LZ encoder encodes the matched next portion of the input data stream using a match code that is a reference (e.g., composed of an offset and length) to the matching string in the history. Otherwise, the LZ encoder encodes a next character of the input data stream as a raw data code or xe2x80x9cliteralxe2x80x9d that designates the character as plain text. The just encoded portion of the input data stream is then added to the history, and will be included in the search to match the next portion of the input data stream. Often, the history is stored in a fixed size, sliding window type history store, from which the oldest data exits as new data from the input data stream is added (thus, the data effectively slides through the window). Accordingly, with these prior LZ encoders, an input data stream is encoded with respect to preceding data in that same input data stream. These LZ encoders achieve compression of the input data stream because the match codes can be much smaller than the substrings that they represent.
In Sliger et al., xe2x80x9cMethod And System For Updating Software With Smaller Patch Files,xe2x80x9d U.S. patent application Ser. No. 09/093,591, filed Jun. 8, 1998 (the disclosure of which is incorporated herein by reference) (hereafter referred to as the xe2x80x9cPatch Patent Applicationxe2x80x9d), a novel patch system is described that uses a data compressor to generate a patch file for updating a prior version file to a new version file. In one implementation, the history store (which may be the sliding window type) of an LZ encoder is pre-loaded with a prior version of a file. The LZ encoder then commences to compress a new version of the file with respect to the prior version of the file held in the history store. In most cases, the new version of the file will contain much data that is carried over from the prior version. The new version therefore will usually contain many substrings that match substrings of the prior version held in the history store, resulting in highly compressed encoded data. The encoded data is used as a patch file that can be distributed to other computers on which the prior version of the file resides. On such other computers, the new version of the file is reproduced from the patch file using an LZ decoder whose history store also is preloaded with the resident prior version of the file.
A problem common to the above described systems is that the efficacy of both patch generation and compression depends on the degree of similarity between the new and old version files (or between the input data stream and history store). For example, one factor that tends to increase the size of patch files produced by the patch system in the Patch Patent Application is code movement between versions of an executable file. Code movement signifies that some sections or blocks of code within the executable file are moved relative to their locations within the previous version file. Often, there are many instructions (e.g., branch, jump, and function call instructions) in the code of an executable file that reference locations in other code blocks of the executable file as either an offset or absolute location within the file. Such inter-code block references will change if the referenced code block is moved in a new version of the executable file with respect to its location in a prior version. Since program files are a concatenation of many blocks of code, code movement occurs due to the intentional insertion or deletions of data anywhere in the file, and the corresponding changes to location references affect many blocks of code which are otherwise unchanged. When the new version is encoded by the LZ encoder with respect to the previous version, these changed inter-code block references can interrupt matches from code blocks in the new version to otherwise identical code blocks in the prior version of the executable file. As a result, the code block is encoded as several match codes interspersed with literals that encode the changed inter-code block references, instead of as a single match code.
With reference to FIG. 2 for example, a first code block (A) in an old version of an executable file contains two instructions that reference locations in a second code block (B) by offsets O and P. However, the code block B in a new version of the executable file is shifted relative to its location in the old version, such as may be caused from insertion of new data in between the code blocks A and B. Due to the movement of code block B, the two instructions in code block A of the new version file specify changed offsets Oxe2x80x2 and Pxe2x80x2 into the code block B. If the old version file is then loaded into the history store of an LZ encoder as described in the Patch Patent Application, the LZ encoder cannot encode the code block A in the new version as a single match code designating the code block A of the old version held in the history store because the offsets differ between the versions effectively interrupting a continuous match. Instead, the LZ encoder at best (assuming no other changes to code block A) encodes the new version""s code block A as three match codes (designating corresponding portions of the old version""s code block A that are identical) separated by literals that specify the changed offsets Oxe2x80x2 and Pxe2x80x2 in plain text. By causing the offsets in code block A to change, the movement of code block B thus substantially enlarges the size of the patch produced by the above-described process.
The present invention addresses the code movement problem and like obstacles to effective patch generation and compression, by preprocessing a known data stream to be used in patch generation or compression so as to more closely resemble the input data stream. A patch generator or data compressor can produce a smaller patch or compressed data stream since more and/or longer matches between substrings of the input data stream and the preprocessed known data can be found. The known data is identically preprocessed before patch application or decompression to exactly recover the input data stream.
In some implementations, the preprocessing can take the form of predetermined transformations based on data descriptive of relationships between the input data stream and the known data. This descriptive data can include, for example, data describing code movements in an executable file. The transformations based on such descriptive data may include fixing up call instruction references (such as may be identified from import tables, export tables, or relocation tables that are included in the executable file according to many typical executable file formats, or by simply scanning the input data for instruction codes or data which appear to signify position-sensitive values) to account for the code movements indicated by the descriptive data. The descriptive data is stored or sent with the compressed data stream and the same transformations applied deterministically based on the descriptive data at both compression (or patch generation) and decompression (or patch application). The preprocessed executable file likely will not function correctly, as many of its internal references have been mangled, but it will serve as a closer approximation of the new file.
Desirably, the preprocessing based on the descriptive data effects a reduction in size of the patch or compressed data stream. In particular, the combined size of the descriptive data plus the patch or compressed data stream should be smaller than the size of the patch or compressed data stream without preprocessing. Some elements of the descriptive data, however, may not lead to any further reduction in size of the patch or compressed data stream, and may even increase their size. For example, descriptive data of code block movement that does not affect any call instruction references may not produce any further reduction in size of the patch or compressed data stream from the preprocessing. According to another aspect of the invention, such ineffective elements of the descriptive data are discarded. The ineffective elements can be identified, for example, through an iterative process of patch generation or compression after preprocessing with and without individual of the descriptive data elements. Also, the descriptive data is pruned to eliminate duplicative elements.
According to a further aspect of the invention, the preprocessing to be applied to the known data can be determined from iteratively, (1) examining the compressed data produced by a data compressor to identify code movements or other relationships between the input data stream and the known data, (2) producing instructions or descriptive data according to the identified code movements or relationships, (3) preprocessing the known data per the instructions or descriptive data, (4) compressing the input data stream with respect to the preprocessed known data, and (5) comparing the compression achieved to that previously achieved. If the instructions or descriptive data generated in each iteration resulted in improved compression, then the instructions or descriptive data is kept. Otherwise, the instructions or descriptive data are removed as being non-productive.
In at least some implementations of the present invention, the descriptive data utilized in preprocessing can be based on estimates, guesses or other such error-prone information. This is because the patch generation or compression after preprocessing will compensate by encoding any differences of the input data stream from the preprocessed data, whether the differences reflect error introduced during preprocessing or modifications from updating to a new version. Accordingly when combined with patch generation or compression, all modifications made during the preprocessing are effectively self-correcting.
Additional features and advantages of the invention will be made apparent from the following detailed description of an illustrated embodiment which proceeds with reference to the accompanying drawings.