Nowadays it is often necessary to modify the software or operating data used for executing an application or for operating equipment, such as mobile radio devices or automatic toll detection devices carried on vehicles, for example in order to eliminate errors or to implement improved or additional functions (“software upgrade” or “software update”).
In particular, if the current version of the software or operating data is to be transmitted via an interface with relatively slow and/or expensive data transmission (for example a radio interface in mobile radio), it may be expedient not to transmit the complete data set. Instead, only the “difference” between the older version and the newer version is transmitted, and the newer version is subsequently generated from the older version, which is already present, by means of the transmitted difference.
To this end, the production of a delta-file by a delta-file generator of a first computer (server) is known, for example, from U.S. Pat. No. 6,401,239. The delta-file, which encodes the difference between the first version and the second version of a data file, is transmitted to a second computer (client) on which the first version of the data file is stored. There, a restorer generates the second version of the data file from the first version and the delta-file.
A key problem in this regard is the production of a delta-file of the smallest possible size, in order to minimize thereby the necessary transmission time for transmitting the delta-file and/or the resulting transmission costs.
The production of a delta-file is disclosed, for example, in Walter F. Tichy, “The String-to-String Correction Problem with Block Moves”, ACM Transactions on Computer Systems, Vol. 2, No. 4, November 1984, pages 309-321.
To explain the method disclosed in that publication, reference is made to FIG. 1. Accordingly, two different data files, one of which (for example a newer version of a data file) is to be generated from the other (for example an older version of the data file) by means of a delta-file, are acquired as two data strings, denoted S0 and S1. In order to produce a delta-file, a set of non-overlapping, maximum (exact) “matches” is initially sought, i.e. elemental data strings with identical characters and maximum length which occur both in S0 and in S1, although at different positions of the respective data string.
The result is a sequence of matches and gaps, which is shown by way of example in FIG. 1 for S1 (at the top of FIG. 1) and for S0 (at the bottom of FIG. 1). Two exactly matching elemental data strings are identified by the two arrows illustrated as an example in FIG. 1.
Subsequently, based on the exactly matching elemental data strings found, a delta-file is generated which contains a series of data processing operations (or commands) to be carried out successively on the first data file S0, in this case COPY- and ADD-operations. Each COPY-operation results from an exactly matching data region, and each ADD-operation results from a gap located between the exactly matching data regions.
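By way of illustration only, the generation of such a series of COPY- and ADD-operations may be sketched as follows. The sketch assumes a naive greedy search that, at each position of S1, takes the longest substring of S0 matching at that position; the names longest_match and make_delta, the tuple layout of the commands and the threshold min_match are hypothetical and are not taken from the cited publication, which relies on considerably more efficient data structures.

```python
from typing import List, Tuple


def longest_match(s0: bytes, s1: bytes, pos: int) -> Tuple[int, int]:
    """Return (length, offset_in_s0) of the longest substring of s0
    that equals the data of s1 starting at position pos (naive search)."""
    best_len, best_off = 0, 0
    for off in range(len(s0)):
        n = 0
        while (off + n < len(s0) and pos + n < len(s1)
               and s0[off + n] == s1[pos + n]):
            n += 1
        if n > best_len:
            best_len, best_off = n, off
    return best_len, best_off


def make_delta(s0: bytes, s1: bytes, min_match: int = 4) -> List[tuple]:
    """Build a list of ("COPY", length, offset_s0, offset_s1) and
    ("ADD", length, data) commands that describe s1 in terms of s0.
    min_match is an assumed threshold below which a match is treated as gap."""
    delta: List[tuple] = []
    pos = 0        # current position in s1
    gap_start = 0  # start of the gap not yet covered by a match
    while pos < len(s1):
        length, off = longest_match(s0, s1, pos)
        if length >= min_match:
            if gap_start < pos:
                # Data between two matches must be carried in the delta itself.
                delta.append(("ADD", pos - gap_start, s1[gap_start:pos]))
            delta.append(("COPY", length, off, pos))
            pos += length
            gap_start = pos
        else:
            pos += 1
    if gap_start < len(s1):
        delta.append(("ADD", len(s1) - gap_start, s1[gap_start:]))
    return delta
```

For S0 = b"the quick brown fox" and S1 = b"brown fox, the quick", this sketch yields two COPY-operations of length 9 separated by one ADD-operation carrying the two bytes ", ".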
A COPY-command, which in the example shown typically has the syntax “COPY <length> <offset in S0> <offset in S1>”, copies an exactly matching data region, which is located in the data string S0 at the position <offset in S0> and has the length <length>, to S1, namely to the position <offset in S1>. An ADD-command, on the other hand, which typically has the syntax “ADD <length> <data of S1>”, inserts data of the length <length> into the gaps between the exactly matching data regions in S1; these data have to be contained in the delta-file itself. In this manner, the second data file S1 may be generated from the first data file by “processing” the sequence of COPY- and ADD-operations in succession.
In the method shown, the size of the delta-file is determined by the size of the exactly matching data regions, larger matches reducing the volume of data to be transported in the delta-file. Nowadays, the overlap having maximum matches may be calculated very efficiently by modern data structures, such as are known for example from bioinformatics for processing very long DNA strands. The delta-file produced is typically compressed before transmission and decompressed before the update and/or upgrade process.
The method set forth has the drawback that a very large number of data processing operations, which are to be processed successively, may have to be contained in the delta-file in order to generate a second data file from a first data file.
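The processing of such a command sequence on the receiving side may be sketched as follows. This is a minimal illustration under stated assumptions: the commands are represented as tuples mirroring the syntax given above, they are assumed to be ordered by their position in S1, and the function name apply_delta is hypothetical.

```python
from typing import Iterable


def apply_delta(s0: bytes, delta: Iterable[tuple]) -> bytes:
    """Rebuild s1 from s0 by processing COPY and ADD commands in order.

    Assumed command layout (illustrative only):
      ("COPY", length, offset_in_s0, offset_in_s1)
      ("ADD", length, data)
    """
    out = bytearray()
    for cmd in delta:
        if cmd[0] == "COPY":
            _, length, off0, off1 = cmd
            if off1 != len(out):
                raise ValueError("COPY target offset does not match output position")
            if off0 + length > len(s0):
                raise ValueError("COPY source region lies outside s0")
            # Exactly matching region: taken from the old file, not from the delta.
            out += s0[off0:off0 + length]
        elif cmd[0] == "ADD":
            _, length, data = cmd
            if len(data) != length:
                raise ValueError("ADD length does not match supplied data")
            # Gap between matches: the data travels inside the delta.
            out += data
        else:
            raise ValueError(f"unknown command {cmd[0]!r}")
    return bytes(out)
```

Only the data carried by the ADD-commands contributes payload to the delta-file; the COPY-commands refer to data that is already present in S0.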
This drawback may be eliminated if, instead of exactly matching elemental data regions in the two data files, so-called “pseudo-matches” are used, i.e. data regions which are only approximately the same and thus differ slightly from one another.
In this case, the difference between the slightly different elemental data regions is stored byte-wise in the delta-file for every pseudo-match. The difference-bytes resulting from the exactly matching data of a pseudo-match are zero, whilst the difference-bytes resulting from the data of a pseudo-match which differ from one another are generally different from zero. The difference for a pseudo-match thus consists substantially of zeros (zero bytes) and is highly compressible.
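The compressibility of such a byte-wise difference may be illustrated by the following sketch. It assumes that the difference is formed as a subtraction modulo 256 and that zlib serves as the compressor; neither is specified above, and the names byte_difference and demo_regions are hypothetical.

```python
import random
import zlib
from typing import Tuple


def byte_difference(old_region: bytes, new_region: bytes) -> bytes:
    """Byte-wise difference of two equally long regions.

    Assumption: the difference is taken modulo 256, so identical bytes
    yield a zero byte and differing bytes yield a non-zero byte."""
    if len(old_region) != len(new_region):
        raise ValueError("regions of a pseudo-match must have equal length")
    return bytes((n - o) % 256 for o, n in zip(old_region, new_region))


def demo_regions(size: int = 4096, seed: int = 0) -> Tuple[bytes, bytes]:
    """Create an incompressible old region and a new region that
    differs from it in only three bytes (illustrative test data)."""
    rng = random.Random(seed)
    old = bytes(rng.getrandbits(8) for _ in range(size))
    new = bytearray(old)
    for pos in (10, 2000, 4000):
        new[pos] = (new[pos] + 1) % 256
    return old, bytes(new)


if __name__ == "__main__":
    old, new = demo_regions()
    diff = byte_difference(old, new)
    print("non-zero difference bytes:", sum(1 for b in diff if b))
    print("compressed size of new region:", len(zlib.compress(new)))
    print("compressed size of difference:", len(zlib.compress(diff)))
```

In this constructed example the new region itself is practically incompressible, whereas its difference to the old region consists almost entirely of zero bytes and compresses to a small fraction of the region size.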
Based on the pseudo-matching elemental data strings ascertained, a delta-file is subsequently generated which defines a series of data processing operations or commands to be carried out successively on the first data file, namely DIFF- and ADD-operations, a DIFF-operation respectively resulting from the approximately matching data regions and an ADD-command respectively resulting from the gaps located therebetween.
A DIFF-command adds its bytes byte-wise to the corresponding values of the first data file. Where the difference contains a zero byte, the data of the first data file is thereby copied unchanged from the corresponding position of the first data file to the corresponding position of the second data file. Where the difference contains a non-zero byte, the sum of the byte of the first data file and the byte from the DIFF-command is written to the corresponding position of the second data file.
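The effect of a DIFF-command may be sketched as follows. The sketch assumes that the addition is carried out modulo 256 and that the command identifies the region of the first data file by an offset; the function name apply_diff and its parameters are hypothetical.

```python
def apply_diff(s0: bytes, offset_s0: int, diff: bytes) -> bytes:
    """Produce the region of the second data file described by one DIFF command.

    Each difference byte is added (assumed: modulo 256) to the corresponding
    byte of the first data file. A zero byte therefore copies the old byte
    unchanged, and a non-zero byte yields the modified new byte."""
    region = s0[offset_s0:offset_s0 + len(diff)]
    if len(region) != len(diff):
        raise ValueError("DIFF region lies outside the first data file")
    return bytes((old + d) % 256 for old, d in zip(region, diff))


if __name__ == "__main__":
    s0 = b"version 1.0 build 41"
    diff = bytearray(len(s0))  # all zero bytes: copy everything unchanged ...
    diff[8] = 1                # ... except '1' -> '2' in the version number
    diff[19] = 1               # ... and '1' -> '2' in the build number
    print(apply_diff(s0, 0, bytes(diff)))  # b'version 2.0 build 42'
```

A single DIFF-command thus covers a region that would otherwise have to be split into several COPY- and ADD-commands around each changed byte.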
In contrast to delta-files which are based on exactly matching data regions (matches), by using only approximately matching data regions (pseudo-matches) the number of pseudo-matches and thus the number of data processing operations to be transmitted in the delta-file may be markedly reduced.
The use of pseudo-matches may be derived from the program code of the freely available software tool “bsdiff”, for which, however, nothing further is documented.
In principle, with the use of pseudo-matches a trade-off has to be made between a few long pseudo-matches with many non-zero bytes and many short pseudo-matches with few non-zero bytes. Whilst the first-mentioned case is worse for data compression, it is advantageous with regard to the lower number of data processing commands. On the other hand, the last-mentioned case is better for data compression, whilst it is disadvantageous with regard to the greater number of data processing commands.
Accordingly, the production of delta-files based on pseudo-matches poses an optimization problem, which concerns the selection of the matches to be used and the combining thereof to form pseudo-matches. In the above-mentioned software tool bsdiff, a heuristic is used to this end which is barely decipherable.