1. Field of the Invention
The present invention generally relates to systems and methods for representing the differences between collections of data stored on computer media. More particularly, the present invention relates to systems and methods for transmitting updates to such data using a representation of the differences between the updated version and a previous version or versions.
2. Description of the Related Art
The need to distribute large quantities of electronic information, typically via computer networks, arises in many applications involving geographically distributed computer users. In many such cases the information distributed must be maintained in an up-to-date state at the destination(s). An important goal in the distribution of these updates is to reduce the amount of data which must be sent in order to make the update.
In many cases, the data size of the updates is reduced by means of some form of ‘differencing’. In such methods, the sending computer system calculates the differences between the version of the data that the receiving computer system already has and the updated version to be distributed. A representation of these differences is then transmitted to the receiving computer system, which uses it, together with the previous version it already holds, to construct the updated version of the data.
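By way of illustration only, the differencing exchange described above may be sketched as follows. The sketch assumes a difference representation consisting of "copy" and "insert" operations and uses Python's standard `difflib.SequenceMatcher` to find matching regions. The function names `make_delta` and `apply_delta` are hypothetical, and the sketch does not represent the UNIX ‘diff’ utility or any other particular known method.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes) -> list:
    """Sender side: describe `new` as copies from `old` plus literal inserts.

    Each operation is ("copy", start, end), referring to a slice of `old`,
    or ("insert", data), carrying bytes that `old` does not supply.
    """
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))
        # "delete" needs no operation: the old bytes are simply not copied.
    return ops


def apply_delta(old: bytes, ops: list) -> bytes:
    """Receiver side: rebuild the updated version from `old` and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


# Illustrative data (assumed, not taken from any real update).
old_version = b"the quick brown fox"
new_version = b"the quick red fox jumps"

delta = make_delta(old_version, new_version)
rebuilt = apply_delta(old_version, delta)

# Only the literal bytes carried by "insert" operations need to be sent in full.
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
print(rebuilt == new_version, literal_bytes, len(new_version))
```

In this sketch only the "insert" payloads and the small copy descriptors need to be transmitted. The representation stays small when the changes are localized, because long unchanged regions collapse into single copy operations.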
Many methods for producing a difference representation are known. Examples are the UNIX ‘diff’ utility, and iOra Limited's Epsilon Technology (U.S. patent application Ser. No. 09/476,723 filed on Dec. 30, 1999). However, the known methods tend to produce large representations of the differences between one version and an updated version for many common forms of non-textual data. Specifically, data types in which differences tend not to be localized within the data generally produce large difference representations. Important cases of such data types include the following categories:
    1) Executable files. Typically, small changes made to computer source code (e.g., in small problem fixes) result in non-localized changes to the executable file(s) produced by building the source code. A major cause of this effect is that the insertion or modification of small regions of code or data variables will often cause unchanged data and sub-routines to be moved to different addresses throughout the executable. All references to such moved data or sub-routines then change throughout the executable file image. The effect of this can be considerable.
    2) Compressed files. Many data types are typically represented in compressed form so that they take up less space on hard drives and require less time for transmission over computer networks. Small changes to the uncompressed content of such files may then cause large and non-localized changes to the compressed form. Important examples of these data types are the ZIP and CAB compression formats (often used in software distribution) and multimedia files such as images (e.g., GIFs and JPEGs, which are formats frequently used on web pages), sound files, or movies (e.g., MPEGs).
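The effect described for compressed files may be observed with a small experiment, sketched below under stated assumptions. The sketch uses Python's standard `zlib` module as a stand-in for the compression formats named above. The sample data and the helper names `count_differing_bytes` and `compare_raw_and_compressed` are hypothetical. The number of differing compressed bytes depends on the compressor and on the input, so no particular figure is asserted here.

```python
import zlib


def count_differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions at which `a` and `b` differ, plus any length mismatch."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def compare_raw_and_compressed(old: bytes, new: bytes) -> tuple:
    """Return (differing bytes before compression, differing bytes after)."""
    raw = count_differing_bytes(old, new)
    compressed = count_differing_bytes(zlib.compress(old), zlib.compress(new))
    return raw, compressed


# Hypothetical sample data: 2000 short text records.
old_data = b"".join(b"record %d: value=%d\n" % (i, (i * i) % 97) for i in range(2000))

# A single-byte change at the very start of the uncompressed content.
new_data = b"X" + old_data[1:]

raw_diff, compressed_diff = compare_raw_and_compressed(old_data, new_data)
print("differing bytes, uncompressed:", raw_diff)
print("differing bytes, compressed:  ", compressed_diff)
```

A position-by-position comparison is used here only because it is simple. A practical differencing method would also search for moved regions, but it faces the same difficulty whenever the compressed bytes following the change no longer match.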
Accordingly, what is needed is an efficient way of differencing data types in which non-localized changes are a feature, where ‘efficient’ means that small difference representations are produced.