The development of the Internet and the “Global Village” concept has resulted in many distributed computer systems. Frequently, those systems have a “core” information repository, which can be a large set of documents, programs, databases, tables in a database, or even Intranet web sites. Although many high-speed communication technologies are available today, it is still usually not practical for all systems in the distributed network to access a central location, because failure to reach this location can result in total failure of the company's information system. Moreover, the bandwidth required for such accesses is large enough to render them impractical. Many organizations handle this problem either by periodically updating all the branches or by periodically mirroring a central repository to several distributed repositories, thereby achieving some level of fault tolerance. Because transferring full repositories takes a very long time owing to their large sizes, many solutions have been developed to speed up this process.
The most common method is to compress the files before shipping them, using common general-purpose compression algorithms such as LZ77 or LZ78. These methods, however, rely on the statistical properties of the data and are not suitable for compressing every type of data. For example, LZ77 usually shows very poor performance when handling executables. Another disadvantage of existing methods is that if even a very small portion of the file is changed, all of the file data typically needs to be compressed and resent to the receiving system.
Other methods exploit the fact that updates usually involve only minor changes to the files. The most common of these assumes that the repository contains only text documents; it looks for the lines that have changed and transfers only those lines, together with additional information indicating where to place them and which section of the file they replace. The most common example is the diff/patch/merge family of programs used in the UNIX environment. These programs, however, are by definition ill-suited to handling binary data.
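The line-oriented approach can be illustrated with a minimal sketch. It uses Python's standard difflib module rather than the UNIX diff program itself, and the helper names make_line_delta and apply_line_delta are hypothetical, introduced here only for illustration. The delta carries just the changed lines plus the range of old lines each group replaces.

```python
import difflib

def make_line_delta(old_lines, new_lines):
    """Return (start, end, replacement_lines) for each changed range of old_lines."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Old lines [i1, i2) are replaced by new lines [j1, j2).
            delta.append((i1, i2, new_lines[j1:j2]))
    return delta

def apply_line_delta(old_lines, delta):
    """Rebuild the new file from the receiver's old copy plus the delta."""
    result = []
    cursor = 0
    for i1, i2, replacement in delta:
        result.extend(old_lines[cursor:i1])  # unchanged lines, already held locally
        result.extend(replacement)           # changed lines carried by the delta
        cursor = i2
    result.extend(old_lines[cursor:])
    return result

old = ["alpha\n", "beta\n", "gamma\n", "delta\n"]
new = ["alpha\n", "BETA\n", "gamma\n", "inserted\n", "delta\n"]

delta = make_line_delta(old, new)
rebuilt = apply_line_delta(old, delta)
print(rebuilt == new)
```

Because the unit of comparison is the line, a binary file that happens to contain no newline bytes is treated as one long line, so any change causes the whole file to be sent again.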
Other, less common, systems try to compare small fixed-size portions of the files. These systems appear to work only if changes are made by replacing one section with another of the same length; they fail when arbitrary insertions and deletions are involved.
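The weakness of fixed-size comparison can be seen in a short sketch. The block size, the use of SHA-256 digests, and the function names below are assumptions made for illustration and do not describe any particular existing system. Each block of the new file is compared with the block at the same index in the old file.

```python
import hashlib

BLOCK_SIZE = 8  # hypothetical block size, kept small for illustration

def block_digests(data, block_size=BLOCK_SIZE):
    """Digest of each consecutive fixed-size block of data."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]

def changed_blocks(old, new, block_size=BLOCK_SIZE):
    """Indices of blocks in new that differ from the same-index block in old."""
    old_d = block_digests(old, block_size)
    new_d = block_digests(new, block_size)
    return [i for i, d in enumerate(new_d)
            if i >= len(old_d) or d != old_d[i]]

old = bytes(range(64))                     # 8 blocks of distinct bytes
replaced = old[:20] + b"\xff" + old[21:]   # same-length replacement inside block 2
inserted = old[:20] + b"\xff" + old[20:]   # one-byte insertion inside block 2

print(changed_blocks(old, replaced))  # [2]
print(changed_blocks(old, inserted))  # [2, 3, 4, 5, 6, 7, 8]
```

The same-length replacement touches a single block, whereas the one-byte insertion shifts every later byte across a block boundary, so every block from the insertion point onward is reported as changed.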
Therefore, it is highly desirable to have a method that can quickly find similar portions of files regardless of the specific attributes or type of the data, such as text, spreadsheet, and word processor documents. It is also highly desirable to have a method that can quickly identify the changed portions in any type of file, regardless of the type of changes made to the file.