A problem that arises in different contexts is the identification of portions of one file (i.e., contiguous strings of data) that are identical to portions of another file (i.e., identical strings of data in the other file). This problem is illustrated graphically in FIGS. 1A and 1B. More specifically, FIG. 1A illustrates the contents of a file 1 and FIG. 1B illustrates the contents of a file 10. For ease of illustration, the contents of files 1 and 10 are illustrated as being continuous functions, even though in practice this generally will not be the case, e.g., for documents that primarily include text (i.e., characters such as numbers and letters) or for extensible markup language (XML) documents.
The identical segments 2 and 3 that are common to both of files 1 and 10 are highlighted using thicker lines in FIGS. 1A and 1B in order to make them easy to identify visually. In practice, however, it generally is very difficult and processor-intensive to identify such common segments 2 and 3. This is because both of files 1 and 10 often will have other data segments 4-6 that are interspersed between the common segments 2 and 3. As a result, it typically is difficult to even identify where the common segments 2 and 3 begin in each of the files being compared. This problem is further compounded when a large number of files need to be compared in a very efficient manner.
A simple approach to the problem is to divide each file into regularly spaced intervals and then compare the data in the intervals of one file with the data in the intervals of the other file. However, as shown in FIGS. 1A and 1B, this approach does not work well unless the selected intervals are precisely positioned, relative to both of the files to be compared, at least with respect to the common segments 2 and 3.
Unfortunately, this is not the case in the present example. Here, segment 2 in file 1 is covered by intervals 12 and 13 while the identical segment 2 in file 10 is covered by intervals 14-16. Although segment 2 clearly is identical in both of files 1 and 10, merely comparing the data segment in either of intervals 12 and 13, on the one hand, with the data segment in any of intervals 14-16, on the other hand, would not indicate similarity. In fact, where sequences share multiple common segments that are separated from each other, it often will be impossible to identify on an a priori basis a single set of windows that will properly capture each common segment.
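The misalignment described above can be illustrated with a minimal Python sketch. It is an illustration only and is not taken from the figures: the byte strings, the 8-byte interval size, and the names `fixed_blocks` and `matching_blocks` are all hypothetical choices made for this example. Two short "files" contain the same 16-byte segment, but at different offsets, so cutting each file at regularly spaced boundaries yields no interval that is identical in both.

```python
def fixed_blocks(data: bytes, size: int) -> list[bytes]:
    """Split data into consecutive intervals of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def matching_blocks(a: bytes, b: bytes, size: int) -> set[bytes]:
    """Return the intervals that occur in both inputs when each is cut at fixed offsets."""
    return set(fixed_blocks(a, size)) & set(fixed_blocks(b, size))


# Hypothetical data: the same 16-byte segment is embedded in both files,
# preceded by unrelated data of different lengths (3 bytes versus 7 bytes).
common = b"COMMON-SEGMENT-2"
file_1 = b"abc" + common + b"xyz12"
file_10 = b"defghij" + common + b"q"

# The shared segment straddles the interval boundaries differently in each file,
# so no fixed 8-byte interval is identical across the two files.
print(matching_blocks(file_1, file_10, 8))
```

In this sketch the comparison finds matches only when the common segment happens to begin at the same position relative to the interval boundaries in both files, for example when it starts at offset 0 in one file and offset 8 in the other.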