There is a general need in the area of file editors to quickly and efficiently identify changes between an earlier and a later version of a file. As used herein, the term “file” should be interpreted broadly to include any logical grouping of a sequence of units, where a unit is any group of data within a file that is subject to comparison (e.g., a line of text within a source code file). One method of comparing two files and identifying differences comprises using a longest common subsequence (LCS) algorithm. An LCS is the longest sequence of lines that appears, in the same relative order, in both files. In the case of two text files (e.g., source code files) having N and M lines respectively, if there exists an LCS consisting of L lines, then the number of differences “D” equals M+N−2L. In other words, the number of differing lines equals the total number of lines in the two files minus the lines of the LCS, which are determined to be unchanged and are counted once in each file.
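The relationship D = M+N−2L can be illustrated with a minimal sketch. The code below uses the textbook dynamic-programming method for the LCS length, chosen only for clarity; it is not the algorithm discussed in the remainder of this section, and the function names `lcs_length` and `difference_count` are hypothetical.

```python
def lcs_length(a, b):
    """Return the length L of a longest common subsequence of a and b."""
    # prev[j] holds the LCS length of the part of `a` processed so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching units extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise keep the better result of skipping a unit on either side.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def difference_count(a, b):
    """Return D = N + M - 2L for two sequences of units (e.g., lines)."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


# Illustrative data only: two four-line "files".
old_lines = ["a", "b", "c", "d"]
new_lines = ["a", "c", "d", "e"]
L = lcs_length(old_lines, new_lines)        # LCS is a, c, d
D = difference_count(old_lines, new_lines)  # "b" removed, "e" added
```

In this example N = M = 4 and L = 3, so D = 4+4−2·3 = 2, corresponding to one removed line and one added line.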
An LCS search algorithm is described in “An O(ND) Difference Algorithm and Its Variations,” by Eugene W. Myers, published in Algorithmica, Vol. 1, No. 2, 1986, pp. 251-266. The complexity of computing the LCS using such an algorithm is O(N+M+D^2) in the typical case involving text files and O((N+M)*D) in the worst case. The worst case involves files with many repeating lines, e.g., {a, b, a, b} and {b, a, b, a}. In such cases the LCS algorithm must check a number of alternative ways of matching the repeated lines. Meaningful text files usually have a relatively small number of repeating lines. As one can see from the above complexity expressions, the cost of a search for the LCS to identify changes between two files is highly dependent upon “D”, the number of differing lines present in the two files. Thus, tracking changes in large files containing many differences can consume vast amounts of time and computing resources.
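The dependence of the search cost on D can be seen in a compact sketch of the greedy forward search from the cited paper. This is a simplified illustration that returns only D and does not recover the LCS itself or the list of changes; the function name `myers_difference_count` and the use of a dictionary for the diagonal table are assumptions made for readability.

```python
def myers_difference_count(a, b):
    """Return D, the minimum number of inserted plus deleted units."""
    n, m = len(a), len(b)
    max_d = n + m
    # v[k] is the furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    # The outer loop runs once per candidate value of D, so the work
    # grows with the number of differences, not only with file size.
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # step down: take a unit from b (insertion)
            else:
                x = v[k - 1] + 1    # step right: drop a unit from a (deletion)
            y = x - k
            # Follow the run of matching units along the diagonal at no cost.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return max_d  # not reached: the loop always returns by d = n + m


# The repeating-line example from the text.
d_repeat = myers_difference_count(["a", "b", "a", "b"], ["b", "a", "b", "a"])
```

For identical inputs the search ends in the first outer iteration after a single pass along the main diagonal, while each additional difference adds another outer iteration over a wider band of diagonals.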