The present invention relates to methods for computing the minimum edit distance and will at times be referred to by the name “Yameda,” an acronym for “Yet Another Minimum Edit Distance Algorithm.”
This application also incorporates by reference herein the following:
    James W. Hunt, Thomas G. Szymanski: A Fast Algorithm for Computing Longest Common Subsequences. Communications of the ACM (CACM), Volume 20, Number 5, pages 350-353, May 1977;
    E. Myers. An O(ND) difference algorithm and its variations. Algorithmica, pages 251-266, 1986; and
    Andrew Tridgell, Paul Mackerras, The rsync algorithm, Department of Computer Science, Australian National University, Canberra, ACT 0200, Australia, http://samba.anu.edu.au/rsync/tech_report/, circa 1998.
The Hunt and Szymanski method and its variations are in common use on every UNIX/LINUX user's desktop in the form of the “diff” command that can be typed in at the prompt. The “diff” command computes the minimum edit distance between two text files on a line-by-line basis. When confronted with a binary file, the “diff” command prints a message along the lines of “binary files differ” without any more detailed information. The reason for this limitation is simple: the Hunt and Szymanski method requires a time of O(R²), where R is the number of elements being compared that are common to both files. In the case of a line-by-line comparison, the Hunt and Szymanski method is efficient because each line of text is relatively unique. However, the “diff” algorithm is inappropriate for comparing the same files byte-by-byte, because R increases as the “alphabet,” or set of unique symbols in the sequences being compared, shrinks, and the quadratic behavior then becomes significant. For a line-by-line comparison the “alphabet” is the set of all distinct lines in each file; for a byte-by-byte comparison, it is the set of all 256 possible byte values. For files consisting of only printable ASCII (American Standard Code for Information Interchange) characters, the alphabet is smaller still and the inefficiency of “diff” increases; for DNA sequences, with an alphabet of only four bases, R will be at least N/4, where N is the size of the problem.
On a UNIX/LINUX workstation, the quadratic behavior of “diff” can be made visible by comparing two files with the following contents:
    File 1:
        A
        <a large number of blank lines>
        B
    vs.
    File 2:
        X
        <a large number of blank lines, not necessarily the same length as above>
        Z
The number of blank lines in each file can be different. It is necessary to put unique lines at the beginning and the end of each of these files because of an optimization within “diff” that strips off the beginning and the end of the two files for as long as they are the same.
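The two test inputs described above can be generated with a short script. This is a minimal sketch only; the helper name `make_blank_line_file` and the counts of 10,000 and 20,000 blank lines are illustrative assumptions, chosen to match the figures used later in this section.

```python
def make_blank_line_file(first: str, blank_lines: int, last: str) -> str:
    """Return the text of a file with one unique first line, a run of
    blank lines, and one unique last line (hypothetical helper)."""
    return first + "\n" + "\n" * blank_lines + last + "\n"


# Unique first and last lines keep "diff" from stripping the matching
# beginning and end of the two files before the comparison starts.
file1 = make_blank_line_file("A", 10_000, "B")
file2 = make_blank_line_file("X", 20_000, "Z")
```

Writing `file1` and `file2` to disk and timing “diff” on them for increasing blank-line counts is one way to observe how the running time grows.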
A second method, that of Myers86, runs in O(ND) time and, for certain input data, can perform faster than the Hunt and Szymanski method, namely when the number of differences, D, is small. The Myers86 algorithm is efficient for small alphabets such as DNA bases or bytes. However, like “diff”, the Myers86 algorithm can easily be made inefficient with the above example, where there are a large number of differences: the first file would have, for instance, 10,000 blank lines and the second file would have, for instance, 20,000 blank lines. Thus N, the problem size, is greater than or equal to 10,000, and D is also greater than or equal to 10,000, so the Myers86 method requires on the order of ND, or at least 100,000,000, steps. This becomes an obstacle in practice, as many edits consist mostly of insertions or of deletions alone. In a nutshell, the Myers86 algorithm can be thought of as an instance of Dijkstra's algorithm for searching a graph of minimum edit distance subproblems, whereas Yameda is based instead on the faster-converging A* (pronounced “A star”) search algorithm, or simply A* search.
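To make the O(ND) cost concrete, the following is a minimal sketch of the greedy forward search from the O(ND) difference algorithm of Myers incorporated by reference above. It returns only the edit distance D (insertions plus deletions), not the edit script. The function name `myers_edit_distance` and the dictionary representation of the frontier are illustrative assumptions; this is not the Yameda method.

```python
def myers_edit_distance(a, b):
    """Return the minimum number of insertions and deletions that turn
    sequence a into sequence b, using a greedy O(ND) forward search."""
    n, m = len(a), len(b)
    # v[k] holds the furthest x reached so far on diagonal k = x - y.
    v = {1: 0}
    for d in range(n + m + 1):
        # The outer loop runs D + 1 times, and the inner loop visits up to
        # d + 1 diagonals, which is where the dependence on D comes from.
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # move down: take an element of b
            else:
                x = v[k - 1] + 1    # move right: drop an element of a
            y = x - k
            # Follow the "snake" of matching elements at no cost.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m
```

For a scaled-down version of the blank-line example (100 versus 200 blank lines between unique first and last lines), the 100 extra blank lines and the four unique lines give D = 104, so the search has to expand 105 successive frontiers before it terminates.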
A third method, the rsync algorithm, can be used for delta compression for the purpose of transmitting files. In this algorithm, some block size B is chosen (e.g., 1024 bytes) and a rolling checksum is computed at each byte position, summarizing the preceding B bytes. Pairs of blocks that match using the rolling checksum are confirmed using a more powerful checksum, and thus the rsync algorithm can be used for delta-compressed transmission of information, e.g., over the Internet, even when the two files are not both present at the same location.
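The rolling checksum can be sketched as follows, assuming the two-component weak checksum described in the rsync technical report incorporated by reference above (a plain byte sum and a position-weighted byte sum, each taken modulo 2^16). The function names are illustrative, and the stronger confirming checksum is omitted.

```python
M = 1 << 16  # modulus assumed for both checksum components


def weak_checksum(block: bytes):
    """Compute the (a, b) checksum pair of one block from scratch."""
    n = len(block)
    a = sum(block) % M
    # The first byte of the block gets weight n, the last byte weight 1.
    b = sum((n - i) * byte for i, byte in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, block_size):
    """Slide the window one byte to the right in constant time."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_size * out_byte + a) % M
    return a, b


def rolling_checksums(data: bytes, block_size: int):
    """Yield the (a, b) pair for every window of block_size bytes."""
    if len(data) < block_size:
        return
    a, b = weak_checksum(data[:block_size])
    yield a, b
    for start in range(1, len(data) - block_size + 1):
        a, b = roll(a, b, data[start - 1], data[start + block_size - 1], block_size)
        yield a, b
```

The point of the design is that each new window costs a constant number of operations rather than B, which makes it affordable to test every byte offset for a matching block.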
However, the rsync algorithm does not solve certain problems. Rolling-checksum algorithms overestimate the number of differences: if the differences are sprinkled about in the file at a density on the order of one per block, each affected block will be sent, stored, or displayed in its entirety as a difference, rather than only the smallest difference. Thus, the whole file may still be sent or stored. If the block size is too small, there will be many false matches, and hence incorrect results.
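The overestimate can be illustrated with a deliberately simplified block matcher. As assumptions for this sketch, blocks are compared only at aligned offsets and by their full contents, instead of by rolling checksum at every offset; the function name `count_unmatched_blocks` is hypothetical.

```python
def count_unmatched_blocks(old: bytes, new: bytes, block_size: int) -> int:
    """Count the aligned blocks of `new` that have no identical block in
    `old`. Each such block would be transmitted or stored whole."""
    old_blocks = {old[i:i + block_size] for i in range(0, len(old), block_size)}
    return sum(
        1
        for i in range(0, len(new), block_size)
        if new[i:i + block_size] not in old_blocks
    )
```

With a single changed byte in every block, every block is reported as unmatched, so the cost of the delta approaches the size of the whole file even though the true edit distance is only one byte per block.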
Thus, algorithms such as rsync cannot be used for applications such as displaying the true minimum edit distance on the user's console to confirm that the edits are as expected, or performing a three-way merge of files in HTML, XML, rich text, or any other format. Nor are they suitable for applications where a high degree of delta compression is desired, or where it is likely that at least one difference will exist in every block.