The edit distance between two character strings is the minimum number of single-character insert and delete keyboard operations required to change the first string into the second string. We denote the edit distance between string x and string y as ed(x, y). The Hamming distance between two bit strings of the same length is the number of bits in the first string that must be switched to result in the second string, or in other words, the number of bit positions in which the two strings differ. We denote the Hamming distance between string x′ and string y′ as H(x′, y′). For ease of expression we refer to input strings in general as character strings, and to output strings as bit strings, since we are discussing the mapping of character strings (in the edit distance metric) to bit strings (in the Hamming distance metric). However, details of the method described herein may refer to input strings as bit strings as well.
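By way of illustration only, the two distance measures defined above may be sketched in Python as follows. The function names edit_distance and hamming_distance are hypothetical names chosen for this example and do not appear in the method described herein. The dynamic-programming approach shown is one standard way to compute an insert/delete-only distance and is not asserted to be the approach used by any embodiment.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of single-character inserts and deletes turning x into y.

    Illustrative sketch only: substitutions are not a primitive operation here,
    so replacing one character costs one delete plus one insert.
    """
    # prev[j] is the distance between the prefix of x processed so far and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletes
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                # Matching characters cost nothing.
                curr.append(prev[j - 1])
            else:
                # Either delete cx from x, or insert cy into x.
                curr.append(1 + min(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(y)]


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))
```

Under these definitions, changing "abc" into "abd" takes one delete and one insert, so ed("abc", "abd") = 2, and the bit strings 10110 and 10011 differ in two positions, so H(10110, 10011) = 2.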
There are many useful applications involving strings for which very efficient solutions are known in the Hamming distance metric, including sketching and nearest neighbor search. See, e.g., U.S. Pat. No. 6,226,640 (the '640 patent) issued May 1, 2001, to Ostrovsky and Rabani, titled “Method for determining approximate hamming distance and approximate nearest neighbors of a query,” the contents of which are incorporated by reference as if set forth fully herein. For character strings in the edit distance metric, such solutions are scarce but needed. For example, edit distance plays a central role in text processing and many web applications. Furthermore, fast estimation of edit distance and efficient search according to the edit distance are widely investigated and used in computational biology. Thus, it would be desirable to provide a method to map input character strings to output bit strings such that the Hamming distance between two output bit strings is approximately proportional to the edit distance between their corresponding input character strings. Such a mapping would allow known procedures in the Hamming distance metric to be performed on the output strings and to yield results similar (within an accepted tolerance) to those that would be obtained if the procedures were available in the edit distance metric and were performed on the input character strings.
Previously, such a mapping was not known, as stated in the following publication, the contents of which are incorporated by reference as if set forth fully herein: “Lower bounds for embedding edit distance into normed spaces.” A. Andoni, M. Deza, A. Gupta, P. Indyk, and S. Raskhodnikova. Proc. of the 14th Ann. ACM-SIAM Symp. on Discrete Algorithms, Baltimore, Md., January 2003, pages 523-526.