1. Field of the Invention
The invention relates generally to methods and apparatus for matching strings.
2. Background Art
Approximate string matching techniques are used in searching for strings that match a query term. Approximate string matching is an important task in applications such as spell checking in text processors, information retrieval in databases, net and latch mapping in computer-aided design, protein and nucleic acid sequence identification in molecular biology, and handwriting and speech recognition. Approximate string matching techniques involve finding occurrences of a pattern string P=p1p2 . . . pm in a text string T=t1t2 . . . tn, where ti, pi belong to some known alphabet. Approximate string matching techniques find all locations j in T such that a suffix of T[1 . . . j] matches P with k or fewer differences, where k is greater than or equal to zero. When k is zero, the matching scheme is said to be exact. Approximate string matching techniques have been studied extensively in the field of computer science. See, for example, Amihood Amir and Martin Farach, “Efficient 2-dimensional approximate matching of non-rectangular figures,” Proceedings of the second annual ACM-SIAM symposium on Discrete algorithms, 1991, pages 212-223, and Richard Cole and Ramesh Hariharan, “Approximate String Matching: A Simpler Faster Algorithm,” Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms, 1998, pages 463-472.
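By way of illustration only, the problem of finding all locations j in T at which a suffix of T[1 . . . j] matches P with k or fewer differences may be sketched as follows. The sketch assumes unit cost for each insertion, deletion, and substitution, and uses a conventional column-by-column dynamic programming formulation; the function name `approximate_match_ends` is hypothetical and is not taken from any of the cited references.

```python
def approximate_match_ends(pattern, text, k):
    """Return the 1-based positions j in text at which some substring
    ending at j matches pattern with k or fewer differences.

    Assumption: every insertion, deletion, and substitution costs 1.
    """
    m = len(pattern)
    # col[i] holds the fewest differences between pattern[:i] and any
    # substring of the text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, text_char in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0  # a match may begin at any position in the text
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == text_char else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra character in the text
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


print(approximate_match_ends("abc", "xxabcxx", 0))  # [5]
print(approximate_match_ends("abc", "xxabcxx", 1))  # [4, 5, 6]
```

With k equal to zero, only the exact occurrence ending at position 5 is reported; with k equal to one, the neighboring positions 4 and 6 are reported as well.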
Approximate string matching techniques involve computing the “edit distance” between two strings. The “edit distance” between two strings is the minimum number of insertions, deletions, and substitutions required to convert one string to the other. The objective of approximate string matching techniques is to determine the cost edit distance, i.e., the minimum number of edit operations, required to transform one string to the other. The most common method of computing cost edit distance is dynamic programming. See, for example, Gene Myers, “A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming,” Journal of the ACM, Vol. 46, No. 3, May 1999, pages 395-415. The exact nature of dynamic programming is known and will not be discussed in this application. There are other methods for determining cost edit distance which do not involve dynamic programming. For example, U.S. Pat. No. 5,761,538 issued to Hull discloses a method for estimating cost edit distance which includes equalizing the lengths of two strings by adding padding elements to the shorter one of the strings. The two strings are sorted according to their element values. Then a sum of substitution costs of the elements in corresponding positions in the sorted strings is calculated. The sum of the substitution costs is then set as the lower bound estimate of the cost edit distance.
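For reference only, the edit distance defined above may be computed by a conventional dynamic programming recurrence such as the following minimal sketch. It assumes unit cost for each insertion, deletion, and substitution, and it is not the bit-vector algorithm of Myers nor the estimation method of Hull; the function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions
    required to convert string a into string b (unit costs assumed)."""
    # prev[j] is the edit distance between the processed prefix of a
    # and b[:j]; initially the prefix of a is empty.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # converting a[:i] to the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

The example converts "kitten" to "sitting" with two substitutions and one insertion.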
In one aspect, the invention is a method for comparing two delimited strings, each of which has a plurality of substrings. The method comprises pairing each substring in one of the delimited strings with a corresponding substring in the other one of the delimited strings, computing a proximity value for each pair of substrings, and computing a set of decaying weights corresponding to the pairs of substrings. The method further comprises multiplying the proximity value for each pair of substrings by the corresponding weight and summing the weighted proximity values to obtain a strength of match between the delimited strings.
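The steps recited above may be illustrated by the following minimal sketch. Several details are assumptions made solely for the example and are not specified in this section: the delimiter is taken to be a period, substrings are paired by position (an unpaired substring being paired with an empty substring), the proximity value is taken to be one minus a normalized edit distance, and the decaying weights are taken to be a geometric series normalized to sum to one. The names `proximity`, `decaying_weights`, and `strength_of_match` are hypothetical.

```python
from itertools import zip_longest


def proximity(a, b):
    """Assumed proximity value in [0, 1]: 1 minus normalized edit distance."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def decaying_weights(count, decay=0.5):
    """Assumed weights: geometric decay, normalized to sum to 1."""
    raw = [decay ** i for i in range(count)]
    total = sum(raw)
    return [w / total for w in raw]


def strength_of_match(first, second, delimiter=".", decay=0.5):
    """Sum of weighted proximity values over the paired substrings."""
    # Pair substrings by position; pad the shorter list with empty strings.
    pairs = list(zip_longest(first.split(delimiter),
                             second.split(delimiter), fillvalue=""))
    weights = decaying_weights(len(pairs), decay)
    return sum(w * proximity(a, b) for w, (a, b) in zip(weights, pairs))


print(strength_of_match("a.b.c", "a.b.x"))  # mismatch in the last substring
print(strength_of_match("x.b.c", "a.b.c"))  # mismatch in the first substring
```

Under these assumptions, a mismatch in an early pair of substrings lowers the strength of match more than the same mismatch in a later pair, because the earlier pair carries the larger weight.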
Other aspects and advantages of the invention will be apparent from the following description and the appended claims.