1. Technical Field
The present invention relates to comparison of data strings to find similarities between such data strings.
2. Discussion of the Related Art
String comparison algorithms are typically used for name comparisons in large volumes of data so that names containing typographical or spelling errors can be equated. For example, consider a name “Patricia” and a typographical misspelling of the name, such as “Patircia”. These two data strings will receive a very high similarity score from conventional edit distance algorithms, such as Jaro-Winkler distance or Damerau-Levenshtein distance algorithms, where a high score indicates that the pair of data strings represents variant forms of the same name. Edit distance algorithms of the types referenced herein can be very helpful for short name strings, in which the number of characters may be insufficient for other string comparison methods to generate acceptable similarity scores.
However, short strings present another difficulty for string comparison algorithms. For example, when considering name strings of three or four characters, a single letter difference or even a transposition of two letters may be enough to distinguish completely different names. Consider, for example, the name strings “Mair” and “Amir”. These two name strings differ by a single transposition of adjacent letters, but they may be completely unrelated names. Similarly, the name strings “Bill” and “Jill” differ by one letter but are not related. Conventional string similarity calculations make no distinction in scoring between pairs like these (in which the name strings refer to different names) and a pair like “Patricia” and “Patircia” (which very likely represent the same name).
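The scoring problem described above can be illustrated with a minimal sketch. It assumes a case-insensitive optimal string alignment distance (a restricted form of Damerau-Levenshtein distance) as the edit distance measure; the function name `osa_distance` is illustrative only and does not come from any particular library or from the algorithms referenced herein.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: the minimum number of insertions,
    deletions, substitutions, and adjacent transpositions (assumed measure)."""
    a, b = a.lower(), b.lower()  # compare names without regard to case
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i leading characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j leading characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # adjacent transposition, e.g. "ri" <-> "ir"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


pairs = [("Patricia", "Patircia"), ("Mair", "Amir"), ("Bill", "Jill")]
for left, right in pairs:
    # every pair is reported at distance 1
    print(f"{left!r} vs {right!r}: distance {osa_distance(left, right)}")
```

Under this measure all three pairs are exactly one edit apart, so the raw distance by itself cannot separate the likely misspelling from the two pairs of unrelated names.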
The inability to discriminate between certain string pairs is a significant weakness in name scoring algorithms, since it leads to increased numbers of false positives in search return sets. For example, using a scoring algorithm which allows a single character difference, a search on a name string “HAI LIN” could return any or all of the following: CAI LIN, BAI LIN, KAI LIN, LAI LIN, MAI LIN, NAI LIN, SAI LIN, TAI LIN, WAI LIN, XAI LIN, ZAI LIN, HAL LIN, HAK LIN, HAN LIN, HAO LIN, HAI LING, HAI WIN, HAI JIN, HAI QIN, HAI XIN, HAI LAN. Each of the returned name strings may be a legitimate name that is not related to the search name. However, conventional edit distance algorithms could very likely construe all of the returned names as matching the search name.
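The false positive behavior in the “HAI LIN” example can be reproduced with a short sketch. It assumes a simple matcher that accepts any candidate within one substitution, insertion, or deletion of the search string; the helper name `within_one_edit` is hypothetical and stands in for whatever single-difference scoring rule a given system applies.

```python
def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one substitution, insertion,
    or deletion (an assumed single-character-difference rule)."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # ensure a is the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substituted character
    return a[i:] == b[i + 1:]          # exactly one extra character in b


query = "HAI LIN"
candidates = [
    "CAI LIN", "BAI LIN", "KAI LIN", "LAI LIN", "MAI LIN", "NAI LIN",
    "SAI LIN", "TAI LIN", "WAI LIN", "XAI LIN", "ZAI LIN", "HAL LIN",
    "HAK LIN", "HAN LIN", "HAO LIN", "HAI LING", "HAI WIN", "HAI JIN",
    "HAI QIN", "HAI XIN", "HAI LAN",
]
matches = [name for name in candidates if within_one_edit(query, name)]
print(f"{len(matches)} of {len(candidates)} candidates returned")
```

Every one of the 21 listed candidates passes the single-difference test, so a search for “HAI LIN” under this rule would return all of them.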