Searching and matching techniques provide useful ways to retrieve data from databases. Fuzzy string matching (also called non-exact or approximate string matching) is a technique for finding strings (or data) that match a pattern approximately, rather than exactly. One exemplary application of fuzzy matching is broadening the search results for a given input. Another exemplary application is cleansing and standardizing data to improve its quality. For example, some data in the database may be incorrect due to user input errors. Common sources of errors include entering strings that “look like” or “sound like” the intended data. Such input data may be corrected by retrieving a candidate record from the reference universe that approximately matches the incorrect input data, and replacing the incorrect input data with the matching candidate data before storing it in the database.
One type of fuzzy matching technique is the Soundex algorithm, which was first developed by Robert C. Russell and Margaret K. Odell in 1918 and 1922. See U.S. Pat. Nos. 1,261,167 and 1,435,663, which are hereby incorporated by reference. The Soundex algorithm indexes data by sound, as pronounced in English, by encoding homophones to the same representation or key, so that they can be matched despite minor differences in spelling. Another conventional technique compares two given strings by determining their Levenshtein distance (or edit distance). The Levenshtein distance measures the difference between two strings as the minimum number of single-character edit operations (insertions, deletions or substitutions) necessary to transform one string into the other. The problem with such conventional algorithms, however, is that they are not well-suited for ideographic or non-English characters, such as Chinese or Japanese characters. The phonetic rules used in conventional phonetic algorithms are designed for English pronunciations, and not for pronunciations of other languages. In addition, conventional phonetic algorithms do not take into account the different pronunciations that may be used by different dialect groups. Even further, the cost of computing the Levenshtein distance is roughly proportional to the product of the two string lengths, which makes it impractical for longer strings or large datasets.
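By way of illustration only, the two conventional techniques described above may be sketched in Python as follows. This is a minimal sketch under stated assumptions and not a definitive implementation: the function names soundex and levenshtein are hypothetical, the Soundex variant shown is the commonly used four-character American form (which may differ in detail from the encodings of the cited patents), and the edit distance uses unit cost for each insertion, deletion and substitution.

```python
def soundex(name: str) -> str:
    """Four-character American Soundex key (illustrative variant, an assumption)."""
    # Consonant groups that share a digit; vowels, Y, H and W have no digit.
    groups = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
              ("L", "4"), ("MN", "5"), ("R", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}

    # Only the letters A-Z are covered by the English phonetic rules.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""  # nothing encodable, e.g. purely ideographic input

    key = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            key += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (key + "000")[:4]  # pad with zeros or truncate to four characters


def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming, keeping only two rows."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

In this sketch, soundex("Robert") and soundex("Rupert") both yield the key R163, so the two spellings match despite differing by two substitutions (levenshtein("Robert", "Rupert") is 2). A purely ideographic input such as "张伟" yields an empty key, because none of its characters fall within the letters covered by the English phonetic rules. The nested loops in levenshtein visit every pair of character positions, which reflects the cost proportional to the product of the two string lengths noted above.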
Accordingly, there exists a need for new and innovative solutions for searching and matching ideographic and non-English characters.