Searching and matching techniques provide useful ways to cleanse and standardize data in databases and thereby improve data quality. For example, some data in a database may be incorrect due to user input errors. Common sources of error include entering strings that “look like” or “sound like” the intended data. Such input data may be corrected by finding a candidate record in a reference universe that fuzzily matches the incorrect input data, and replacing the incorrect input data with the matching candidate data before storing it in the database.
One type of phonetic fuzzy matching method is the Soundex algorithm, which was first developed by Robert C. Russell and Margaret K. Odell and patented in 1918 and 1922. See U.S. Pat. Nos. 1,261,167 and 1,435,663, which are hereby incorporated by reference. The Soundex algorithm indexes data by sound, as pronounced in English, by encoding homophones to the same representation or key, so that they can be matched despite minor differences in spelling. Another conventional technique compares two given strings by determining their Levenshtein distance (or edit distance). The Levenshtein distance measures the difference between two strings as the least number of single-character edit operations (insertions, deletions, or substitutions) necessary to transform one string into the other.
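By way of illustration only, the two conventional techniques described above may be sketched in Python as follows. The function names `soundex` and `levenshtein` are hypothetical and are not taken from the referenced patents. The sketch assumes the commonly used American Soundex variant (a four-character key, with H and W ignored when collapsing repeated codes) and a unit cost for each edit operation.

```python
# Letter-to-digit table of the commonly used American Soundex variant.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return a four-character Soundex key, or "" if no A-Z letters remain."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (key + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Least number of insertions, deletions, and substitutions from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(levenshtein("kitten", "sitting"))      # 3
```

In this sketch, "Robert" and "Rupert" receive the same key and would therefore be treated as a phonetic match, while "kitten" and "sitting" differ by three edit operations.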
The problem with conventional algorithms, however, is that they are not well suited to ideographic or non-English characters, such as Chinese or Japanese characters. The phonetic rules used in conventional phonetic algorithms are designed for English pronunciations, and not for pronunciations of other languages. In addition, conventional phonetic algorithms do not take into account the different pronunciations that different dialect groups may use. Similarly, Levenshtein algorithms may not be directly applicable to ideographic or non-Latin strings, because such strings are typically short, so that even a single edit operation changes a large fraction of the string.
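The short-string limitation may be illustrated, purely as an example, by the following Python sketch. The function names and the sample names are hypothetical, and dividing the distance by the longer string length is an assumption of this sketch rather than part of any conventional algorithm.

```python
def levenshtein(a: str, b: str) -> int:
    """Least number of insertions, deletions, and substitutions from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def relative_distance(a: str, b: str) -> float:
    """Edit distance as a fraction of the longer string (illustrative only)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# One substituted letter in an eight-letter Latin name.
print(relative_distance("Jonathan", "Jonathon"))  # 0.125

# One substituted character in a two-character Chinese name. The first
# characters (张 and 章) are both pronounced "zhang", yet the edit
# distance treats them as entirely different symbols.
print(relative_distance("张伟", "章伟"))  # 0.5

# A completely unrelated two-character name is only one edit further away.
print(levenshtein("张伟", "李娜"))  # 2
```

For a two-character string the distance can only be 0, 1, or 2, so a likely homophone error and an unrelated name are separated by a single step, which leaves little room for a meaningful match threshold.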
Therefore, there is a need for an improved searching and matching framework that addresses the above-mentioned challenges.