The present invention relates to a computerized method and apparatus for matching misspellings caused by phonetic variations, and in particular to such a method and apparatus that is applicable to the matching of proper nouns and other words transcribed into Romanized script from non-Roman languages.
In today's data-driven environment, many businesses maintain or use large amounts of personal data that is recorded, processed, and standardized over a period of time. The vast volumes of data include the names and addresses of consumers and businesses. Such information may be collected from many different sources, with different formats and different degrees of accuracy and standardization. Retailers and others who collect this information may wish to correct, de-duplicate and standardize the information (i.e., data “hygiene”), or to supplement the data they have with additional information, either about existing customers or prospective customers. Very large, standardized databases containing this type of consumer information include the InfoBase product of Acxiom Corporation, which facilitates these functions for retailers and others, as well as additional functions such as real-time verification of a potential customer's identity. The success of these systems largely depends on the degree to which they can overcome discrepancies arising from variations or transcription errors in names, addresses, and other strings maintained in these records as collected from many different sources. String matching based on simple algorithms, such as finding the distance between variations, has proven ineffective, and hence more sophisticated phonetic variation or pattern matching algorithms must be applied.
The problem of name variations (either spelling or phonetic variations) and the difficulty of trying to match names based on identification of variations is greatly magnified when the information crosses between different languages, different language families, and different scripts applicable to those languages. These spelling and phonetic variations in proper nouns have been a consistent problem in various applications, particularly with respect to data hygiene applied to names, addresses, and other such terms on an international basis. One service in which this type of standardization is performed is Global Hygiene Services (GHS) offered by Acxiom Corporation. GHS is used for standardization of businesses, addresses, and names for more than one hundred countries across the world. Most of the information retrieval and storage systems at large data services providers have the inbuilt capability to record and process personal data from multiple sources. In such circumstances it is of utmost importance to be able to differentiate when references are made to the same entity or when duplicate entries exist, despite differences in language or script. This issue is encountered on a daily basis in hygiene systems such as GHS when a language expert is made responsible for standardizing a vast quantity of inputs. In such cases a language expert must spend a great amount of time and effort in identifying duplicates or matching proper nouns that are misspelled due to spelling or phonetic variations. Hence it is a non-trivial task to not only purge duplicates but also match proper nouns that are misspelled across different languages. These proper nouns may undergo several phonetic and spelling variations due to different pronunciations, naming conventions, languages, syllables, individual preferences, and cultural diversity. The failure of standardized algorithms for this purpose has required that much of this process be performed manually by such experts.
Most of the variations addressed in GHS and other data hygiene systems can be categorized as variations in spelling; variations in phonetics; or variations in character. Variations in spelling are caused primarily by typographical errors (such as transposed letters), unnecessary substitution during transcription, and the addition or sometimes even deletion of characters. Usually such variations are caused by mispronunciation or mishearing that does not affect the phonetic structure. Variations in phonetics occur where the structure of the proper noun is significantly modified due to alterations in phonemes. For example, the business name “Makudonarudo” in Japanese and “McDonald's” in English are related names but their phonetic structure appears completely different, increasing the complexity in matching them algorithmically. Variations in character include changes due to capitalization, punctuation, spacing, and abbreviations, which compared to the other problems are relatively well handled by data hygiene services when treated alone. But the combination of these variations, as well as potentially distinct words from different languages, makes the matching process a very challenging task.
It may be seen that the primary objective of matching in certain contexts as set forth above is to determine if two or more computer records relate to the same person, object, event, or other proper noun. One simple approach to string matching may be based on determining the “distance” between the two strings. A common string distance measure is the Levenshtein distance. The Levenshtein distance between two character strings is the minimum number of changes (such as adding a character to the string, deleting a character from the string, or replacing a character in the string with a different character) that must be made in order to transform one of the character strings into the other character string. It may be seen that Levenshtein distance is of limited utility in matching words based on phonetic differences, such as in the “McDonald's” example given above, since these two proper nouns may have many character changes that result in a high Levenshtein distance value even though the words are in fact related.
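The Levenshtein distance described above can be illustrated with a minimal Python sketch. It uses the standard dynamic-programming formulation and is offered only as an illustration; the function name `levenshtein` is an assumption made for this example and does not come from any system described herein.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or replacements needed to transform a into b."""
    # previous[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # transforming a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # replace ch_a with ch_b, or keep it
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # prints 3
print(levenshtein("smith", "smyth"))     # prints 1
```

A single-letter spelling variation such as “smith” and “smyth” yields a distance of 1, whereas a phonetically related pair such as “makudonarudo” and “mcdonald's” yields a much larger value, which is why a fixed distance threshold tends to miss such pairs.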
There are a number of algorithms in the prior art that have attempted to solve this phonetic matching problem in a general fashion, ranging from those that identify simplistic variations to those that take phonetic variations into account. Many of these methods are language specific, with highly complex mechanisms for parsing and matching variations. Some of the most popular prior art methods include Soundex, Phonex, NYSIIS, and Guth Matching.
Soundex was initially developed for use with English phonetics. The technique standardizes each variation by converting it to an equivalent four-character code. Several variations exist, such as Henry Matching and Daitch-Mokotoff coding for Slavic and German spellings. A major disadvantage of Soundex and its variants is that they need the first letter of the proper noun to be correct. Thus any spelling or phonetic variation at the beginning of the proper noun propagates to the rest of the matching and results in a completely different Soundex code and thus a likely matching error.
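The first-letter dependence of Soundex can be seen in a short Python sketch. It follows the commonly published American Soundex rules (the first letter is kept, consonants are mapped to digits, adjacent duplicate codes are collapsed, and the result is padded to four characters); the function name `soundex` and the table name `_CODES` are assumptions made for this example, not part of any prior art implementation.

```python
# Consonant-to-digit table of the commonly published American Soundex rules.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code of name ("" if it has no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    code = first.upper()          # the first letter is kept verbatim
    prev = _CODES.get(first, "")  # its digit still suppresses a duplicate
    for c in letters[1:]:
        digit = _CODES.get(c, "")  # vowels, h, w, y have no digit
        if digit and digit != prev:
            code += digit
        if c not in "hw":          # h and w do not separate duplicate codes
            prev = digit
    return (code + "000")[:4]      # pad with zeros, truncate to four characters


print(soundex("Robert"), soundex("Rupert"))  # prints R163 R163
print(soundex("Kathy"), soundex("Cathy"))    # prints K300 C300
```

“Robert” and “Rupert” share a code because they differ only after the first letter, while “Kathy” and “Cathy” receive different codes even though the remainder of each code is identical.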
Phonex is a prominent variation of Soundex, which includes the additional complexity of preprocessing proper nouns based on their English pronunciations before the actual encoding begins. As with Soundex, the leading character of the proper noun is still maintained, so the increased complexity affects only the remainder of the proper noun. This approach is also not language independent.
NYSIIS, based partly on Phonex, is a relatively slow algorithm with a high degree of complexity due to the application of hundreds of transformations at the beginning, middle and sometimes even at random positions of the string being analyzed.
Guth Matching is based on comparing alphabetic characters from left to right and has many advantages over Soundex, such as data independence, consideration of alternate spellings, and no need for prior generation of a sorting key. The algorithm has proven to be relatively weak, however, when comparing shorter proper nouns.
A number of metrics may be applied to analyze these various prior art algorithms. Such metrics include the total number of pairs of known words in the dictionary by a language expert; the percentage of true matches; the percentage of true mismatches; the overall accuracy; the number of comparisons performed in the dictionary for rule generation; and the time of execution (i.e., the time taken or elapsed to match two unknown words not in the dictionary). None of these various known algorithms provides a high number of matches as measured by these metrics, and thus an improved apparatus and method for matching may be seen as highly desirable.