Typographical errors, phonetic misspellings, abbreviations, common short-names, and sequence variation are but a few of the problems facing searchers of computerized records. For example, when calling directory assistance, if a request is made for the telephone number of Thomas Lee, without spelling either name, a telephone operator may search for Tomas Leigh, Thomas Lea, or any of several combinations thereof. In addition, when Thomas Lee was entered into the database, he may have accidentally been entered as Lee Thomas, and either or both of his names may have been misspelled.
The task of properly searching a computerized database becomes even more complex when names composed of foreign characters are used. Examples of such databases include those containing genealogical records, foreign city names, foreign names, or company names.
To overcome these problems, some in the prior art have created techniques involving character manipulation. Soundex, which is one of the most widely used of these techniques, is a simple process of associating certain letters with numbers, and dropping other letters. A search is performed on the result, and that search may yield names that sound like or otherwise approximate the name in question.
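As an illustration of the letter-to-number association described above, the following is a minimal sketch of the commonly published American Soundex rules. The function name `soundex`, the `CODES` table, and the four-character code length are assumptions made for this example; they are not taken from any of the cited references.

```python
# Letter groups and their digits under the commonly published Soundex rules.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a four-character Soundex-style code, or "" if name has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's own code suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are dropped and do not separate equal codes
        code = CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    # Pad with zeros and truncate to one letter plus three digits.
    return (first + "".join(digits) + "000")[:4]


for candidate in ("Thomas", "Tomas", "Lee", "Lea", "Leigh"):
    print(candidate, soundex(candidate))
```

Under these rules Thomas and Tomas both reduce to T520, and Lee and Lea both reduce to L000, so each pair would be returned together by a search on the code. Leigh reduces to L200 and would not be matched with Lee, which shows that the approach can miss sound-alike spellings as well as report unrelated ones.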
Others in the prior art have described schemes through which result sets may be generated based on manipulation of an input word. One such technique, disclosed by U.S. Pat. No. 4,833,610 by Antonio Zamora et al., separates and alphabetizes the consonants and vowels of a given word, and compares a transformed input string to transformed database entries. Another technique, disclosed by U.S. Pat. No. 5,737,723 by Michael Dennis Riley et al., compares dictionary words based on the phonetic confusability of the words. Still another method, disclosed by U.S. Pat. No. 5,724,597 by Robert John Cuthbertson et al., involves successively applying Soundex and other techniques and generating a match list based on the results.
While Soundex and other such schemes may allow the reporting of “near matches,” the number of false positives reported by these schemes can prohibit their use in large databases. For example, in a database of 1000 names, if the Soundex routine had a false positive rate of 0.005, only five false names would be returned. However, when the database grows to 100,000 names, five hundred false positives are reported.
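The scaling problem can be sketched as a simple proportion, assuming the false positive rate stays constant as the database grows. The function name `expected_false_positives`, the rate, and the database sizes below are hypothetical values chosen for illustration, not measurements of any particular scheme.

```python
def expected_false_positives(rate: float, database_size: int) -> int:
    """Expected number of false matches per query, assuming a constant rate."""
    return round(rate * database_size)


# Hypothetical rate and sizes: the count grows in direct proportion to the size.
illustrative_rate = 0.01
for size in (10_000, 1_000_000):
    print(size, expected_false_positives(illustrative_rate, size))
```

Because the expected count is the product of the rate and the number of entries, a rate that seems negligible for a small database yields a result list too long to review by hand once the database is a hundred times larger.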