In the field of text processing and database searching, the need to cross-reference indexes using combined phrases has become an increasingly common problem. Among other uses, demand for combined phrase cross-referencing engines in name searching tools has increased markedly. Name searching tools exist in several different contexts. For example, many handheld and cellular devices contain address books, and a combined phrase cross-referencing engine is necessary to search these address books conveniently. Similarly, with the advancement of voice recognition technology, several companies offer online directories of topics and personnel. Databases such as those of telephone subscribers and hospital patients often require that a human or automated search for names be conducted without a complete and accurate specification of the spelling or of every part of the name. In voice recognition systems, because the spoken words are not always clear, the most likely possible words are derived from the sounds recognized by the system, and these words are cross-referenced against the company directories.
An effective cross-referencing engine will allow rapid comparison of the combined phrase or phrases to the referenced database. Users are generally unwilling to wait long periods of time to find matches for their queries, and in an environment where a live assistant is available (such as an operator or receptionist when using a voice recognition directory), the user may switch to the live assistant option to avoid the wait time. This would increase expenses to the company by increasing the staffing necessary to respond to the user. With other devices, convenience is a driving factor when a user is choosing a device to purchase. The speed with which a device can process a query may cause a user to purchase one device over another. Therefore, efficiency of the cross-referencing engine is critical in all the devices discussed above.
The other major concern for users with regard to cross-referencing engines is accuracy. When comparing a combined term to a list of single or combined terms, the user wants only the relevant results. However, the user also wants to ensure that no relevant results are omitted. This balance is extremely difficult to achieve, especially with name searches. For such services, if a match is not exact, it is preferable that the differences and similarities be knowledge-based (for example, “mohd” is a conventional abbreviation for “Mohamed”) rather than “fuzzy” (for example, “Hasan” and “Wasan” differ by one letter). Frequently, only a partial name is known when a name search is performed. This may be, for example, the first letter of the first name and the entire last name. In this case, the user would want all possible names located. Another common problem is that names are frequently misspelled. For example, the name Thompson is also spelled Thomsen, Thomson, Tomson, Tomsen, and in several other ways. A user may, in some cases, want to find all the variants of this name when searching a directory, in case of a misspelling. This is common in company directories and other voice recognition directory systems. Performing an accurate search of a combined term generally requires a complex series of iterations and lexicographic algorithms. These steps can significantly slow operation of the cross-referencing engine. Therefore, a system is needed in the art that efficiently cross-references a search term and all related variants against a database.
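The distinction drawn above between knowledge-based and fuzzy matching can be illustrated with a minimal Python sketch. The variant table KNOWN_VARIANTS, the function names, and the one-letter threshold are hypothetical and chosen only for illustration; they are not taken from any system described in this specification.

```python
# Hypothetical knowledge table: each known variant maps to a canonical form.
KNOWN_VARIANTS = {
    "mohd": "mohamed",
    "thomsen": "thompson",
    "thomson": "thompson",
    "tomson": "thompson",
    "tomsen": "thompson",
}


def canonical(name: str) -> str:
    """Return the canonical form of a name, or the lowercased name itself."""
    key = name.lower()
    return KNOWN_VARIANTS.get(key, key)


def knowledge_match(a: str, b: str) -> bool:
    """Two names match only if the knowledge table links them."""
    return canonical(a) == canonical(b)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_match(a: str, b: str, max_distance: int = 1) -> bool:
    """Two names match if they differ by at most max_distance edits."""
    return edit_distance(a.lower(), b.lower()) <= max_distance
```

Under these assumptions, knowledge_match links “mohd” to “Mohamed” but not “Hasan” to “Wasan”, while fuzzy_match does the opposite, which is the behavior the paragraph above contrasts.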
U.S. Pat. No. 6,018,708, entitled “METHOD AND APPARATUS FOR PERFORMING SPEECH RECOGNITION UTILIZING SUPPLEMENTAL LEXICON OF FREQUENTLY USED ORTHOGRAPHIES,” discloses a system for obtaining the most likely matches for input speech and introducing additional matches to the list based on prior usage of the system by the speaker. To obtain the most likely matches, multiple comparisons of the spoken word are made to a rated dictionary, and words from the dictionary are eliminated with each comparison according to the ranking of the words as a match to the input word. The additional match added to the list is a word that is introduced based on the frequency with which the word was previously selected by the system in response to input by the same speaker. The present invention does not use the method of this system. U.S. Pat. No. 6,018,708 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 6,182,039, entitled “METHOD AND APPARATUS USING PROBABLISTIC LANGUAGE MODEL BASED ON CONFUSABLE SETS FOR SPEECH RECOGNITION,” discloses a speech recognizer that generates a list of possible names from input speech by considering a group of acoustic pattern matching sequences. Essentially, because certain letters may be confused with other letters when heard by a speech recognizer (such as “f” and “s”), the speech recognizer considers all possible matches for the input speech. The list is generally compiled using a tree structure or an N-gram structure, or is interactively configured on a network having nodes. The present invention does not operate using the same principles as this system. U.S. Pat. No. 6,182,039 is hereby incorporated by reference into the specification of the present invention.
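As a rough illustration of the confusable-set idea described above, the following Python sketch expands a recognized string into every spelling obtainable by swapping confusable letters. The CONFUSABLE table (including the “m”/“n” pair) and the function names are assumptions made for this example, and a plain Cartesian product is used in place of the tree or N-gram structures mentioned in the reference.

```python
from itertools import product

# Hypothetical confusable sets: letters a recognizer might mistake for each other.
CONFUSABLE = {
    "f": {"f", "s"},
    "s": {"s", "f"},
    "m": {"m", "n"},
    "n": {"n", "m"},
}


def candidate_spellings(heard: str) -> list[str]:
    """List every spelling reachable by substituting confusable letters."""
    choices = [sorted(CONFUSABLE.get(ch, {ch})) for ch in heard.lower()]
    return ["".join(letters) for letters in product(*choices)]


def match_directory(heard: str, directory: list[str]) -> list[str]:
    """Return the directory names that equal any candidate spelling."""
    candidates = set(candidate_spellings(heard))
    return [name for name in directory if name.lower() in candidates]


print(candidate_spellings("sam"))                      # ['fam', 'fan', 'sam', 'san']
print(match_directory("sam", ["Fan", "Sam", "Tom"]))   # ['Fan', 'Sam']
```

The number of candidates grows multiplicatively with each confusable letter, which is one reason such lists are usually held in a more compact structure than a flat enumeration.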
U.S. Pat. No. 6,557,004, entitled “METHOD AND APPARATUS FOR FAST SEARCHING AND HAND-HELD CONTACTS LIST,” discloses a method for searching a database in a handheld device for contacts that match an input data string. The device first searches for first names that match the first name of the data string. The device next searches a “filed as” field, which contains first names and last names, company names, and any other user-definable name choice, for matches to the data string. The results of the two searches are combined to generate the final result. The present invention does not use this method to perform data matching. U.S. Pat. No. 6,557,004 is hereby incorporated by reference into the specification of the present invention.
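The two-pass search summarized above might look roughly like the following Python sketch. The Contact type, the field names, and the use of case-insensitive prefix matching are all assumptions made for illustration; the reference itself is not quoted here as specifying any of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    first_name: str
    filed_as: str  # e.g. "Last, First" or a company name


def search_contacts(contacts: list[Contact], query: str) -> list[Contact]:
    """Search first names, then the "filed as" field, and merge the results."""
    tokens = query.lower().split()
    if not tokens:
        return []
    first_token = tokens[0]
    whole_query = " ".join(tokens)

    # Pass 1: compare the first word of the query against first names.
    first_hits = [c for c in contacts if c.first_name.lower().startswith(first_token)]
    # Pass 2: compare the whole query against the "filed as" field.
    filed_hits = [c for c in contacts if c.filed_as.lower().startswith(whole_query)]

    # Combine both passes, keeping order and dropping duplicates.
    combined: list[Contact] = []
    seen: set[Contact] = set()
    for contact in first_hits + filed_hits:
        if contact not in seen:
            seen.add(contact)
            combined.append(contact)
    return combined
```

For example, with contacts filed as “Smith, Ann” and “Annex Corp”, the query “ann” would return the first through the first-name pass and the second through the “filed as” pass.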
U.S. Pat. No. 6,662,180, entitled “METHOD FOR SEARCHING IN LARGE DATABASES OF AUTOMATICALLY RECOGNIZED TEXT,” discloses a method for determining possible matches for input words in databases. This method is particularly well suited for optical character recognition and speech recognition, where the input words often are not immediately identifiable. With this method, the database is indexed by a trie data structure having a branch node and a number of leaf nodes, the combination of branch and leaf nodes representing a word. The first letter of the input word is identified, and this letter is used to search the database. The probabilities of the words of the trie structure are calculated based on the input word and the letters found in the particular trie structure. These probabilities are used to generate the results list, which includes the best matches. The present invention does not use this method to find matches for input data. U.S. Pat. No. 6,662,180 is hereby incorporated by reference into the specification of the present invention.
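A generic trie lookup of the kind outlined above can be sketched in Python as follows. This is not the structure or scoring of the referenced patent: the class and function names are hypothetical, and the similarity ratio from the standard library's difflib.SequenceMatcher merely stands in for the word probabilities that the reference calculates.

```python
from difflib import SequenceMatcher


class TrieNode:
    def __init__(self) -> None:
        self.children: dict[str, "TrieNode"] = {}
        self.is_word = False


class Trie:
    def __init__(self, words=()) -> None:
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def words_with_prefix(self, prefix: str) -> list[str]:
        """Return every stored word that starts with prefix, sorted."""
        prefix = prefix.lower()
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                found.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(found)


def best_matches(trie: Trie, recognized: str, limit: int = 3) -> list[str]:
    """Use the first letter to pick a branch, then rank that branch's words."""
    if not recognized:
        return []
    target = recognized.lower()
    candidates = trie.words_with_prefix(target[0])
    # Stand-in score: a similarity ratio, not a true probability.
    ranked = sorted(
        candidates,
        key=lambda word: SequenceMatcher(None, target, word).ratio(),
        reverse=True,
    )
    return ranked[:limit]
```

Restricting the search to the branch under the first letter keeps the candidate set small, at the cost of missing words whose first letter was itself misrecognized.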
The difficulty in performing a database search when a query string has a large number of possible alternatives (whether due to misspellings, variant transliterations or abbreviations, or erroneous system recognition such as in speech recognition) lies in determining all possible alternatives and comparing these against the database in an efficient manner. If fewer than all the alternatives are found, it is possible that the desired result will be omitted. If the system performs an inefficient comparison, the system will take an inordinately long time to return results to the user. This problem is particularly complicated when the input is a data string containing multiple terms. It is therefore desirable in the art to have a database search system that is capable of efficiently searching multiple input terms, each term having multiple alternative spellings.
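The scale of the problem stated above can be seen in a short Python sketch of the naive approach, in which every combination of alternatives is generated and compared against the database. The function names and sample data are hypothetical, and the sketch illustrates the difficulty rather than any particular solution to it.

```python
from itertools import product
from math import prod


def count_combinations(alternatives: list[list[str]]) -> int:
    """Number of distinct phrases when each term has several alternatives."""
    return prod(len(options) for options in alternatives)


def naive_search(alternatives: list[list[str]], database: list[str]) -> list[str]:
    """Enumerate every combined phrase and keep those present in the database."""
    entries = {entry.lower() for entry in database}
    hits = []
    for combo in product(*alternatives):
        phrase = " ".join(combo).lower()
        if phrase in entries:
            hits.append(phrase)
    return hits


# Two terms, with 2 and 5 alternative spellings respectively.
alternatives = [
    ["mohd", "mohamed"],
    ["thompson", "thomson", "thomsen", "tomson", "tomsen"],
]
print(count_combinations(alternatives))  # 10
print(naive_search(alternatives, ["Mohamed Thomson", "John Smith"]))  # ['mohamed thomson']
```

Because the number of combinations is the product of the number of alternatives for each term, adding terms or spellings makes exhaustive enumeration grow multiplicatively.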