A search engine is computer software used to search data for specified information. The software of course must reside on hardware for execution. Accordingly, the terminology “search engine system” as used herein references a search engine and its supporting hardware. Some embodiments of search engine systems search for character strings, or “words,” in a set of documents and return lists of the documents containing the words. Other embodiments of search engine systems search a single document for specified words and return indications of where the words are located within the searched document. Some embodiments of search engine systems do both.
It is frequently desired that a search engine achieve two goals: retrieving results accurately and retrieving those results fast. The right innovative combination of many known individual components working together in a new way can markedly improve performance over conventional search engines and thus achieve one or both of those goals. The present disclosure focuses on achieving both goals when executing a specific type of search engine system, as follows.
Consider a visitor to a lengthy web page, a manager accessing a comprehensive electronic ERP (enterprise resource planning) report, or anyone opening a large electronic file with a text editor and observing a large amount of text without immediately seeing the particular item of information that interests him/her. (Hereinafter, for simplicity of discussion and not to limit the disclosure, the male pronouns are used arbitrarily to reference users of either gender.) In such situations, and in many analogous situations, the user can find the information of interest by searching the text for an appropriately chosen text string or word (or name), hereinafter referred to sometimes as a “search argument.” An example would be searching for the word “olga” in a large web page to find the location therein of a portion of text that discusses Olga Korbut's performance in the 1972 Summer Olympic Games.
A problem arises when the user does not spell the search argument correctly. If the search utility retrieves only text that matches the spelling of a search argument, the incorrectly-spelled search argument would not enable retrieval of the information of interest. Misspelled word entry occurs not only when a user, who knows how to spell the word correctly, accidentally enters a typographical error, but also when the user does not know the correct spelling and attempts to search using an incorrectly-spelled entry. A user unsure of a correct spelling may need to conduct multiple searches to find the information of interest.
Regardless of whether a user thinks he knows the correct spelling of a word or instead intentionally guesses at it, the word's sound strongly influences how many users spell the word. Accordingly, algorithms known as “phonetic algorithms” have been developed to index words according to their pronunciations and thus assist users in finding the text that interests them within large documents or within large sets of documents. A search engine implementing a phonetic algorithm is a phonetic search engine, and a phonetic search engine together with its supporting hardware is a phonetic search engine system in the context of the present disclosure.
Accordingly, to operate a phonetic search engine system, a user merely needs to enter how he thinks a word is spelled, and the algorithm retrieves words that sound similar to the entered word. For example, some phonetic algorithms will retrieve both the words “Alan” and “Allen” when a user enters only “Alan.” If “Allen” is the spelling associated with the information that interests the user, the phonetic algorithm enables him to find information that an algorithm retrieving only exact character string matches would not have found with his incorrect spelling.
Even early phonetic algorithms implemented elaborate logic. One early phonetic algorithm, Soundex, produced a code for a search word and also for words within the document to be searched. The Soundex code for a particular word began with the first letter of the word to be coded, and all subsequent vowels and the letters “h”, “y”, and “w” were omitted from the code. Each remaining consonant after the first letter of the word was replaced by a number that was associated with a group of letters that were articulated similarly. For example, the number “1” was associated with the labial consonants “b”, “f”, “p”, and “v”. If the same articulation number was associated with two or more adjacent letters in the word, each such letter after the first was omitted. If the same articulation number was associated with two letters separated by “h” or “w”, the second letter was omitted. If instead two letters of the original word having the same articulation number were separated by a vowel, the articulation number was used twice. The Soundex code consisted of the original first letter followed by three numbers. Zeros were added at the end if the word did not have enough letters suitable for providing three articulation numbers, and no further articulation numbers were added to the Soundex code after the first three were generated. Accordingly, “baby” and “babe” both have the Soundex code B100.
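The Soundex rules described above can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation of any particular system. The description above names only the labial group (“1”); the remaining groups in the sketch are assumed from the commonly published American Soundex table. The function name soundex, the treatment of “y” as a vowel-like separator, and the application of the adjacent-letter rule to the retained first letter are likewise assumptions.

```python
# Articulation groups. Only the labial group ("1") is stated in the text;
# the other groups are ASSUMED from the commonly published Soundex table.
GROUPS = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
CODE = {ch: digit for letters, digit in GROUPS.items() for ch in letters}


def soundex(word):
    """Return a four-character Soundex-style code, or "" if no letters."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # Articulation number of the most recent coded letter ("" if none).
    # Assumption: the retained first letter also takes part in the
    # adjacent-duplicate rule.
    prev = CODE.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # "h" and "w" are omitted and do not break a run
        digit = CODE.get(ch, "")
        if digit == "":
            prev = ""  # a vowel (or "y") is omitted but breaks a run
            continue
        if digit != prev:
            digits.append(digit)
        prev = digit
    # Pad with zeros, then keep the first letter plus three numbers.
    return (first.upper() + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for w in ("baby", "babe", "Alan", "Allen"):
        print(w, soundex(w))
```

Under these assumptions, “baby” and “babe” both yield B100, consistent with the example above, and “Alan” and “Allen” both yield A450, which illustrates how two different spellings can be retrieved from a single entry.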
Although the Soundex phonetic algorithm became widely accepted, search engine systems were nonetheless subsequently developed to retrieve phonetically similar words more accurately from files stored on computer systems. Consider the following example system configured to retrieve from computer files (1) words having the same consonants as those in the search argument and in the same order, (2) words having the same consonants in the same order except for one consonant that is entered incorrectly, and (3) words having the same consonants in the same order except for one missing consonant. While this search engine system indeed meets the goal of increased accuracy over systems implementing the Soundex algorithm, it does not meet the goal of increased speed.
This algorithm is slow, because the algorithm requires, for each search for a particular search argument having N consonants, (1) one search for words having the same consonants in the same order as those of the search argument, (2) 20N searches for words having the same consonants in the same order except for one consonant that is entered incorrectly, and (3) 21(N+1) searches for words having the same consonants in the same order except for one missing consonant. This analysis assumes a search in the English language with 21 consonants. A search in another language or a search in English with a different number of letters designated as consonants for the particular phonetic algorithm would have different quantitative results, but the basic principle is the same.
The reason that 1+20N+21(N+1)=41N+22 searches are necessary to search for one word is the following: For this algorithm, there are 21 designated consonants in the English language. The 20N searches required when checking for words having the same consonants in the same order except for one consonant that is entered incorrectly arise because any of the N consonants in the search argument could have been entered incorrectly, and for each such consonant there are 20 other consonants that could have been intended. The 21(N+1) searches required when the search argument was entered with one missing consonant arise because, for a search argument having N consonants, there are N+1 possible places where the missing consonant was supposed to be, and there are 21 possibilities for the omitted consonant. For example, a search argument having three consonants requires 41(3)+22=145 searches.
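The count derived above can be checked with a short Python sketch that enumerates every consonant pattern the example system would have to search for. The names CONSONANTS, search_count, and candidate_patterns are hypothetical and chosen only for illustration, and the particular set of 21 letters is an assumption, since the text states only that 21 consonants are designated. The sketch counts searches the same way the analysis above does, so patterns that happen to coincide (for example, inserting a consonant on either side of an identical consonant) are not merged.

```python
# ASSUMED set of 21 designated English consonants.
CONSONANTS = "bcdfghjklmnpqrstvwxyz"


def search_count(n, alphabet=len(CONSONANTS)):
    """Searches needed for a search argument with n consonants."""
    exact = 1
    wrong_consonant = (alphabet - 1) * n   # 20N when alphabet == 21
    missing_consonant = alphabet * (n + 1)  # 21(N+1) when alphabet == 21
    return exact + wrong_consonant + missing_consonant


def candidate_patterns(consonants):
    """List every consonant pattern to search for (duplicates kept)."""
    patterns = [consonants]  # (1) same consonants, same order
    # (2) one consonant entered incorrectly: try every other consonant
    for i, entered in enumerate(consonants):
        for c in CONSONANTS:
            if c != entered:
                patterns.append(consonants[:i] + c + consonants[i + 1:])
    # (3) one consonant missing: try every consonant at every position
    for i in range(len(consonants) + 1):
        for c in CONSONANTS:
            patterns.append(consonants[:i] + c + consonants[i:])
    return patterns


if __name__ == "__main__":
    # The consonants of "olga" are "lg", so N = 2.
    print(search_count(2), len(candidate_patterns("lg")))
```

For the two consonants of “olga”, both the formula and the enumeration give 41(2)+22=104 searches, which shows how quickly the cost grows even for a short search argument.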
Executing 41N+22 searches in response to one word entered as the search argument is quite resource-intensive. Accordingly, the present inventors endeavored to develop a new phonetic search engine system that met both goals, accuracy and speed. They realized that the goals could be met by using new components, known components combined in new synergistic ways, or a combination of both.