1. Field of the Invention
The present invention relates in general to computer-implemented spelling correction and more particularly to a spelling correction system and method for phrasal strings, such as a search string used in a search engine query.
2. Related Art
Spelling correction (or spell checking) is a widely used tool in computer applications (such as word processing applications) that verifies the correct spelling of words in written documents. A typical spelling correction technique works by encountering a word that is not in a dictionary of words (e.g. a potentially misspelled word to be checked). The potentially misspelled word is compared to words in the dictionary and the word representing the closest match is returned as the correct spelling of the word. Current spelling correction techniques were developed for spelling correction of text documents and identify words within the text by determining where spaces occur. In other words, a word within the text document is identified as a space-delimited string. In general, a space-delimited string includes a string of letters that is set apart by spaces or punctuation characters. This means that current spelling correction techniques consider the letters between the spaces (i.e. a space-delimited string) to be a “word”.
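By way of illustration only, the conventional technique described above can be sketched as follows. The function names, the toy dictionary and the use of Python's standard `difflib.get_close_matches` as the "closest match" measure are assumptions made for this sketch; the description above does not specify a particular similarity measure.

```python
import difflib
import re


def space_delimited_strings(text):
    """Return the runs of letters set apart by spaces or punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def correct(text, dictionary):
    """Map each out-of-dictionary string to its closest dictionary entry.

    A string found in the dictionary is assumed to be spelled correctly.
    A string with no sufficiently close entry maps to None.
    """
    corrections = {}
    for token in space_delimited_strings(text.lower()):
        if token in dictionary:
            continue
        # Assumption: difflib's similarity ratio stands in for "closest match".
        matches = difflib.get_close_matches(token, sorted(dictionary), n=1)
        corrections[token] = matches[0] if matches else None
    return corrections


DICTIONARY = {"the", "quick", "brown", "fox"}  # hypothetical toy dictionary

print(correct("The quikc, brown fox.", DICTIONARY))  # {'quikc': 'quick'}
```

Note that the sketch never looks across a space: each space-delimited string is checked and corrected in isolation.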
There are numerous spelling correction techniques currently available. For example, some existing spelling correction techniques are based on a framework that allows a spelling provider to manually specify the allowable edit types and the weights that are associated with each edit. Other types of existing spelling correction techniques use context to determine the correct word and first pre-compute a set of edit-neighbors for every word within the dictionary. In these edit-neighbor techniques, a word is an edit-neighbor of another word if the word can be derived from the other word in a single edit. In this situation, an edit is defined as a single letter insertion, substitution or deletion, or a letter-pair transposition. For every word in a document, the edit-neighbor technique determines whether any edit-neighbor of a word is more likely to appear in that context than the word that was typed. All edit-neighbors of a word are assigned equal probability of having been the intended word, and then the context is used to determine which word to select. Still another type of spelling correction technique computes the probability of an intended word w given a string s by automatically deriving probabilities for all edits. Similar to the edit-neighbor technique, this probabilistic technique pre-computes a set of allowable edits and only considers words w that are a single edit away from string s. Yet another spelling correction technique is a learning string edit technique that learns the probabilities for all edits, where edits are limited to single character insertion, deletion and substitution. This learning string edit technique allows for a string s to be derived from a word w by any number of edits.
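The edit-neighbor pre-computation described above can be sketched as follows, under stated assumptions: the alphabet is restricted to lowercase ASCII letters, the function names and toy dictionary are hypothetical, and the context-based selection step is omitted. An edit here is a single letter insertion, substitution or deletion, or a transposition of an adjacent letter pair.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: lowercase ASCII letters only


def single_edits(word):
    """Return every string that is exactly one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {left + right[1:] for left, right in splits if right}
    transpositions = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    substitutions = {
        left + c + right[1:] for left, right in splits if right for c in ALPHABET
    }
    insertions = {left + c + right for left, right in splits for c in ALPHABET}
    return (deletions | transpositions | substitutions | insertions) - {word}


def edit_neighbors(dictionary):
    """Pre-compute, for each dictionary word, the dictionary words one edit away."""
    return {word: sorted(single_edits(word) & dictionary) for word in dictionary}


DICTIONARY = {"form", "from", "farm", "for", "forms", "cat"}  # hypothetical
NEIGHBORS = edit_neighbors(DICTIONARY)

print(NEIGHBORS["form"])  # ['farm', 'for', 'forms', 'from']
print(NEIGHBORS["cat"])   # []
```

In this toy dictionary, "form" has one neighbor for each edit type: "farm" (substitution), "for" (deletion), "forms" (insertion) and "from" (transposition). A context model, not shown, would then choose among the typed word and its neighbors.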
All of these spelling correction techniques were developed with text document processing in mind. These techniques do not work well, however, when they are applied to strings of misspelled words. These strings are typically phrases and are known as phrasal strings. A phrasal string is a string of characters that is not necessarily space-delimited or punctuation-delimited. In other words, the string is not necessarily delimited wherever spaces occur. An error-filled phrasal string may contain unreliable spaces and many nonstandard words. For purposes of this application, nonstandard words include words that are not contained in a standard dictionary or that may have a different meaning than commonly assigned to the words. By way of example, a search engine generally has a type-in text box whereby a phrasal string may be entered by a user. Spelling correction is important for search engines because most queries are short, so even a single misspelled word or misplaced space can be extremely problematic and can lead to highly erroneous results. Current spelling correction techniques, however, perform poorly when used for spelling correction in applications using phrasal strings. This is because current techniques perform spelling correction based on space-delimited strings and are unreliable and ineffective in situations where a user is typing a phrasal string (such as a search engine query) and accidentally inserts or omits a space.
By way of example, assume that a user inputs a line of text into a search engine query that contains the misspelled phrase “the backs treetboys”. If one of the current spelling correction techniques is used to spell correct the textual line, then when the misspelled phrase is encountered the spelling correction technique will determine whether each space-delimited string (i.e. “the”, “backs” and “treetboys”) is in the dictionary. Because the space-delimited strings “the” and “backs” are in the dictionary, the spelling correction technique will assume them to be correctly spelled. Moreover, because the space-delimited string “treetboys” is not in the dictionary, the spelling correction technique will assume that the string is misspelled and search for the closest match in the dictionary. It is likely that a close match will not be found and a response such as “Not in Dictionary” will be returned to the user. This can be especially frustrating to a user who had intended to enter in the query box of the search engine the phrase “the backstreet boys” and accidentally placed the space at the wrong location.
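The failure described in this example can be reproduced with a minimal sketch of a space-delimited checker. The dictionary contents, the function names and the acceptance threshold of two edits are assumptions chosen for illustration; the distance used is the ordinary Levenshtein distance over single letter insertions, deletions and substitutions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(
                min(
                    prev[j] + 1,               # delete ca
                    cur[j - 1] + 1,            # insert cb
                    prev[j - 1] + (ca != cb),  # substitute, or match at no cost
                )
            )
        prev = cur
    return prev[-1]


def check_query(query, dictionary, max_distance=2):
    """Check each space-delimited string of query on its own.

    max_distance is a hypothetical threshold for accepting a closest match.
    """
    results = []
    for token in query.split():
        if token in dictionary:
            results.append((token, "ok"))
            continue
        best = min(sorted(dictionary), key=lambda w: edit_distance(token, w))
        if edit_distance(token, best) <= max_distance:
            results.append((token, best))
        else:
            results.append((token, "Not in Dictionary"))
    return results


# Hypothetical dictionary that even contains both intended words.
DICTIONARY = {"the", "backs", "backstreet", "boys", "street"}

print(check_query("the backs treetboys", DICTIONARY))
# [('the', 'ok'), ('backs', 'ok'), ('treetboys', 'Not in Dictionary')]
```

Even with "backstreet" and "boys" both present in the dictionary, the checker accepts "backs" as correct and finds nothing close to "treetboys", because no step considers moving the space.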
Another problem with using current spelling correction techniques in a search engine environment is that a dictionary is difficult to define. This is because some terms (especially proper names) used in search engine queries do not occur in standard dictionaries. Another problem is that the concept of a word is much looser in a search engine query (as compared to a word processing application) and users will often omit spaces between words. Still another problem is that case information is considerably less reliable in a search engine query than in a word processing application. This makes it difficult to know whether a potentially incorrect word not found in the dictionary is a proper name or a misspelled word. Because current spelling correction techniques use a dictionary containing only single words (or space-delimited strings), these existing techniques are ineffective when applied to phrasal strings.
Still another problem with using current spelling correction techniques for phrasal strings is that the word dictionary used by current spelling correction techniques is essentially static while phrasal strings used in search queries are dynamic. Existing techniques use a static dictionary containing words (or space-delimited strings) for a certain language (such as English). There is no need for the dictionary to be constantly updated because new words are added to the language and old words fall out of use relatively slowly over a long period of time. Over a reasonable period of time, therefore, a dictionary for a language can be considered static. Conversely, phrasal strings used in search queries are quite dynamic, with the probability of certain phrases being used in a query varying widely from one day to the next. For example, if a major world event suddenly occurred then phrases pertaining to that event would be used often. On the other hand, a week later the phrases pertaining to that event may be old news and replaced with phrases corresponding to a more current event. In addition, search engine query terms often include jargon that does not have the standard dictionary meaning. A static dictionary cannot update itself to reflect current high-use search phrases and new meanings for words.
What is needed, therefore, is a system and method for spelling correction that is capable of operating on phrasal strings and is not limited to a single word or space-delimited text. The spelling correction system and method must be effective for spell correcting phrasal strings, such that the problem of erroneous insertions and deletions of spaces in a phrasal string can be overcome. Moreover, what is needed is a spelling correction system and method that includes a phrasal dictionary whose entries can be arbitrary word strings and phrases instead of merely single words. In addition, what is needed is a phrasal dictionary that is dynamic and is capable of being updated to include high-use phrases and words.