The World Wide Web (“web”) contains a vast amount of information. Locating a desired portion of the information, however, can be challenging. This problem is compounded because both the amount of information on the web and the number of new users (who are typically inexperienced at web searching) are growing rapidly. Search engines can help users to locate and retrieve documents of interest.
Users attempt to express their information needs with search queries, but they often fail to choose effective query terms. For example, a user may enter the query [web hosting+fort wayne] when the city of Fort Wayne is usually referred to as Ft. Wayne. Or, a user may enter [free loops for flash movie] when most relevant pages use the term “music” rather than “loops,” or the term “animation” rather than “movie.”
Thus, documents that satisfy a user's information need may use different words than the query terms chosen by the user to express the concept of interest. Since search engines typically rate documents based on how prominently the user's query terms appear in the documents, a search engine may not return the most relevant documents in such situations (since the most relevant documents may not contain the user's query terms prominently, or at all). This problem becomes progressively more serious as the number of terms in a query increases. For queries longer than three or four words, there is a strong likelihood that at least one of the words is not the best term to describe the user's information need.
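The vocabulary mismatch described above can be illustrated with a minimal sketch. The scoring function below is a toy assumption (a raw count of query-term occurrences), not the ranking function of any actual search engine, and the two documents are invented for illustration. The names `tokenize` and `term_score` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_score(query, document):
    """Toy prominence measure: total occurrences of each distinct query term."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in set(tokenize(query)))


query = "free loops for flash movie"

# Invented documents: the first satisfies the information need but uses
# "music" and "animation"; the second merely repeats the query words.
docs = {
    "relevant": "Download free music for your flash animation projects.",
    "off_topic": "Free movie night: watch the loops of a roller coaster in a flash movie.",
}

scores = {name: term_score(query, text) for name, text in docs.items()}
print(scores)
```

Under these assumptions the off-topic document outscores the relevant one, because the relevant document never contains “loops” or “movie.”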
As a consequence, there is a need for a method to modify or expand user queries to include or substitute synonymous query terms, so that retrieved documents may better meet the user's information needs. Solving this problem has proven to be difficult.
A simple approach to query expansion is to use pre-constructed synonym information, such as from a thesaurus or a structured lexical database like WordNet. However, thesaurus-based approaches have various problems. For one, such resources are expensive to construct. Even when available, they are generally restricted to one language; meanwhile, there is a need to accommodate many languages, and to obtain synonym sets for each language.
A more significant issue is that the applicability of a synonym for a given term often depends strongly on the context in which the term is used. For example, “music” is not usually a good synonym for “loops,” but it is a good synonym in the context of the example query above. Further, this case is sufficiently special that “music” is not listed as a synonym for “loop” in standard thesauruses; many other examples of context-dependent, non-traditional synonyms can be easily identified. And even when conventional synonyms can be identified for a term, it can be difficult to identify which particular synonyms to use in the particular context of the query.
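The context dependence described above can be made concrete with a short sketch. The rule table is hand-written and purely hypothetical; it is shown only to illustrate what “applicable in context” means, and it does not represent an automatic method. The names `CONTEXT_RULES` and `expand` are assumptions.

```python
# Hypothetical, hand-written rules:
# (term, substitute, context words that must also appear in the query).
CONTEXT_RULES = [
    ("loops", "music", {"flash"}),
    ("movie", "animation", {"flash"}),
]


def expand(query_terms):
    """Append a substitute only when its required context is in the query."""
    present = set(query_terms)
    expanded = list(query_terms)
    for term, substitute, context in CONTEXT_RULES:
        if term in present and context <= present:
            expanded.append(substitute)
    return expanded


in_context = expand(["free", "loops", "for", "flash", "movie"])
out_of_context = expand(["fruit", "loops", "cereal"])
print(in_context)
print(out_of_context)
```

In this sketch “music” is added for the flash-related query but not for the cereal query, which a context-free thesaurus lookup could not distinguish.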
Other conventional approaches cluster “related words.” Such approaches suffer from the drawback that related words are not necessarily synonyms. For example, “sail” and “wind” would likely be clustered (because both occur together in numerous documents), but they are not synonymous. Substituting one for the other would lead to undesirable results.
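A minimal sketch of why co-occurrence clustering conflates relatedness with synonymy follows. It assumes a Jaccard overlap of document sets as the relatedness measure, which is one common choice and not necessarily what any particular conventional system uses. The corpus is invented, and the names `build_index` and `jaccard` are hypothetical.

```python
def build_index(documents):
    """Map each word to the set of document ids in which it appears."""
    index = {}
    for doc_id, text in enumerate(documents):
        for word in set(text.split()):
            index.setdefault(word, set()).add(doc_id)
    return index


def jaccard(word_a, word_b, index):
    """Overlap of the two words' document sets, in the range 0.0 to 1.0."""
    docs_a = index.get(word_a, set())
    docs_b = index.get(word_b, set())
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


# Invented toy corpus.
corpus = [
    "the sail caught the wind",
    "strong wind tore the sail",
    "trim the sail when the wind shifts",
    "the car needs fuel",
    "an automobile needs fuel",
]

index = build_index(corpus)
related_score = jaccard("sail", "wind", index)
synonym_score = jaccard("car", "automobile", index)
print(related_score, synonym_score)
```

In this toy corpus “sail” and “wind” receive the maximum score although they are not interchangeable, while “car” and “automobile,” which never share a document here, receive a score of zero.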
Accordingly, what is needed is an automatic method that identifies potential synonyms, and that can determine contexts in which they are applicable.