Embodiments of the present invention generally relate to the field of information retrieval, and more specifically to the task of identifying valid synonyms for query terms to facilitate retrieving documents which relate to the query terms.
The relentless growth of the Internet makes locating relevant information on the World Wide Web (the Web) an increasingly challenging task. While search engines can help users locate and retrieve a document of interest on the Web, users often fail to select effective query terms during the search. The problem of finding desired query results becomes increasingly challenging as the amount of information available on the Web continues to grow.
For example, a user may enter the query [Web hosting+fort wayne] when the city of Fort Wayne is usually referred to as Ft. Wayne. Or, a user may enter [free loops for flash movie] when most relevant pages use the term “music,” rather than “loops” and the term “animation” rather than “movie.” Thus, documents that satisfy a user's informational needs may use different terms than the specific query terms chosen by the user. This problem is further aggravated as the number of terms in a query increases. For queries longer than three or four words, there is a strong likelihood that at least one of the terms is not the best term to describe the user's intended search. It is therefore desirable for a search engine to automatically modify and/or expand user queries to include synonyms for query terms, so that retrieved documents can better meet the user's informational needs.
This task has proven to be difficult. A simple approach is to use pre-constructed synonym information, for example, from a thesaurus or a structured lexical database. However, thesaurus-based systems have various problems, such as being costly to construct and being restricted to one language.
Some systems consider how often users substitute terms for one another during query sessions to determine whether the terms are synonyms. However, such substitutions can create false synonyms that are not meaningful, and which lead to unrelated or non-useful query results.
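By way of illustration only, the substitution-counting approach described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular system: the function names `count_substitutions` and `candidate_synonyms`, the `min_count` threshold, and the sample sessions are all hypothetical. The sketch assumes that a substitution is a pair of consecutive queries in one session that differ in exactly one term position.

```python
from collections import Counter


def count_substitutions(sessions):
    """Count single-term substitutions between consecutive queries.

    `sessions` is a list of sessions; each session is a list of query
    strings in the order the user entered them (hypothetical format).
    """
    counts = Counter()
    for session in sessions:
        for prev, curr in zip(session, session[1:]):
            a, b = prev.split(), curr.split()
            if len(a) != len(b):
                continue  # only compare queries of equal length
            diffs = [(x, y) for x, y in zip(a, b) if x != y]
            if len(diffs) == 1:  # exactly one term was swapped
                counts[diffs[0]] += 1
    return counts


def candidate_synonyms(counts, min_count=2):
    """Keep term pairs substituted at least `min_count` times (assumed threshold)."""
    return {pair for pair, n in counts.items() if n >= min_count}


# Hypothetical session data for illustration.
sessions = [
    ["free loops for flash movie", "free music for flash movie"],
    ["flash loops", "flash music"],
    ["cheap flights paris", "cheap hotels paris"],
]
counts = count_substitutions(sessions)
strict = candidate_synonyms(counts, min_count=2)
loose = candidate_synonyms(counts, min_count=1)
```

With the sample data, the stricter threshold keeps only the pair ("loops", "music"), whereas the looser threshold also admits ("flights", "hotels"), a substitution that reflects a change of intent rather than a synonym. This is the kind of false synonym noted above.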
Accordingly, what is needed is a method and an apparatus that identify potential synonyms without the above-described problems.