The present invention generally relates to the field of information retrieval, and more specifically to the task of identifying synonyms for words to facilitate retrieving documents in response to queries that contain the words.
The World Wide Web (web) contains a vast amount of freely available information. However, locating a relevant item of information on the web can be a challenging task, and this problem grows as the amount of information available on the web continues to increase.
Search engines can often help users to locate and retrieve a document of interest on the web. However, users often fail to select effective query terms during the searching process. For example, a user may enter the query [web hosting+fort wayne] when the city of Fort Wayne is usually referred to as Ft. Wayne. Or, a user may enter [free loops for flash movie] when most relevant pages use the term “music” rather than “loops,” and the term “animation” rather than “movie.” Thus, documents that satisfy a user's informational needs may use different terms than the specific query terms chosen by the user to express a concept of interest. This problem becomes more pronounced as the number of terms in a query increases. For queries longer than three or four words, there is a strong likelihood that at least one of the terms is not the best term to describe the user's informational need.
Hence, there is a need to modify and/or expand user queries to include synonyms for query terms, so that retrieved documents will better meet the user's informational needs.
Unfortunately, solving this problem has proven to be a difficult task. A simple approach is to use pre-constructed synonym information, for example from a thesaurus or a structured lexical database. However, thesaurus-based systems have various problems. For example, they are often expensive to construct, and are generally restricted to one language.
Some systems consider how often terms are substituted for each other during query sessions to determine whether the terms are synonyms. However, for rare words and rare languages, not enough query data exists to identify synonyms in this way.
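The substitution-based approach described above can be illustrated with a minimal sketch. The sketch assumes that a substitution occurs when two consecutive queries in a session have the same number of terms and differ in exactly one position; the function name `count_substitutions`, the threshold `MIN_COUNT`, and the sample sessions are hypothetical and are not taken from any particular system.

```python
from collections import Counter

MIN_COUNT = 2  # hypothetical threshold for treating a pair as a synonym candidate


def count_substitutions(sessions):
    """Count term pairs swapped between consecutive queries in a session.

    Assumption: a substitution is counted only when two consecutive queries
    have the same number of terms and differ in exactly one position.
    """
    counts = Counter()
    for session in sessions:
        for prev, curr in zip(session, session[1:]):
            a, b = prev.split(), curr.split()
            if len(a) != len(b):
                continue
            diffs = [(x, y) for x, y in zip(a, b) if x != y]
            if len(diffs) == 1:
                # Store the pair in sorted order so (car, auto) == (auto, car).
                counts[tuple(sorted(diffs[0]))] += 1
    return counts


# Made-up query sessions, for illustration only.
sessions = [
    ["cheap car rental", "cheap auto rental"],
    ["car insurance", "auto insurance"],
    ["okapi habitat", "okapi range"],
]

counts = count_substitutions(sessions)
candidates = {pair for pair, n in counts.items() if n >= MIN_COUNT}
```

In this sketch the pair observed in two sessions passes the threshold, while the pair involving the rarer query is observed only once and is dropped, which mirrors the data sparsity limitation noted above.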
Other systems consider stemming relationships to identify synonyms. However, stemming is not always accurate, because words that share a stem are not necessarily related in meaning. For example, the words “university” and “universal” share the same stem, but have very different meanings. Furthermore, many good synonyms are not covered by stemming, such as “wolfs” and “wolves,” or “wales” and “welsh.”
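Both failure modes of stemming can be reproduced with a toy suffix-stripping stemmer. The following is a minimal sketch only: the suffix list and the names `toy_stem` and `same_stem` are illustrative assumptions, and the code is not an implementation of any established stemming algorithm.

```python
# Hypothetical suffix list for illustration; real stemmers use richer rules.
SUFFIXES = ("ity", "al", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def same_stem(a: str, b: str) -> bool:
    """Return True when both words reduce to the same toy stem."""
    return toy_stem(a) == toy_stem(b)


# False positive: unrelated words collapse to one stem.
conflated = same_stem("university", "universal")

# False negatives: related words end up with different stems.
missed_plural = same_stem("wolfs", "wolves")
missed_demonym = same_stem("wales", "welsh")
```

Under these assumed rules, “university” and “universal” are conflated, whereas “wolfs” and “wolves,” and “wales” and “welsh,” are not matched.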
Accordingly, what is needed is a method and an apparatus that identify potential synonyms to facilitate searching operations without the above-described problems.