1. Field of the Invention
The present invention generally relates to the field of information retrieval, and more specifically to the task of identifying synonyms for a query phrase to facilitate retrieving documents for queries that contain the query phrase.
2. Related Art
The World Wide Web (web) contains a vast amount of freely available information. However, locating a relevant item of information on the web can be a challenging task, and this problem becomes more severe as the amount of information available on the web continues to grow.
Search engines can often help users to locate and retrieve a document of interest on the web. However, users often fail to select effective query terms during the searching process. For example, a user may enter the query [web hosting+fort wayne] when the city of Fort Wayne is usually referred to as Ft. Wayne. Or, a user may enter [free loops for flash movie] when most relevant pages use the term “music,” rather than “loops” and the term “animation” rather than “movie.” Thus, documents that satisfy a user's informational needs may use different terms than the specific query terms chosen by the user to express a concept of interest. Note that this problem becomes more of an issue as the number of terms in a query increases. For queries longer than three or four words, there is a strong likelihood that at least one of the terms is not the best term to describe the user's informational need.
Hence, there is a need to modify and/or expand user queries to include synonyms for query terms, so that retrieved documents will better meet the user's informational needs.
Unfortunately, solving this problem has proven to be a difficult task. A simple approach is to use pre-constructed synonym information, for example from a thesaurus or a structured lexical database. However, thesaurus-based systems have various problems. For example, they are often expensive to construct, and are generally restricted to one language.
A more significant issue is that the applicability of a synonym to a given phrase often strongly depends on the context in which the phrase is used. For example, the term “music” is not usually a good synonym for the term “loops,” but it is a good synonym in the context of the example above. However, the context in the example above is sufficiently uncommon that the term “music” is not listed as a synonym for the term “loops” in standard thesauruses. Note that many other examples of contextually dependent non-traditional synonyms can be identified. Hence, even if conventional synonyms can be identified for a term, it may be difficult to identify specific synonyms to use in the context of a specific query.
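The contrast between a context-free thesaurus lookup and a context-dependent one can be sketched as follows. This is only a minimal illustration under stated assumptions: the tables THESAURUS and CONTEXTUAL, their entries, and the function names are hypothetical and were invented for this example; they do not come from any real lexical database or from any particular prior-art system.

```python
# Hypothetical data, invented for illustration; not taken from any real thesaurus.
THESAURUS = {
    "loops": ["rings", "coils", "circuits"],
    "movie": ["film", "picture"],
}

# Hypothetical context table: term -> list of (required context words, synonym).
CONTEXTUAL = {
    "loops": [({"flash"}, "music")],
    "movie": [({"flash"}, "animation")],
}


def thesaurus_synonyms(term):
    """Context-free lookup: returns the same list for every query."""
    return list(THESAURUS.get(term, []))


def contextual_synonyms(term, query):
    """Return synonyms whose required context words all appear in the query."""
    words = set(query.lower().split())
    return [syn for required, syn in CONTEXTUAL.get(term, []) if required <= words]


if __name__ == "__main__":
    query = "free loops for flash movie"
    print(thesaurus_synonyms("loops"))
    print(contextual_synonyms("loops", query))
    print(contextual_synonyms("loops", "belt loops for jeans"))
```

In this sketch the term “music” is offered for “loops” only when the query also contains “flash,” which the context-free table cannot express.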
Other conventional approaches for identifying synonyms cluster “related words.” Such approaches suffer from the drawback that related words are not necessarily synonyms. For example, the words “sail” and “wind” would likely be clustered together (because they co-occur in numerous documents); however, they are not synonymous. Hence, substituting one for the other is likely to lead to undesirable search results.
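The drawback of co-occurrence-based relatedness can be illustrated with a small sketch. It assumes a toy corpus (DOCS) invented for this example and uses the Jaccard overlap of the document sets containing each word as a stand-in relatedness score; actual clustering approaches may use other measures.

```python
# Toy corpus, invented for illustration only.
DOCS = [
    "the sail caught the wind and the boat moved",
    "strong wind can tear a sail",
    "trim the sail when the wind shifts",
    "the wind turbine produces power",
    "a music loop for a flash animation",
]


def jaccard(a, b, docs=DOCS):
    """Jaccard overlap of the sets of documents that contain words a and b."""
    token_sets = [set(d.split()) for d in docs]
    with_a = {i for i, s in enumerate(token_sets) if a in s}
    with_b = {i for i, s in enumerate(token_sets) if b in s}
    union = with_a | with_b
    return len(with_a & with_b) / len(union) if union else 0.0


if __name__ == "__main__":
    print(jaccard("sail", "wind"))   # 0.75: strongly related by co-occurrence
    print(jaccard("sail", "music"))  # 0.0: never co-occur in this corpus
```

A high score here indicates only that “sail” and “wind” tend to appear in the same documents; it says nothing about whether one can replace the other in a query.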
Moreover, two or more terms in a query phrase are sometimes linked, e.g., because of language agreement rules, so that multiple terms in the query typically change simultaneously. A system that analyzes changes of individual words (or unigrams) in a given context may not detect synonym mappings that encompass such simultaneous changes for multiple words.
Furthermore, some existing approaches for identifying a multi-term synonym for a query phrase may lead to undesirable search results. For instance, in some existing approaches, synonyms that are generated automatically for a multi-term query phrase can drop important query terms and, as a result, produce overly general search results.
Accordingly, what is needed is a method and an apparatus that identify synonyms for query terms and/or query phrases without the problems described above.