1. Field of the Invention
The present invention generally relates to the field of information retrieval, and more specifically to the problem of automatically identifying compounds (such as bigrams or more generally n-grams) in search queries to facilitate generating better search results.
2. Related Art
The World Wide Web (web) contains a vast amount of freely available information. However, locating a relevant item of information on the web can be a challenging task, and the magnitude of this problem is increasing as the amount of information available on the web continues to grow.
Search engines can often help users to locate and retrieve a document of interest on the web. They typically operate by receiving a query containing a set of search terms and then looking up web pages containing the search terms (or related terms). During this process, it is useful to know whether two consecutive words constitute a bigram. For example, the words “San Francisco” form a bigram because they typically appear together in sequence and collectively represent a single location: the city of San Francisco. Knowledge about whether consecutive words in a query constitute a bigram can be used to effectively narrow search results. For example, if “San Francisco” is known to be a bigram and appears in a query, then the search engine can ignore web pages in which the terms “San” and “Francisco” appear separately and are not part of the bigram “San Francisco.” (Note that a bigram is a special case of an n-gram, which is also referred to as a “compound”.)
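The narrowing step described above can be illustrated with a minimal sketch. It assumes pages are plain strings and that the bigram is already known; the helper names `tokenize`, `contains_bigram`, and `filter_pages` are hypothetical and are not taken from any particular search engine.

```python
PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase the text, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(PUNCTUATION) for w in text.lower().split())
    return [w for w in words if w]


def contains_bigram(tokens, bigram):
    """Return True if the two words of the bigram appear adjacently and in order."""
    first, second = bigram
    return any(a == first and b == second for a, b in zip(tokens, tokens[1:]))


def filter_pages(pages, bigram):
    """Keep only the pages in which the bigram occurs as a consecutive pair."""
    return [page for page in pages if contains_bigram(tokenize(page), bigram)]


pages = [
    "Hotels in San Francisco near the bay",
    "San Diego and Francisco Pizarro",  # both words occur, but not together
]
kept = filter_pages(pages, ("san", "francisco"))
```

Here only the first page is kept, because the second page contains “San” and “Francisco” separately rather than as the bigram.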
Furthermore, knowledge about bigrams can be used to “score” web pages, which facilitates returning web pages that are most likely to be of interest to a user. For example, consider the bigram “larry page”. The individual unigrams “larry” and “page” do not mean much by themselves, while the bigram “larry page” tells us much more about the meaning of a query containing this bigram. Note that while scoring web pages, we do not have to ignore pages which do not contain the bigram; we can simply promote pages which contain it. Another example is the query “real estate post office new lebanon new york”. Not every pair of consecutive terms in this query forms a bigram. In this query, only “real estate”, “post office”, “new lebanon” and “new york” form meaningful compounds.
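The two ideas in the preceding paragraph, splitting a query into compounds and promoting pages that contain them, can be sketched as follows. This is only an illustration under stated assumptions: the set `KNOWN_BIGRAMS` is supplied by hand, the greedy left-to-right segmentation and the `bonus` weight are arbitrary choices, and the names `segment_query` and `score_page` are hypothetical.

```python
# Assumed to be given; identifying this set automatically is the open problem.
KNOWN_BIGRAMS = {
    ("real", "estate"),
    ("post", "office"),
    ("new", "lebanon"),
    ("new", "york"),
    ("larry", "page"),
}


def segment_query(query, known_bigrams):
    """Greedily group consecutive query words that form a known bigram."""
    tokens = query.lower().split()
    segments = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in known_bigrams:
            segments.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            segments.append(tokens[i])
            i += 1
    return segments


def score_page(page_text, segments, bonus=2.0):
    """Give one point per query word found, plus a bonus per bigram found intact."""
    page_tokens = page_text.lower().split()
    page_words = set(page_tokens)
    page_pairs = set(zip(page_tokens, page_tokens[1:]))
    score = 0.0
    for segment in segments:
        words = segment.split()
        score += sum(1.0 for w in words if w in page_words)
        if len(words) == 2 and tuple(words) in page_pairs:
            score += bonus  # promote pages containing the bigram
    return score


segments = segment_query(
    "real estate post office new lebanon new york", KNOWN_BIGRAMS
)
```

With this set, `segments` contains the four compounds named above, and pairs such as “estate post” or “office new” are never grouped. A page in which “larry” and “page” appear separately still receives a score from `score_page`; it is simply ranked below a page that contains the bigram.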
Unfortunately, it is often hard to determine whether consecutive words in a query are part of a compound or are separate words. For example, it is hard to distinguish a word pair such as “London hotels”, which is not a good bigram, from a good bigram such as “Gallery Hotel”, which is the name of a hotel. In some cases, it is possible to examine an encyclopedia or dictionary to identify specific compounds. However, these information sources do not contain a comprehensive list of compounds, and they also might not contain compounds that came into existence recently, such as names of celebrities or new video games.
Hence, what is needed is a method and an apparatus for automatically identifying compounds.