A. Field of the Invention
The present invention relates generally to information location and, more particularly, to search engines that locate information on the World Wide Web.
B. Description of Related Art
The World Wide Web (“web”) contains a vast amount of information. Locating a desired portion of the information, however, can be challenging. This problem is compounded because the amount of information on the web and the number of new users inexperienced at web research are growing rapidly.
People generally surf the web based on its link graph structure, often starting with high-quality human-maintained indices or search engines. Human-maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and unable to cover all esoteric topics.
Automated search engines, in contrast, locate web sites by matching search terms entered by the user to a pre-indexed corpus of web pages. Generally, the search engine returns a list of web sites sorted based on relevance to the user's search terms. Determining the correct relevance, or importance, of a web page to a user, however, can be a difficult task. For one thing, the importance of a web page to the user is inherently subjective and depends on the user's interests, knowledge, and attitudes. There is, however, much that can be determined objectively about the relative importance of a web page. Conventional methods of determining relevance are based on matching a user's search terms to terms indexed from web pages. More advanced techniques determine the importance of a web page based on more than the content of the web page. For example, one known method, described in the article entitled “The Anatomy of a Large-Scale Hypertextual Search Engine,” by Sergey Brin and Lawrence Page, assigns a degree of importance to a web page based on the link structure of the web page.
Multiple search terms entered by a user are often more useful if considered by the search engine as a single compound unit. Assume that a user enters the search terms “baldur's gate download.” The user intends for this query to return web pages that are relevant to the user's intention of downloading the computer game called “baldur's gate.” Although “baldur's gate” includes two words, the two words together form a single semantically meaningful unit. If the search engine is able to recognize “baldur's gate” as a single semantic unit, called a compound herein, the search engine is more likely to return the web pages desired by the user.
For example, one application for compound units in a search engine might be to modify the ranking component of the search engine, so that documents containing the compound are considered more relevant than documents that contain the individual words but not the compound.
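The ranking modification described above can be illustrated with a minimal sketch. The scoring function, the `boost` value, and the sample documents are hypothetical and chosen only for illustration; they do not represent the ranking component of any particular search engine.

```python
def contains_phrase(words, phrase_words):
    """Return True if phrase_words occurs as a contiguous run inside words."""
    n = len(phrase_words)
    return any(words[i:i + n] == phrase_words for i in range(len(words) - n + 1))


def score(document, query_terms, compound, boost=2.0):
    """Toy relevance score: one point per matched query term, plus a
    hypothetical bonus when the compound appears as an intact phrase."""
    words = document.lower().split()
    base = sum(1.0 for term in query_terms if term in words)
    if contains_phrase(words, compound.lower().split()):
        base += boost  # assumed bonus; the value 2.0 is arbitrary
    return base


# Both documents contain all three query words, but only the second
# contains "baldur's gate" as a compound.
docs = [
    "the gate of the city was named after baldur's father download map",
    "baldur's gate download page for the classic game",
]
query_terms = ["baldur's", "gate", "download"]
ranked = sorted(
    docs,
    key=lambda d: score(d, query_terms, "baldur's gate"),
    reverse=True,
)
```

In this sketch both documents match every individual search term, so a purely term-based score cannot separate them; the compound bonus is what moves the second document to the top.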
Another application may be to suggest alternate queries that extend, shorten, or replace words in the current query in minor ways based on prior queries logged by the system. To be useful, such an application should suggest semantically meaningful alternatives. In the above “baldur's gate” example, a semantically meaningful alternative may be “baldur's gate reviews” (i.e., written reviews of the game).
Conventionally, the identification of compounds in queries has focused on identifying compounds based on a list of previously identified compounds and statistics describing the relative frequency of occurrence of the compounds. Two approaches have commonly been used to construct such a list of compounds.
The first approach involves extracting compounds from the corpus of documents. In this approach, the documents are processed and word sequences that occur with a statistically significant frequency are identified as compounds. A disadvantage of this approach is that it is inefficient, because there are many more compounds in the corpus than would typically occur in user queries. Thus, only a small fraction of the detected compounds is useful in practice. This is particularly true in a highly multi-lingual and diverse corpus such as the web. Identifying all compounds on the web is computationally difficult and would require considerable amounts of storage. Additionally, determining when a compound is statistically significant can be problematic. Many compounds of interest, e.g., names, may occur relatively infrequently, thus making it hard to accumulate a statistically significant sample.
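The corpus-based approach can be sketched as follows. This is a simplified illustration under stated assumptions: it considers only two-word sequences, and a raw count threshold (`min_count`, a hypothetical parameter) stands in for a proper test of statistical significance. The three-document corpus is invented for the example.

```python
from collections import Counter


def candidate_compounds(documents, min_count=2):
    """Count adjacent word pairs across the documents and keep those
    that occur at least min_count times (a stand-in for a real
    statistical significance test)."""
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        counts.update(zip(words, words[1:]))  # adjacent word pairs
    return {" ".join(pair): n for pair, n in counts.items() if n >= min_count}


# Invented toy corpus.
corpus = [
    "country western music festival",
    "new country western album",
    "western migration of settlers",
]
compounds = candidate_compounds(corpus)
```

With this toy corpus, only “country western” reaches the threshold. The pair “western migration” occurs once and is discarded, which mirrors the difficulty noted above with compounds that occur relatively infrequently.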
The second approach involves extracting compounds from a query log. This technique is similar to the above-discussed first approach, except that compounds are extracted from a log of past user queries instead of from the corpus of web documents. A disadvantage associated with finding compounds in query logs using statistical techniques is that word sequences occurring in query logs may not correspond to compounds in the documents. This is because queries, especially on the web, tend to be abbreviated forms of natural language sequences. For example, the words “mp3” and “download” may occur together often in query logs but “mp3 download” may not occur as a compound in a document.
A disadvantage of both corpus-based and query-log-based techniques, and indeed of any technique relying purely on previously detected compounds and on statistics to segment a query, is that they tend to ignore the meaning of the query. Such techniques may identify a compound that is not consistent with the meaning of the query, which can negatively impact applications that rely on the compound as being a semantic unit within the query.
For example, the queries “country western mp3” and “leaving the old country western migration” both have the words “country” and “western” next to each other. Only for the first query, however, is “country western” a representative compound. Segmenting such queries correctly requires some understanding of the meaning of the query. In the second query, the compound “western migration” is more appropriate, although it occurs less frequently in general.
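The failure described in this example can be reproduced with a minimal sketch of a purely frequency-driven segmenter. The function and the frequency table are hypothetical: the counts are invented, only two-word compounds are considered, and overlapping candidates are resolved by always preferring the more frequent one.

```python
def segment(query, compound_counts):
    """Segment a query using only a table of known compounds and their
    frequencies. When candidate compounds overlap, the more frequent
    one wins, regardless of the meaning of the query."""
    words = query.lower().split()
    # Collect (count, start_index) for every adjacent pair in the table.
    candidates = [
        (compound_counts[" ".join(words[i:i + 2])], i)
        for i in range(len(words) - 1)
        if " ".join(words[i:i + 2]) in compound_counts
    ]
    taken = set()   # word positions already claimed by a compound
    starts = set()  # start positions of the chosen compounds
    for _count, i in sorted(candidates, reverse=True):
        if i not in taken and i + 1 not in taken:
            taken.update((i, i + 1))
            starts.add(i)
    segments = []
    i = 0
    while i < len(words):
        if i in starts:
            segments.append(" ".join(words[i:i + 2]))
            i += 2
        else:
            segments.append(words[i])
            i += 1
    return segments


# Invented counts: "country western" is far more frequent in general.
compound_counts = {"country western": 500, "western migration": 20}

first = segment("country western mp3", compound_counts)
second = segment("leaving the old country western migration", compound_counts)
```

Here `first` is segmented as intended, but `second` also groups “country western” and leaves “migration” on its own, because the table favors the more frequent compound even though “western migration” better fits the meaning of the query.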
Thus, there is a need in the art to more accurately identify compounds that correspond to semantically meaningful units.