A. Field of the Invention
The present invention relates generally to information location and, more particularly, to search engines that locate information on a network, such as the World Wide Web.
B. Description of Related Art
The World Wide Web (“web”) contains a vast amount of information. Locating a desired portion of the information, however, can be challenging. This problem is compounded because the amount of information on the web and the number of new users inexperienced at web research are growing.
People generally surf the web based on its link graph structure, often starting with high quality human-maintained indices or search engines. Human-maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and do not cover all esoteric topics.
Automated search engines, in contrast, locate web sites by matching search terms entered by the user to a pre-indexed corpus of web pages. Generally, the search engine returns a list of web sites sorted based on relevance to the user's search terms. Determining the correct relevance, or importance, of a web page to a user, however, can be a difficult task. For one thing, the importance of a web page to the user is inherently subjective and depends on the user's interests, knowledge, and attitudes. There is, however, much that can be determined objectively about the relative importance of a web page. Existing methods of determining relevance are based on matching a user's search terms to terms indexed from web pages. More advanced techniques determine the importance of a web page based on more than the content of the web page. For example, one known method, described in the article entitled “The Anatomy of a Large-Scale Hypertextual Search Engine,” (1998) by Sergey Brin and Lawrence Page, assigns a degree of importance to a web page based on the link structure of the web page.
A search query entered by a user is typically only one of many queries that could express the information that the user desires. For example, someone looking to buy replacement parts for a car may pose the search query “car parts.” Alternatively, however, the search queries “car part,” “auto parts,” or “automobile spare parts” may be as effective or more effective in returning related documents. In general, a user query will have multiple possible alternative queries that could be helpful in returning documents that the user considers relevant.
Conventionally, additional search queries relating to an initial user query may be automatically formed by the search engine based on different forms of a search term (e.g., “part” or “parts”) or based on synonyms of a search term (e.g., “auto” instead of “car”). This allows the search engine to find documents that do not contain exact matches to the user's search query but that are nonetheless relevant.
One known technique that finds alternative search queries related to the search query entered by the user is based on the concept of stems. A word stem is the underlying linguistic form from which the final form of a word is derived by morphological processes. Techniques for identifying the stem of a word are well known in the field of computational linguistics. One such technique is described by Porter, M. F., 1980, An Algorithm For Suffix Stripping, Program, Vol. 14(3):130-137. Words with the same stem, such as “congress” and “congressional,” tend to describe similar concepts. Stemming allows a search engine to match a query word to various morphological variants of that word. The search engine can use each of these variants in formulating the search query.
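The stem-matching idea described above can be sketched in a few lines of Python. This is an illustrative toy only and is not the Porter algorithm cited above: the rule table `RULES` and the function names `toy_stem` and `variants` are assumptions introduced here, and the rules cover only the example words used in this section.

```python
# Toy suffix stripper. ASSUMPTION: this small rule table is hypothetical and
# is NOT the Porter algorithm; it only handles the examples in this section.
RULES = (("ional", ""), ("ss", "ss"), ("s", ""))  # checked in order


def toy_stem(word):
    """Apply the first matching rule, leaving at least two characters."""
    lower = word.lower()
    for suffix, replacement in RULES:
        if lower.endswith(suffix) and len(lower) > len(suffix) + 1:
            return lower[: len(lower) - len(suffix)] + replacement
    return lower


def variants(query_word, vocabulary):
    """Return the vocabulary words that share a stem with query_word."""
    target = toy_stem(query_word)
    return sorted(w for w in vocabulary if toy_stem(w) == target)


print(toy_stem("congressional"))  # congress
print(variants("parts", ["part", "parts", "party"]))  # ['part', 'parts']
```

Under these assumed rules, “jaguars” also reduces to “jaguar,” which is the kind of variant a search engine could add when formulating the query.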
A second known technique that finds alternative search queries related to the search query entered by the user is based on the matching of query terms to their synonyms (e.g., car to automobile). The synonyms may be determined by looking up the terms in a thesaurus.
One serious problem with the stem-based and synonym-based techniques for finding additional search queries is that two words may have similar semantics in some contexts, but not in other contexts. For example, “automobile” has similar semantics to “car” in the query “Ford car”, but not in the query “railroad car.” As a result, these techniques often produce search queries that generate irrelevant results. As another example, if the query “jaguars” is stemmed to the word “jaguar,” the query semantics may change from that of an animal to that of a popular car.
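The context problem can be illustrated with a context-blind thesaurus lookup. The following Python sketch is an assumption made for illustration: the one-entry `THESAURUS` and the function `expand_with_synonyms` are hypothetical and do not represent any particular search engine.

```python
# ASSUMPTION: a hypothetical one-entry thesaurus; a real system would consult
# a much larger resource.
THESAURUS = {"car": ["automobile", "auto"]}


def expand_with_synonyms(query):
    """Return alternative queries, each replacing one term with a synonym.

    The lookup ignores the neighboring terms, so it cannot tell whether a
    synonym fits the context of the query.
    """
    terms = query.lower().split()
    alternatives = []
    for i, term in enumerate(terms):
        for synonym in THESAURUS.get(term, []):
            alternatives.append(" ".join(terms[:i] + [synonym] + terms[i + 1:]))
    return alternatives


print(expand_with_synonyms("Ford car"))
# ['ford automobile', 'ford auto']
print(expand_with_synonyms("railroad car"))
# ['railroad automobile', 'railroad auto']
```

The first expansion is plausible, while the second yields queries such as “railroad automobile” that no longer reflect the original query.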
A third known technique that finds alternative search queries related to the search query entered by the user is based on finding additional terms that occur frequently in documents matching the original query, and adding one or more of the additional terms to the query. One serious problem with this technique is that it may introduce terms that change the focus of the query. For example, the word “drive” may be present in many documents matching the query “Ford car”, but it would not be an appropriate addition to the query.
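A minimal Python sketch of this third technique follows, under stated assumptions: the documents in `SAMPLE_DOCS` are invented for illustration, matching is simplified to requiring every query term to appear in a document, and the function name `frequent_cooccurring_terms` is hypothetical.

```python
from collections import Counter

# ASSUMPTION: invented sample documents, used only for illustration.
SAMPLE_DOCS = [
    "drive a ford car today",
    "ford car owners drive daily",
    "railroad car schedule",
]


def frequent_cooccurring_terms(query, documents, top_n=1):
    """Return the top_n non-query terms most frequent in matching documents.

    A document matches when it contains every query term (a simplification).
    """
    query_terms = set(query.lower().split())
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        if query_terms.issubset(words):
            counts.update(w for w in words if w not in query_terms)
    return [term for term, _ in counts.most_common(top_n)]


print(frequent_cooccurring_terms("Ford car", SAMPLE_DOCS))  # ['drive']
```

With this assumed data, the most frequent additional term is “drive,” so the expanded query would become “Ford car drive,” which shifts the focus of the query in the manner described above.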
Accordingly, it would be desirable to expand search queries more effectively, finding alternative search terms that encompass the semantic intent of the original search query without unduly changing its focus.