1. Field of the Invention
Implementations described herein relate generally to information retrieval and, more particularly, to the determination and processing of queries that include terms of little significance.
2. Description of Related Art
The World Wide Web (“web”) contains a vast amount of information. Locating a desired portion of the information, however, can be challenging. This problem is compounded because the amount of information on the web and the number of new users inexperienced at web searching are growing rapidly.
Search engines attempt to return hyperlinks to web pages in which a user is interested. Generally, search engines base their determination of the user's interest on search terms (called a search query) entered by the user. The goal of the search engine is to provide links to high-quality, relevant results (e.g., web pages) to the user based on the search query. Typically, the search engine accomplishes this by matching the terms in the search query to a corpus of pre-stored web pages. Web pages that contain the user's search terms are considered “hits” and are returned to the user as links.
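By way of illustration only, the term matching described above may be sketched as follows. The sketch assumes a small in-memory corpus, whitespace tokenization, and a rule that a page is a hit only when it contains every query term. None of these details is specified in this description, and the names `build_index` and `find_hits` are hypothetical.

```python
from collections import defaultdict


def build_index(corpus):
    """Map each lowercase term to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index


def find_hits(index, query):
    """Return the ids of pages containing every term in the query.

    Requiring all terms is an assumption made for this sketch.
    """
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the pages matching the first term, then narrow.
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits


# Hypothetical corpus of pre-stored pages, keyed by page id.
corpus = {
    "page1": "cheap flights to paris",
    "page2": "paris hotels and flights",
    "page3": "cooking with garlic",
}
index = build_index(corpus)
print(sorted(find_hits(index, "flights paris")))  # ['page1', 'page2']
```

Under this rule, every additional term in a query can only shrink the set of hits, which is one way extra words may affect the results that are returned.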
Sometimes users include more words (“extra words”) in their search queries than are required to identify the information they seek. These extra words often degrade the quality of the search results. Some search engines identify “stop words” (e.g., “a,” “the,” “of,” “is,” etc.) and ignore the stop words when they are included in a search query. Other search engines identify “common words” by analyzing a corpus based on term frequency (TF) or inverse document frequency (IDF) and ignore the common words when they are included in a search query. These techniques, however, sometimes fail to produce quality search results.
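For illustration, the two conventional techniques mentioned above may be sketched as follows. The stop-word list is taken from the examples given in the text. The IDF formula (natural logarithm of the number of documents divided by the number of documents containing the term), the threshold value, and the sample corpus are assumptions made for this sketch and are not taken from any particular search engine.

```python
import math

# Stop words taken from the examples in the text above.
STOP_WORDS = {"a", "the", "of", "is"}


def remove_stop_words(query):
    """Drop query terms that appear in the fixed stop-word list."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]


def inverse_document_frequency(corpus):
    """Compute log(N / df) per term, where df counts documents containing it."""
    n = len(corpus)
    doc_freq = {}
    for text in corpus:
        for term in set(text.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n / df) for term, df in doc_freq.items()}


def remove_common_words(query, idf, threshold):
    """Drop terms whose IDF falls below the threshold.

    Terms absent from the corpus are kept (an assumption of this sketch).
    """
    return [
        t for t in query.lower().split()
        if idf.get(t, float("inf")) >= threshold
    ]


# Hypothetical corpus used only to derive IDF values.
corpus = [
    "best pizza in new york",
    "best hotels in paris",
    "best hiking trails in utah",
    "pizza dough recipe",
]
idf = inverse_document_frequency(corpus)

print(remove_stop_words("the history of the pizza"))
# ['history', 'pizza']
print(remove_common_words("best pizza in paris", idf, threshold=0.5))
# ['pizza', 'paris']
```

Both filters decide from a fixed list or from corpus-wide statistics alone. Neither considers the role a word plays within the particular query in which it appears.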