The World Wide Web has given computer users on the Internet access to vast amounts of information in the form of billions of web pages. Each of these pages can be accessed directly by a user typing the IP address or URL (uniform resource locator) of the web page into a web browser on the user's computer, but a person is often more likely to reach a website by finding it with a search engine. A search engine allows a user to input a search query made up of words or terms that the user thinks will appear in the web pages containing the information he or she is looking for. The search engine attempts to match web pages to the terms in the search query and then returns the located web pages to the user. Typically, search engines return the results of the search as a list giving the title of each located web page, a short summary of the page, and the URL of the page. A user can then select one of the titles to view the web page.
As the continued growth in the number of web pages available on the Internet has made the task of search engines more and more difficult, web search engines have greatly increased the size of their indexes and made significant advances in the algorithms used to match a user's query to these indexes. This has allowed these search engines to perform very well when users provide high quality queries. High quality queries are typically queries that are quite specific and made up of terms and phrases that are commonly used in the relevant documents. A high quality search query can often result in a user being provided with many highly relevant documents in the first few pages of search results returned by the search engine.
One of the difficulties in using web search engines is in creating a high quality query. If users do not craft their queries properly, either by not being specific enough or by using phrases and/or terms that do not commonly occur in the relevant documents, the query may not adequately capture the intention of the user, and the web search engine may return results that are not very relevant to what the user is looking for. In some cases, numerous matching documents may be returned, making it hard for a user to determine which of the many documents are relevant. In other cases, where too many keywords are used, few if any documents may be returned. Alternatively, a few relevant documents may be returned but mixed with a relatively large number of non-relevant documents, making it time consuming to find the relevant documents or causing the user to give up his or her search before the relevant documents are found.
Most web search engines allow a user to refine his or her query by supporting interaction based on traditional information retrieval. Basically, most search engines provide an iterative method wherein a user can see what results were returned for an initial search query and can then try again by reformulating the query and having the web search engine return new results. The user can keep reformulating the query and repeating this cycle until the user either gets satisfactory results or gives up and quits.
A number of tools have been developed that attempt to aid a user in performing better searches.
Attempts have been made at query expansion to allow a user to better refine a search query. Query expansion is the process of adding terms to the original query in order to improve the results retrieved by the search engine.
Some previous query expansion methods have used a thesaurus-based approach. A thesaurus is constructed based on the similarity of terms. Word relationships such as synonym, hypernym/hyponym and meronym/holonym relationships are used to suggest similar terms with which to expand the query.
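As a rough illustration of the thesaurus-based approach, the following minimal Python sketch expands a query using a small hand-built thesaurus. The thesaurus contents, the name THESAURUS, and the function expand_query are hypothetical and chosen only for illustration; an actual system would draw on a much larger lexical resource and would likely weight or filter the suggested terms.

```python
from typing import Dict, List, Sequence

# Hypothetical toy thesaurus: each term maps relationship types to related terms.
THESAURUS: Dict[str, Dict[str, List[str]]] = {
    "car": {
        "synonym": ["automobile"],
        "hypernym": ["vehicle"],
        "hyponym": ["sedan"],
        "meronym": ["engine"],
    },
    "repair": {"synonym": ["fix"]},
}


def expand_query(query: str, relations: Sequence[str] = ("synonym",)) -> List[str]:
    """Return the original query terms followed by related thesaurus terms."""
    terms = query.lower().split()
    expanded = list(terms)  # keep the original terms first, in order
    for term in terms:
        for relation in relations:
            for related in THESAURUS.get(term, {}).get(relation, []):
                if related not in expanded:  # avoid duplicate terms
                    expanded.append(related)
    return expanded


if __name__ == "__main__":
    print(expand_query("car repair"))
    print(expand_query("car repair", relations=("synonym", "hypernym")))
```

In this sketch the caller chooses which relationship types to follow, since broader relationships such as hypernyms may widen the query more than the user intends.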
Other previous query expansion methods have used the top ranked documents returned by the initial search query as the knowledge base for the query expansion. In these techniques, the co-occurrence of terms is calculated using only the passages that contain the query terms, rather than the whole document.
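The passage-level co-occurrence idea can be sketched as follows. This is a simplified illustration only, under several assumptions not stated above: a passage is taken to be a single sentence, co-occurrence is measured as the number of query-bearing passages in which a term appears, and the function names, the stopword parameter, and the sample documents are hypothetical.

```python
import re
from collections import Counter
from typing import FrozenSet, List


def split_passages(document: str) -> List[str]:
    """Split a document into passages (assumption: one sentence per passage)."""
    return [p.strip() for p in re.split(r"[.!?]+", document) if p.strip()]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and extract alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expansion_candidates(
    query: str,
    top_documents: List[str],
    k: int = 3,
    stopwords: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Rank terms by how many query-bearing passages they appear in."""
    query_terms = set(tokenize(query))
    counts: Counter = Counter()
    for document in top_documents:
        for passage in split_passages(document):
            tokens = set(tokenize(passage))
            if tokens & query_terms:  # only passages containing a query term
                counts.update(
                    t for t in tokens
                    if t not in query_terms and t not in stopwords
                )
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    docs = [
        "Jaguar cars are built in England. The weather was mild.",
        "The jaguar is a big cat. Jaguar cars have powerful engines.",
    ]
    stop = frozenset({"the", "is", "a", "are", "in", "have", "was"})
    print(expansion_candidates("jaguar", docs, k=3, stopwords=stop))
```

Terms that occur only in passages without a query term, such as "weather" in the sample documents, never become candidates, which is the point of restricting the calculation to query-bearing passages rather than whole documents.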
Information retrieval of web documents poses a number of problems for previous query expansion techniques. Due to the extremely large volume of documents on the web, analysis of the entire collection is not feasible. In addition, web queries are often very short, frequently consisting of only two or three words. Techniques that are somewhat successful with longer search queries often do not prove to be effective with short queries.