The World Wide Web has given computer users on the Internet access to vast amounts of information in the form of billions of web pages. Each of these pages can be accessed directly by a user typing the URL (uniform resource locator) of the web page into a web browser on the user's computer, but a person is often more likely to reach a website by finding it with a search engine. A search engine allows a user to input a search query made up of words or terms that the user thinks will be used in the web pages containing the information he or she is looking for. The search engine attempts to match web pages to the terms in the search query and then returns the located web pages to the user.
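As a rough illustration of matching pages to query terms, the following minimal sketch builds a small inverted index and returns the pages that contain every term of a query. The page contents, the example.com URLs, and the function names `build_index` and `search` are hypothetical, and real search engines use far more elaborate matching and ranking.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the pages matching the first term, then intersect
    # with the pages matching each remaining term.
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return matches


# Hypothetical pages, for illustration only.
pages = {
    "http://example.com/a": "used car prices",
    "http://example.com/b": "big cat habitat",
    "http://example.com/c": "car dealer prices",
}
index = build_index(pages)
hits = search(index, "car prices")
```

Here `hits` contains the two pages that mention both "car" and "prices"; the page about cats is not returned.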
The search results generated from a user's query typically consist of a collection of document surrogates, each of which contains summary information, attributes, and other metadata about a matched document. These document surrogates are often presented in a simple list-based format, displaying the title of the document, a snippet containing the query terms in context, and the uniform resource locator (URL). A user can then select one of the returned entries to view the corresponding web page.
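A document surrogate of the kind described above might be modeled as in the following sketch. The class name `DocumentSurrogate`, the helper `make_snippet`, the window size, and the sample page are assumptions made for illustration; they are not taken from any particular search engine.

```python
from dataclasses import dataclass


@dataclass
class DocumentSurrogate:
    """Summary record shown in a result list in place of the full page."""
    title: str
    snippet: str
    url: str


def make_snippet(text, query_terms, window=3):
    """Return the words surrounding the first query-term hit in the text."""
    words = text.split()
    terms = {t.lower() for t in query_terms}
    for i, word in enumerate(words):
        # Ignore trailing punctuation when comparing against query terms.
        if word.lower().strip(".,;:!?") in terms:
            start = max(0, i - window)
            return " ".join(words[start:i + window + 1])
    # No query term found: fall back to the opening words of the text.
    return " ".join(words[:2 * window + 1])


# Hypothetical page text and URL, for illustration only.
page_text = "The jaguar is a large cat native to the Americas."
surrogate = DocumentSurrogate(
    title="Jaguar facts",
    snippet=make_snippet(page_text, ["cat"], window=2),
    url="http://example.com/jaguar",
)
```

With a window of two words on each side, the snippet for the query term "cat" is "a large cat native to".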
As the continued growth in the number of web pages available on the Internet makes the task of search engines more and more difficult, web search engines have greatly increased the size of their indexes and made significant advances in the algorithms used to match a user's query against these indexes. However, while significant effort has clearly gone into creating web search engines that can index billions of documents and return search results in a fraction of a second, this has created the problem of search queries returning numerous results.
While relevant documents might be present in the search results returned from a search engine, the returned search results often consist of tens or hundreds of individual documents, making it hard for a user to determine which of the search results are relevant to the information the user is looking for.
While information retrieval techniques used by web search engines have improved substantially over the years, the search results are still typically represented in a simple list-based format. Although this list-based representation makes it easy to evaluate a single document, it does not support users in the broader tasks of manipulating the search results, comparing documents, or finding a set of relevant documents. Even though this simple list-based representation presents the search results in a clear and effective manner for determining the relevance of individual document surrogates, it requires that each document surrogate be evaluated in turn and, to some degree, in the order provided. If hundreds of documents are returned, it is inefficient, if not completely impractical, for a user to review all of them to determine the most relevant documents located in the search. Requiring users to evaluate each document surrogate individually, often with only ten documents per page, leads to a common search behavior in which users evaluate only a few pages of search results before either reformulating their query or giving up.
One solution that can be used to address these numerous search results is for the user to reformulate his or her search query to narrow the search, with the result that fewer documents are located matching the search query. However, in many cases there may be high-quality relevant documents buried in the search results set that were missed because the user did not look at enough search result pages.
Another method that has been used is to cluster the search results such that documents that are similar to one another are grouped together. In such a system, a user navigates the clusters in order to narrow down the search results and avoid clusters of irrelevant documents. Ideally, the user will select the relevant clusters and view lists of returned documents of which a large portion are relevant to the requirements of the user.
One of the problems with these systems is determining what the clusters should be centered around and determining an adequate description of each cluster. If the description does not correctly describe the documents contained in the cluster, a user may either choose clusters that are not relevant or entirely miss clusters that contain relevant documents.
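To make the clustering approach concrete, the following minimal sketch groups result snippets by word overlap. It assumes a greedy single-pass scheme with Jaccard similarity and an arbitrary threshold of 0.2, chosen only for illustration; it is not the method of any particular clustering search system, and the function names and sample snippets are hypothetical. Each cluster is centered on the first snippet assigned to it, which is a crude stand-in for the harder problem of choosing cluster centers and descriptions.

```python
import re


def tokens(text):
    """Return the set of lowercase words in a snippet."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two word sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_results(snippets, threshold=0.2):
    """Greedy single-pass clustering of snippets.

    Each snippet joins the first cluster whose seed snippet is similar
    enough; otherwise it starts a new cluster with itself as the seed.
    Returns a list of clusters, each a list of snippet indices.
    """
    clusters = []  # list of (seed_tokens, member_indices)
    for i, snippet in enumerate(snippets):
        t = tokens(snippet)
        for seed, members in clusters:
            if jaccard(seed, t) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((t, [i]))
    return [members for _, members in clusters]


# Hypothetical result snippets for an ambiguous query such as "jaguar".
results = [
    "jaguar car dealer prices",
    "jaguar big cat habitat",
    "used jaguar car prices",
    "big cat conservation habitat",
]
groups = cluster_results(results)
```

In this example the two car-related snippets (indices 0 and 2) end up in one cluster and the two animal-related snippets (indices 1 and 3) in another, so a user interested in the animal could skip the first cluster entirely.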