The present invention relates to grouping (or clustering) hyperlinked documents. More specifically, the invention relates to techniques for grouping hyperlinked documents from a search so that the links to hyperlinked documents about the same (or similar) topic can be displayed together.
The World Wide Web (or “Web”) contains a vast amount of information in the form of hyperlinked documents (e.g., web pages). One of the reasons for the virtually explosive growth in the number of hyperlinked documents on the Web is that just about anyone can upload hyperlinked documents, which can include links to other hyperlinked documents. Although there is no doubt that there is a vast amount of useful information on the Web, the unstructured nature of the Web can make it difficult to find the information that is desired.
A search engine attempts to return relevant information in response to requests from users. These requests usually come in the form of queries (e.g., sets of words that are related to the desired topic). Search engines typically return a number of links to web pages, each with a brief description of the page. Because the number of pages on the Web is huge, ensuring that the returned pages are relevant to the topic the user had in mind is a central problem in web searching. Possibly the simplest and most prevalent way of doing this is to find web pages containing all or many of the words included in the query, an approach that can be called text-based searching.
Text-based searching over the Web can be notoriously imprecise, and several problems arise in its use. To begin with, a large number of web pages usually match the user's query. Displaying all of these pages to the user is impractical, so some method of ordering the results should be used. Methods that assess and use the quality of the pages, for example by returning only well-linked pages, can significantly improve the quality of the returned results. However, the returned results can still range over a number of different topics, only one of which the user had in mind.
Consider, for example, a search query including only the word “Saturn.” This query can refer to the Saturn brand of car, the planet Saturn, the Sega Saturn game system, or the Roman god Saturn. Most likely, the user is interested in only one of the above topics. However, a search engine searching for this word would come up with a mishmash of results from among all of these topics. A user interested in the Saturn car would have to wade through many irrelevant search results to get the information she desires.
One solution to this problem utilizes user feedback. After the user notices that the results include links from a number of different topics, the user can narrow or focus the search to only the one topic in which the user is interested. For example, the user could add the word “car” to the query. This solution has at least two negative aspects. First, it can exclude high-quality relevant pages that include the word “Saturn” but not “car” (e.g., pages that use “automobile” instead). More importantly, it forces the user to jump through extra hoops in specifying her query. The user is in effect forced to think like a textual search engine and to guess which words appear in pages about the topic she is interested in but not in pages about the other topics matching her original query. Guessing at the search engine's behavior can become an iterative and counterproductive nightmare.
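The behavior described above can be illustrated with a minimal Python sketch of text-based searching. It is offered only as an illustration under stated assumptions and does not represent how any particular search engine is implemented. The page identifiers, the page texts, and the names `tokenize` and `text_search` are hypothetical. The sketch counts how many distinct query words each page contains and, optionally, requires every query word to be present.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def text_search(query, pages, require_all=False):
    """Return page ids ordered by how many distinct query words each page contains.

    Pages that match no query word are dropped. When require_all is True,
    pages missing any query word are dropped as well.
    """
    query_words = tokenize(query)
    scored = []
    for page_id, text in pages.items():
        matched = len(query_words & tokenize(text))
        if matched == 0:
            continue
        if require_all and matched < len(query_words):
            continue
        scored.append((matched, page_id))
    # Most matched words first; ties are broken alphabetically by page id.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [page_id for _, page_id in scored]


# Hypothetical pages, one or two per meaning of "Saturn".
PAGES = {
    "car-dealer": "Saturn car dealership offering new and used models",
    "auto-review": "A review of the latest Saturn automobile",
    "planet": "Saturn is the sixth planet from the Sun",
    "console": "The Sega Saturn game system",
}

# The one-word query matches every page, mixing all of the topics together.
print(text_search("Saturn", PAGES))
# ['auto-review', 'car-dealer', 'console', 'planet']

# Requiring "car" removes the off-topic pages, but it also removes the
# relevant page that says "automobile" instead of "car".
print(text_search("Saturn car", PAGES, require_all=True))
# ['car-dealer']
```

In this sketch the narrowed query loses the "auto-review" page even though it concerns the Saturn car, which is the first drawback noted above. Without `require_all`, that page is returned again, but only alongside the pages about the planet and the game system.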
As the size of the Web continues to increase, it becomes increasingly desirable to have innovative techniques for efficiently grouping hyperlinked documents by topic. By displaying the links to web pages grouped by topic, a search engine can provide some coherence to the search results (i.e., the results do not jump from one topic to another and then back again). Additionally, it would be beneficial to have techniques that group search results by topic efficiently by analyzing the link structure of the Web.