It has become common for users of host computers connected to the World Wide Web (the “Web”) to employ Web browsers and search engines to locate Web pages having specific content of interest to users. A search engine, such as Digital Equipment Corporation's Alta Vista search engine, indexes hundreds of millions of Web pages maintained by computers all over the world. The users of the hosts compose queries, and the search engine identifies pages that match the queries, e.g., pages that include key words of the queries. These pages are known as a “result set.”
In many cases, particularly when a query is short or not well defined, the result set can be quite large, for example, thousands of pages. The pages in the result set may or may not satisfy the user's actual information needs. Therefore, techniques have been developed to identify a smaller set of related pages.
In one prior art technique, used by the Excite search engine (see “http://www.excite.com”), users first form an initial query, using the standard query syntax of the Excite search engine, that attempts to specify a topic of interest. After the result set has been returned, the user can use a “Find Similar” option to locate related pages. However, the finding of the related pages is not fully automatic, because the user is first required to form a query before related pages can be identified. In addition, that technique works only on the Excite search engine, and only for the specific subset of Web pages that are indexed by the Excite search engine.
In another prior art technique, an algorithm for connectivity analysis of a neighborhood graph (n-graph) is described by Kleinberg in “Authoritative Sources in a Hyperlinked Environment,” Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998, and also in IBM Research Report RJ 10076, May 1997; see “http://www.cs.cornell.edu/Info/People/kleinber/auth.ps”. The Kleinberg algorithm analyzes the link structure, or connectivity, of Web pages “in the vicinity” of the result set to suggest useful pages in the context of the search that was performed.
The vicinity of a Web page is defined by the hyperlinks that connect the page to others. A Web page can point to other pages, and the page can be pointed to by other pages. Close pages are directly linked, whereas farther pages are indirectly linked via intermediate pages. This connectivity can be expressed as a graph in which the nodes represent the pages and the directed edges represent the links. The vicinity of all the pages in the result set, up to a certain distance, is called the neighborhood graph.
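By way of illustration only, the construction of a neighborhood graph can be sketched as a breadth-first traversal that follows links in both directions from the result set. The function name `neighborhood_graph`, the dictionary-based `links` representation, and the `max_distance` parameter are assumptions made for this sketch and are not taken from the cited references.

```python
from collections import deque


def neighborhood_graph(links, start_pages, max_distance):
    """Collect pages within max_distance links of start_pages.

    links maps each page to the pages it points to (hypothetical input format).
    Returns (nodes, edges), where edges are directed (source, target) pairs.
    """
    # Build a reverse index so that links can also be followed backwards.
    incoming = {}
    for src, targets in links.items():
        for dst in targets:
            incoming.setdefault(dst, set()).add(src)

    # Breadth-first search, treating each link as traversable in both directions.
    distance = {page: 0 for page in start_pages}
    queue = deque(start_pages)
    while queue:
        page = queue.popleft()
        if distance[page] == max_distance:
            continue  # do not expand beyond the chosen distance
        neighbors = set(links.get(page, ())) | incoming.get(page, set())
        for other in neighbors:
            if other not in distance:
                distance[other] = distance[page] + 1
                queue.append(other)

    # Keep only the directed edges whose endpoints both lie in the vicinity.
    nodes = set(distance)
    edges = {(src, dst)
             for src in nodes
             for dst in links.get(src, ())
             if dst in nodes}
    return nodes, edges
```

In this sketch, directly linked pages are at distance one and indirectly linked pages are at greater distances, so the choice of `max_distance` controls how far the vicinity extends.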
Specifically, the Kleinberg algorithm attempts to identify “hub” pages and “authority” pages in the neighborhood graph for a user query. Hubs and authorities exhibit a mutually reinforcing relationship.
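The mutually reinforcing relationship can be illustrated with a minimal sketch of an iterative hub and authority computation, in which a page's authority score is the sum of the hub scores of the pages pointing to it, and a page's hub score is the sum of the authority scores of the pages it points to. The function names, the fixed iteration count, and the Euclidean normalization are assumptions chosen for this sketch; they are not asserted to match the exact procedure of the cited paper.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hubs_and_authorities(edges, iterations=20):
    """Return (hub, authority) score dictionaries for a directed edge list."""
    nodes = {page for edge in edges for page in edge}
    hub = {page: 1.0 for page in nodes}
    auth = {page: 1.0 for page in nodes}
    for _ in range(iterations):
        # Authority update: accumulate hub scores of the pages that point here.
        new_auth = {page: 0.0 for page in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: accumulate the new authority scores of the pages pointed to.
        new_hub = {page: 0.0 for page in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth
```

For example, given the hypothetical edges `[("h1", "a"), ("h2", "a"), ("h1", "b")]`, page `a` receives a higher authority score than page `b`, and page `h1` receives a higher hub score than page `h2`.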
The Kleinberg paper cited above also describes an algorithm that can be used to determine related pages by starting with a single page. The algorithm works by first finding a set of pages that point to the page, and then running the base algorithm on the resulting graph. However, this algorithm for finding related pages differs from our invention in that it does not deal with popular URLs, with neighborhood graphs containing duplicate pages, or with cases where the computation is totally dominated by a single “hub” page, nor does the algorithm include an analysis of the contents of pages when it is computing the most related pages.
The CLEVER algorithm is a set of extensions to Kleinberg's algorithm; see S. Chakrabarti et al., “Experiments in Topic Distillation,” ACM SIGIR Workshop on Hypertext Information Retrieval on the Web, Melbourne, Australia, 1998. The goal of the CLEVER algorithm is to distill the most important sources of information from a collection of pages about a topic.
In U.S. patent application Ser. No. 09/007,635 “Method for Ranking Pages Using Connectivity and Content Analysis” filed by Bharat et al. on Jan. 15, 1998, a method is described that examines both the connectivity and the content of pages to identify useful pages. However, the method is relatively slow because all pages in the neighborhood graph are fetched in order to determine their relevance to the query topic. This is necessary to reduce the effect of non-relevant pages in the subsequent connectivity analysis phase.
In U.S. patent application Ser. No. 09/058,577 “Method for Ranking Documents in a Hyperlinked Environment using Connectivity and Selective Content Analysis” filed by Bharat et al. on Apr. 9, 1998, a method is described that performs content analysis on only a small subset of the pages in the neighborhood graph to determine relevance weights, and pages with low relevance weights are pruned from the graph. Then, the pruned graph is ranked according to a connectivity analysis. This method still requires the result set of a query to form a query topic.
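The pruning step described above can be sketched as follows, purely for illustration. The function name `prune_graph`, the `threshold` parameter, and the choice to retain pages for which no relevance weight was computed are assumptions of this sketch; the cited application is not quoted here as specifying any of them.

```python
def prune_graph(nodes, edges, relevance, threshold):
    """Drop pages whose relevance weight falls below threshold.

    relevance maps a page to its weight; pages absent from the mapping were
    not analyzed and are kept (an assumption made for this sketch).
    """
    kept = {page for page in nodes
            if page not in relevance or relevance[page] >= threshold}
    # An edge survives only if both of its endpoints survive.
    kept_edges = {(src, dst) for src, dst in edges
                  if src in kept and dst in kept}
    return kept, kept_edges
```

The pruned node and edge sets can then be passed to a connectivity analysis for ranking.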
Therefore, there is a need for a method for identifying related pages in a linked database that requires neither a query nor the fetching of many unrelated pages.