This invention relates generally to computerized information retrieval, and more particularly to identifying related pages in a hyperlinked database environment such as the World Wide Web.
It has become common for users of host computers connected to the World Wide Web (the "Web") to employ Web browsers and search engines to locate Web pages having specific content of interest to users. A search engine, such as Digital Equipment Corporation's AltaVista search engine, indexes hundreds of millions of Web pages maintained by computers all over the world. The users of the hosts compose queries, and the search engine identifies pages that match the queries, e.g., pages that include key words of the queries. These pages are known as a "result set."
In many cases, particularly when a query is short or not well defined, the result set can be quite large, for example, thousands of pages. The pages in the result set may or may not satisfy the user's actual information needs. Therefore, techniques have been developed to identify a smaller set of related pages.
In one prior art technique, used by the Excite search engine (see "http://www.excite.com"), users first form an initial query that attempts to specify a topic of interest, using the standard query syntax of the Excite search engine. After the result set has been returned, the user can use a "Find Similar" option to locate related pages. However, the finding of related pages there is not fully automatic, because the user is first required to form a query before related pages can be identified. In addition, that technique works only on the Excite search engine, and it provides related pages only for the specific subset of Web pages that are indexed by the Excite search engine.
In another prior art technique, an algorithm for connectivity analysis of a neighborhood graph (n-graph) is described by Kleinberg in "Authoritative Sources in a Hyperlinked Environment," Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998, and also in IBM Research Report RJ 10076, May 1997; see "http://www.cs.cornell.edu/Info/People/kleinber/auth.ps". The Kleinberg algorithm analyzes the link structure, or connectivity, of Web pages "in the vicinity" of the result set to suggest useful pages in the context of the search that was performed.
The vicinity of a Web page is defined by the hyperlinks that connect the page to others. A Web page can point to other pages, and the page can be pointed to by other pages. Close pages are directly linked; farther pages are indirectly linked via intermediate pages. This connectivity can be expressed as a graph where nodes represent the pages and directed edges represent the links. The vicinity of all the pages in the result set, up to a certain distance, is called the neighborhood graph.
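The notion of a neighborhood graph described above can be illustrated with a minimal Python sketch. It assumes the link structure is already available as an in-memory dictionary; the function name `neighborhood`, the `max_distance` parameter, and the toy page names are hypothetical and do not come from the cited work.

```python
from collections import deque


def neighborhood(links, start_pages, max_distance):
    """Return the pages within max_distance link hops of any start page.

    links maps each page to the list of pages it points to. Links are
    followed in both directions, because a page's vicinity includes pages
    it points to as well as pages that point to it.
    """
    # Build the reverse map so that incoming links can be followed too.
    incoming = {}
    for src, targets in links.items():
        for dst in targets:
            incoming.setdefault(dst, []).append(src)

    # Breadth-first search, recording the hop distance of every page reached.
    distance = {page: 0 for page in start_pages}
    queue = deque(start_pages)
    while queue:
        page = queue.popleft()
        if distance[page] == max_distance:
            continue  # do not expand beyond the distance limit
        for neighbor in links.get(page, []) + incoming.get(page, []):
            if neighbor not in distance:
                distance[neighbor] = distance[page] + 1
                queue.append(neighbor)
    return set(distance)


# Toy link structure (assumed data): a -> b -> c -> d, and e -> a.
toy_links = {"a": ["b"], "b": ["c"], "c": ["d"], "e": ["a"]}
vicinity = neighborhood(toy_links, ["a"], 1)
```

With a distance limit of 1, the vicinity of page `a` contains `a` itself, the page it points to (`b`), and the page that points to it (`e`).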
Specifically, the Kleinberg algorithm attempts to identify "hub" pages and "authority" pages in the neighborhood graph for a user query. Hubs and authorities exhibit a mutually reinforcing relationship.
The Kleinberg paper cited above also describes an algorithm that can be used to determine related pages by starting with a single page. The algorithm works by first finding a set of pages that point to the page, and then running the base algorithm on the resulting graph. However, this algorithm for finding related pages differs from our invention in that it does not deal with popular URLs, with neighborhood graphs containing duplicate pages, or with cases where the computation is totally dominated by a single "hub" page, nor does the algorithm include an analysis of the contents of pages when it is computing the most related pages.
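The mutually reinforcing relationship can be sketched as an iterative computation in which a page's authority score is the sum of the hub scores of the pages pointing to it, and a page's hub score is the sum of the authority scores of the pages it points to. The following Python sketch illustrates that idea only; the function names, the fixed iteration count, and the toy graph are assumptions, and the sketch omits any refinements of the published algorithm.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hubs_and_authorities(links, iterations=20):
    """Iteratively compute hub and authority scores for a directed graph.

    links maps each page to the list of pages it points to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages that point to the page.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub: sum of the authority scores of the pages the page points to.
        new_hub = {p: sum(new_auth[dst] for dst in links.get(p, []))
                   for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Toy graph (assumed data): three pages point to x, only h1 also points to y.
toy_graph = {"h1": ["x", "y"], "h2": ["x"], "h3": ["x"]}
hub_scores, authority_scores = hubs_and_authorities(toy_graph)
```

In this toy graph, `x` receives the highest authority score because every hub points to it, and `h1` receives the highest hub score because it points to both authorities.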
The CLEVER algorithm is a set of extensions to Kleinberg's algorithm; see S. Chakrabarti et al., "Experiments in Topic Distillation," ACM SIGIR Workshop on Hypertext Information Retrieval on the Web, Melbourne, Australia, 1998. The goal of the CLEVER algorithm is to distill the most important sources of information from a collection of pages about a topic.
In U.S. patent application Ser. No. 09/007,635, "Method for Ranking Pages Using Connectivity and Content Analysis," filed by Bharat et al. on Jan. 15, 1998, a method is described that examines both the connectivity and the content of pages to identify useful pages. However, the method is relatively slow because all pages in the neighborhood graph are fetched in order to determine their relevance to the query topic. This is necessary to reduce the effect of non-relevant pages in the subsequent connectivity analysis phase.
In U.S. patent application Ser. No. 09/058,577, "Method for Ranking Documents in a Hyperlinked Environment using Connectivity and Selective Content Analysis," filed by Bharat et al. on Apr. 9, 1998, now U.S. Pat. No. 6,112,203, a method is described which performs content analysis on only a small subset of the pages in the neighborhood graph to determine relevance weights, and pages with low relevance weights are pruned from the graph. Then, the pruned graph is ranked according to a connectivity analysis. This method still requires the result set of a query to form a query topic.
Therefore, there is a need for a method for identifying related pages in a linked database that requires neither a query nor the fetching of many unrelated pages.
Provided is a method for identifying related pages among a plurality of pages in a linked database such as the World Wide Web. An initial page is selected from the plurality of pages in a convenient manner, by specifying the URL of the page or by clicking on the page using a Web browser.
Pages linked directly or indirectly to the initial page are represented as a neighborhood graph in a memory. The pages represented in the graph are scored on content by a similarity measurement against a topic extracted from a chosen subset of the represented pages.
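As an illustration of content scoring against an extracted topic, the following Python sketch assumes that the topic is the combined term-frequency vector of the chosen subset of pages and that the similarity measurement is cosine similarity. Both choices are assumptions made for illustration, since the text above does not fix a particular measurement; the function names and sample page texts are likewise hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Term-frequency vector of a page, using whitespace tokenization."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def content_scores(page_text, topic_pages):
    """Score every page against a topic built from the chosen subset of pages.

    Assumption: the topic is the summed term vector of the pages in
    topic_pages, and the score is the cosine similarity to that topic.
    """
    topic = Counter()
    for url in topic_pages:
        topic.update(term_vector(page_text[url]))
    return {url: cosine(term_vector(text), topic)
            for url, text in page_text.items()}


# Assumed sample data: page identifiers mapped to already-fetched page text.
sample_text = {
    "p1": "graph search ranking",
    "p2": "graph ranking links",
    "p3": "cooking recipes pasta",
}
scores = content_scores(sample_text, ["p1"])
```

Here `p2` shares two of its three terms with the topic page `p1` and scores about 0.67, while `p3` shares none and scores 0.0.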
A set of pages is selected from the pages in the graph, the selected pages having scores greater than a first predetermined threshold and not belonging to a predetermined list of "stop URLs." Stop URLs are highly popular, general purpose sites such as search engines. The selected set of pages is then scored on connectivity, and a subset of the set of pages having scores greater than a second predetermined threshold is selected as the related pages. Finally, during an optional pass, content analysis can be done on highly ranked pages to determine which pages have high content scores.
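The two-threshold selection described above can be sketched in Python as follows. The sketch assumes content scores have already been computed, and it uses a simple count of incoming links from other selected pages as a placeholder connectivity score; that placeholder, the function name `select_related`, the threshold values, and the sample URLs are all assumptions for illustration rather than the connectivity analysis of the method itself.

```python
def select_related(content_score, links, stop_urls,
                   content_threshold, connectivity_threshold):
    """Filter pages by content score and stop URLs, then by connectivity.

    content_score maps page -> content score; links maps page -> list of
    pages it points to. Returns the related pages, best connected first.
    """
    # Step 1: keep pages above the first threshold that are not stop URLs.
    kept = {url for url, score in content_score.items()
            if score > content_threshold and url not in stop_urls}

    # Step 2: placeholder connectivity score (an assumption): the number of
    # links a page receives from other kept pages.
    connectivity = {url: 0 for url in kept}
    for src in kept:
        for dst in links.get(src, []):
            if dst in kept and dst != src:
                connectivity[dst] += 1

    # Step 3: keep pages above the second threshold, ordered by score and
    # then by name so that the result is deterministic.
    related = [url for url, score in connectivity.items()
               if score > connectivity_threshold]
    return sorted(related, key=lambda url: (-connectivity[url], url))


# Assumed sample data; "search.example" stands in for a stop URL.
sample_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "search.example": 0.95, "d": 0.1}
sample_links = {"a": ["c", "search.example"], "b": ["c", "a"], "c": [], "d": ["c"]}
related_pages = select_related(sample_scores, sample_links,
                               {"search.example"}, 0.5, 0)
```

In this example the stop URL is excluded despite its high content score, page `d` is dropped by the first threshold, and `b` is dropped by the second threshold because no selected page links to it.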