1. Field of the Invention
This invention relates generally to computerized information retrieval, and more particularly to identifying related pages in a hyperlinked database environment such as the World Wide Web.
2. Description of the Related Art
It has become common for users of host computers connected to the World Wide Web (the “Web”) to employ Web browsers and search engines to locate Web pages having specific content of interest to users. A search engine, such as Digital Equipment Corporation's Alta Vista search engine, indexes hundreds of millions of Web pages maintained by computers all over the world. The users of the hosts compose queries, and the search engine identifies pages that match the queries, e.g., pages that include key words of the queries. These pages are known as a “result set.” In many cases, particularly when a query is short or not well defined, the result set can be quite large, for example, thousands of pages. The pages in the result set may or may not satisfy the user's actual information needs. The vast majority of users are not interested in retrieving the entire huge set of resources. Most users will be quite satisfied with a few authoritative results which are highly relevant to the topic of the query. The challenge is to retrieve only the most relevant resources to the query.
The Web is a hyperlinked collection. In addition to the textual content of the individual pages, the link structure of such collections contains information which can, and should, be tapped when searching for authoritative sources. Consider the significance of a link from a page p to a page q. With such a link, p suggests, or even recommends, that surfers visiting p follow the link and visit q. This may reflect the fact that pages p and q share a common topic of interest, and that the author of p thinks highly of q's content. Such a link, called an informative link, is p's way to confer authority on q. Note that informative links provide a positive critical assessment of q's contents which originates from outside the control of the author of q (as opposed to assessments based on q's textual content, which is under complete control of q's author).
The vicinity of a Web page is defined by the hyperlinks that connect the page to others. A Web page can point to other pages, and the page can be pointed to by other pages. Close pages are directly linked; farther pages are indirectly linked via intermediate pages. This connectivity can be expressed as a graph where nodes represent the pages, and the directed edges represent the links. The vicinity of all the pages in the result set, up to a certain distance, is called the neighborhood graph.
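By way of illustration only, the neighborhood of a result set might be collected with a breadth-first traversal that follows links in both directions. The function name `neighborhood`, the dictionary representation of links, and the sample pages are assumptions made for this sketch and do not appear in the prior art described above.

```python
from collections import deque


def neighborhood(links, seeds, max_distance):
    """Return {page: distance} for pages within max_distance links of a seed.

    links maps each page to the pages it points to (hypothetical layout).
    A link is followed in either direction, since a page's vicinity includes
    both the pages it points to and the pages that point to it.
    """
    # Build a reverse index so incoming links can be followed as well.
    incoming = {}
    for src, targets in links.items():
        for dst in targets:
            incoming.setdefault(dst, set()).add(src)

    distance = {page: 0 for page in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if distance[page] == max_distance:
            continue  # do not expand beyond the requested distance
        neighbors = set(links.get(page, ())) | incoming.get(page, set())
        for nxt in neighbors:
            if nxt not in distance:
                distance[nxt] = distance[page] + 1
                queue.append(nxt)
    return distance


# Hypothetical link structure: e -> a -> b -> c -> d, with "b" as the result set.
sample_links = {"a": ["b"], "b": ["c"], "c": ["d"], "e": ["a"]}
near = neighborhood(sample_links, ["b"], 1)
far = neighborhood(sample_links, ["b"], 2)
```

With a distance of one, only the directly linked pages a and c join the seed page b; raising the distance to two also brings in pages d and e.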
The Kleinberg algorithm attempts to identify “hub” pages and “authority” pages in the neighborhood graph for a user query. Hubs and authorities exhibit a mutually reinforcing relationship. The Kleinberg algorithm determines related pages starting with a single page. The algorithm works by first finding a set of pages that point to the page, and then running the base algorithm on the resulting graph. However, this algorithm for finding related pages does not deal with popular URLs, with neighborhood graphs containing duplicate pages, or with cases where the computation is totally dominated by a single “hub” page. The algorithm also does not include an analysis of the contents of pages when it is computing the most related pages.
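The mutually reinforcing relationship between hubs and authorities can be sketched as an iteration in which a page's authority score is the sum of the hub scores of the pages pointing to it, and its hub score is the sum of the authority scores of the pages it points to. The following is a minimal sketch of that commonly described iteration, not a reproduction of any particular implementation; the function names, the edge-list input, the iteration count, and the sample pages are assumptions.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op for an all-zero vector)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hub_authority_scores(edges, iterations=50):
    """Iterate hub and authority scores over (source, target) link pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum of the fresh authority scores of the pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical neighborhood graph: three pages point to a1, one of them also to a2.
sample_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hub_authority_scores(sample_edges)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
```

In this small graph the page a1, which is pointed to by all three candidate hubs, receives the highest authority score, and the page h1, which points to both authorities, receives the highest hub score.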
The Google search engine uses a feature called PageRank to prioritize the results of web keyword searches. The PageRank technique examines a single random walk on the entire Web. PageRank assumes page A has pages T1 . . . Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. Also, C(A) is defined as the number of links going out of page A. The PageRank (PR) of a page A is given as follows:
PR(A) = (1 − d) + d (PR(T1)/C(T1) + . . . + PR(Tn)/C(Tn))
The PageRanks form a probability distribution over the web pages, so the sum of all web pages' PageRanks is one. PageRank or PR(A) corresponds to the principal eigenvector of the normalized link matrix of the web. The ranking of web sites is independent of the search query, and, unlike the Kleinberg algorithm, no distinction is made between hubs and authorities. There is also no provision for externally evaluating sites and using the evaluations to weight the usefulness rankings.
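For illustration, the PageRank formula as quoted above can be evaluated by repeated substitution until the values settle. The sketch below iterates that formula literally; the function name, the damping value of 0.85, the iteration count, and the three-page sample graph are assumptions chosen for the example.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it points to (hypothetical
    layout); C(T) is taken to be the length of T's list.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for src, targets in links.items():
            if not targets:
                continue  # a page with no outgoing links passes nothing on
            share = d * pr[src] / len(targets)
            for dst in targets:
                new_pr[dst] += share
        pr = new_pr
    return pr


# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A.
sample_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(sample_links)
total = sum(ranks.values())
```

In this sketch page C, which is cited by both A and B, ends with the highest value, followed by A and then B. Every sample page has at least one outgoing link, and the iterated values sum to the number of pages (three here); dividing each value by the page count rescales them to sum to one.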
Another method for ranking pages in a search result known in the art is disclosed in a paper entitled “The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect”, by Ronny Lempel and Shlomo Moran, which is published on the Web at http://www9.org/w9cdrom/175/175.html. The SALSA method examines random walks on graphs derived from the link structure among pages in a search result. While preserving the theme that Web sites pertaining to a given topic should be split into hubs and authorities, it replaces Kleinberg's Mutual Reinforcement method by a stochastic method, in which the coupling between hubs and authorities is less tight. The method is based on considering a bipartite graph G, whose two parts correspond to hubs and authorities, where an edge between hub r and authority s means that there is an informative link from r to s. Then, authorities and hubs pertaining to the dominant topic of the sites in G should be highly visible (reachable) from many sites in G. These sites are identified by examining certain random walks in G, under the proviso that such random walks will tend to visit these highly visible sites more frequently than other, less connected sites. The SALSA approach is based upon the theory of Markov chains, and relies on the stochastic properties of random walks performed on a collection of sites. It differs from Kleinberg's Mutual Reinforcement approach in the manner in which the association matrices are defined. The SALSA approach also initially assumes uniform probability over all pages, and relies on the random walk process to determine the likelihood that a particular page will be visited.
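One way to picture the random walk on the authority side of the bipartite graph G is as a two-step move: from an authority, step backward to one of the hubs that point to it, chosen uniformly, and then step forward to one of the authorities that hub points to, again chosen uniformly. The sketch below propagates a probability distribution through that two-step move, starting from uniform probability. It is offered as an assumption-laden illustration rather than as the published method itself; the function name, the edge-list input, the iteration count, and the sample graph are all hypothetical.

```python
def authority_walk(edges, iterations=100):
    """Propagate probability through the two-step authority walk.

    edges is a list of (hub, authority) pairs, one per informative link.
    """
    in_links, out_links = {}, {}
    for hub, auth in edges:
        out_links.setdefault(hub, []).append(auth)
        in_links.setdefault(auth, []).append(hub)

    authorities = sorted(in_links)
    # Start from uniform probability over all authorities.
    score = {a: 1.0 / len(authorities) for a in authorities}
    for _ in range(iterations):
        # Backward step: each authority splits its mass among its in-linking hubs.
        hub_mass = {h: 0.0 for h in out_links}
        for a, mass in score.items():
            for h in in_links[a]:
                hub_mass[h] += mass / len(in_links[a])
        # Forward step: each hub splits its mass among the authorities it cites.
        new_score = {a: 0.0 for a in authorities}
        for h, mass in hub_mass.items():
            for a in out_links[h]:
                new_score[a] += mass / len(out_links[h])
        score = new_score
    return score


# Hypothetical bipartite graph: a1 has three incoming links, a2 has one.
sample_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
walk_scores = authority_walk(sample_edges)
```

For this connected sample graph the distribution settles at 0.75 for a1 and 0.25 for a2, which is the same 3:1 ratio as the number of links pointing to each page.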
It is therefore desirable to provide a method for ranking the relative quality, or relevance, of pages with respect to one another, that factors in the probability of a page being viewed without requiring a random walk.