The present invention relates to techniques for analyzing linked documents. More specifically, the invention relates to techniques for analyzing topics characterizing the linked documents, including considering the topic coherency of the documents as a part of the analysis.
The standard uniform pagerank methodology for linked documents is well understood. For example, see “The PageRank Citation Ranking: Bringing Order to the Web” 1998, by Larry Page, Sergey Brin, R. Motwani and T. Winograd (“Page 1998”). The standard uniform pagerank methodology refers to the computation of a single authority score per “page” (document). This single score for a particular page is an indication, based on links to the particular page from all the other pages in the set (e.g., the pages accessible on the World Wide Web), of the overall relevance of the particular page.
In a uniform pagerank, a link to the particular page from a linking page is an indication by the linking page of the relevance of the particular page. However, the indication of relevance of the particular page is reduced where the linking page also links to pages other than the particular page, and the amount of reduction depends on the number of such links. This is known as the “random jump” model, which reflects the probability that a random page “surfer” will get bored and jump to any page at random. Furthermore, the relevance conveyed by a link from a linking page to the particular page depends on the relevance of the linking page itself, as determined by all the other pages. This relevance is known as an “authority score.”
In a typical example, then, as described in the Page 1998 paper, standard uniform pagerank methodology is implemented using an iterative algorithm. FIG. 1 illustrates such an iterative algorithm in a very simplistic manner. Prior to the first iteration, a page 102 has an authority score of Score(0). After the first iteration, the same page has an authority score Score(1). After the second iteration, the same page has an authority score Score(2). After the third through Nth iteration, the same page has an authority score Score(N).
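The iteration from Score(0) to Score(N) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the Page 1998 paper: the function name `uniform_pagerank`, the damping value of 0.85, the fixed iteration count, and the handling of pages with no outgoing links are all choices made here for the example.

```python
def uniform_pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute one authority score per page.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    # Score(0): every page starts with the same score.
    scores = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random-jump share: the surfer may land on any page with equal probability.
        new_scores = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # The score passed along each link shrinks as the linking
                # page links to more pages.
                share = damping * scores[page] / len(targets)
                for t in targets:
                    new_scores[t] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # score evenly over all pages.
                share = damping * scores[page] / n
                for p in pages:
                    new_scores[p] += share
        scores = new_scores  # Score(k) becomes Score(k + 1)
    return scores


# Hypothetical three-page link graph.
example_scores = uniform_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Each pass of the outer loop corresponds to one iteration in FIG. 1, and the scores always sum to one because every page's score is either passed along its links or redistributed.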
A variation on the standard uniform pagerank methodology is standard topic pagerank, which computes different pagerank scores for different topics. The pagerank computation for each topic is independent (i.e., it restricts the set of pages to which the random surfer can jump to only those characterized by the topic), even though the computations for all the topics might run in parallel for efficiency of implementation. The authority score for a particular topic for a particular page is independent of the authority score for another topic for the same particular page (and, for that matter, for any other page).
Similar to FIG. 1, FIG. 2 simplistically illustrates an iterative algorithm for standard topic pagerank. Prior to the first iteration, each topic 102(a) through 102(d) of the page 102 has a topic-specific authority score of Score(a0) through Score(d0), respectively. After the first iteration, the same topics for the same page have a topic-specific authority score of Score(a1) through Score(d1), respectively. After the second iteration, the same topics for the same page have a topic-specific authority score of Score(a2) through Score(d2), respectively. After the third through Nth iteration, the same topics for the same page have a topic-specific authority score of Score(aN) through Score(dN), respectively.
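The per-topic iteration can be sketched in the same way. The sketch below is illustrative only: the function name `topic_pagerank`, the topic labels, the damping value, and the treatment of pages with no outgoing links are assumptions made for the example. The only change from a uniform computation is that random jumps are restricted to the pages characterized by the topic, and each topic is computed without reference to any other.

```python
def topic_pagerank(links, topic_pages, damping=0.85, iterations=50):
    """Compute an independent set of authority scores for each topic.

    links maps each page to the pages it links to; topic_pages maps each
    topic to the set of pages characterized by that topic.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    all_scores = {}
    for topic, members in topic_pages.items():
        jump_set = [p for p in pages if p in members]
        if not jump_set:
            raise ValueError(f"no known pages for topic {topic!r}")
        # Random jumps are restricted to pages characterized by this topic.
        jump = {p: (1.0 / len(jump_set) if p in members else 0.0) for p in pages}
        scores = dict(jump)  # topic-specific Score(0)
        for _ in range(iterations):
            new_scores = {p: (1.0 - damping) * jump[p] for p in pages}
            for page in pages:
                targets = links.get(page, [])
                if targets:
                    share = damping * scores[page] / len(targets)
                    for t in targets:
                        new_scores[t] += share
                else:
                    # Assumption: a page with no outgoing links returns its
                    # score to the topic's jump set.
                    for p in jump_set:
                        new_scores[p] += damping * scores[page] * jump[p]
            scores = new_scores
        # No value computed for one topic is read while computing another.
        all_scores[topic] = scores
    return all_scores


# Hypothetical link graph and topic assignments.
example_links = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
example_topics = {"sports": {"A"}, "music": {"D"}}
example_scores = topic_pagerank(example_links, example_topics)
```

In this hypothetical graph no page links to D, so D receives a non-zero score only for the topic whose jump set contains it.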