The present invention relates to techniques for computing authority of documents on the World Wide Web and, in particular, to techniques for taking user behavior into account when computing PageRank.
PageRank is a well-researched Web technology that draws on a variety of fields, from data compression to linear algebra. Conventional PageRank computes authority weights for HTML pages based on a random surfer model. In this model, the steady-state distribution of a Markov chain is computed from a transition matrix defined by a surfer who follows a page's out-links uniformly at random. To meet certain mathematical requirements (i.e., the conditions of the Perron-Frobenius Theorem), a blend of such a random surfer with uniform “teleportation” is typically used. In such an approach, a surfer either follows a random out-link with probability c, or “gets bored” and, with probability 1−c, starts a new session by jumping to a page selected uniformly at random; hence the term “teleportation.”
According to a conventional formulation, PageRank can be introduced as a vector p defined over all nodes of a Web graph that satisfies the following PageRank linear system:

$$p = cP^{T}p + (1-c)v. \qquad (1)$$

Here P is a Markov transition matrix in which

$$P_{ij} = \begin{cases} 1/\deg(i), & \text{if there is a link } i \to j, \\ 0, & \text{if there is no link } i \to j, \end{cases}$$

c is a teleportation coefficient usually picked around 0.85-0.9, v=(1/n,1/n, . . . ,1/n) is a uniform teleportation vector, and n is the total number of Web pages. The system can be rewritten in a more straightforward component-wise way that explicitly uses the Web graph structure (deg(i) is the out-degree of node i):

$$p_j = c\sum_{i \to j} p_i/\deg(i) + (1-c)v_j. \qquad (2)$$
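By way of illustration only, the component-wise form of equation (2) can be iterated directly on a small graph. The following is a minimal sketch and not a production solver. The function name `pagerank`, the three-node example graph, the stopping tolerance, and the assumption that every node has at least one out-link (no dangling nodes) are all illustrative choices that do not come from the text above.

```python
def pagerank(out_links, c=0.85, tol=1e-10, max_iter=200):
    """Iterate equation (2) until the update is smaller than `tol`.

    out_links maps each node to the list of nodes it links to.
    Assumption: every node has at least one out-link (no dangling nodes).
    """
    nodes = sorted(out_links)
    n = len(nodes)
    v = {j: 1.0 / n for j in nodes}   # uniform teleportation vector
    p = dict(v)                       # start from the uniform distribution
    for _ in range(max_iter):
        # Teleportation term (1 - c) * v_j for every node j.
        nxt = {j: (1.0 - c) * v[j] for j in nodes}
        # Link term: node i passes c * p_i / deg(i) to each node it links to.
        for i in nodes:
            share = c * p[i] / len(out_links[i])
            for j in out_links[i]:
                nxt[j] += share
        delta = sum(abs(nxt[j] - p[j]) for j in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical three-page graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this toy graph, page c receives links from both a and b and therefore ends up with the largest weight, while b, which is reached only through one of a's two out-links, ends up with the smallest.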
Many iterative methods of solving the PageRank equation (1) have been proposed. For an introduction to this subject, see A Survey on PageRank Computing, P. Berkhin, Internet Mathematics, Vol. 2, No. 1, pp. 73-120, 2005, incorporated herein by reference in its entirety for all purposes. Although the numerical properties of PageRank are relatively well studied, the usefulness of conventional formulations of PageRank in the relevancy ranking of query search results (one of its primary uses) is debatable. This is due in large part to the fact that some of the basic assumptions underlying widely used PageRank formulations are either flawed or not reflective of reality. Indeed, this fact is evidenced by the many attempts that have been made since its introduction to adjust PageRank formulations to more realistic settings.
For example, the assumption that a random surfer follows all the outgoing links in a Web page with equal probability is unrealistic. In reality, links can be classified into different groups, some of which are followed rarely if at all (e.g., disclaimer links). Likewise, “internal links” are known to be less reliable and more self-promotional than “external links,” yet the two are often weighted equally. Attempts to assign weights to links based on IR similarity measures have been made but are not widely used. See, for example, The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank, M. Richardson and P. Domingos, Advances in Neural Information Processing Systems 14, MIT Press, 2002.
The uniform teleportation jump to all Web pages is another example of an unrealistic assumption upon which conventional PageRank formulations are based. That is, users plainly do not begin new sessions on major portals and obscure home pages with equal probability. Alternatively, it is sometimes assumed that teleportation is restricted to a trusted set of pages or sites. See, for example, Combating Web Spam with TrustRank, Z. Gyongyi, H. Garcia-Molina, J. Pedersen, In Proceedings of 30th VLDB Conference, Toronto, Canada, ACM Press, 2004. However, this assumption is equally flawed in that it is intended to combat link spam rather than to reflect real-world user behavior. An additional and less recognized problem is that attrition rates vary widely from page to page and therefore cannot be described accurately by the single scalar coefficient 1−c.
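The two conventional choices of teleportation vector discussed above can be contrasted with a short sketch. The helper name `teleport_vector` and the page labels are hypothetical and serve only to illustrate the vector v of equation (1). Neither choice is derived from observed user behavior, which is the shortcoming noted above.

```python
def teleport_vector(pages, trusted=None):
    """Return a teleportation distribution over `pages`.

    With trusted=None the jump is uniform over all pages. Otherwise it is
    uniform over the trusted subset and zero everywhere else.
    """
    targets = set(pages) if trusted is None else set(trusted) & set(pages)
    if not targets:
        raise ValueError("teleportation set is empty")
    weight = 1.0 / len(targets)
    return {page: (weight if page in targets else 0.0) for page in pages}


# Hypothetical page labels, chosen only for illustration.
pages = ["portal", "news", "obscure-home-page", "spam"]
uniform_v = teleport_vector(pages)                              # every page gets 1/4
trusted_v = teleport_vector(pages, trusted={"portal", "news"})  # 1/2 each, 0 elsewhere
```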
Conventional PageRank formulations have another issue which relates to the manner in which they are used in practice. That is, because of the vast number of pages on the Web, PageRank computing is typically implemented with regard to aggregations of pages by site, host, or domain, an approach also referred to as “blocked” PageRank. See, for example, Exploiting the Block Structure of the Web for Computing PageRank, S. Kamvar, T. Haveliwala, C. Manning, G. Golub, Stanford University Technical Report, 2003. In formulating viable blocked PageRank computations, links between pages have to be aggregated in some way to the block level. Unfortunately, most heuristics for performing this aggregation do not work well.
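To make the aggregation step concrete, one naive heuristic is sketched below: count the page-level links between each pair of hosts, discard links that stay within a host, and normalize each host's counts into transition probabilities. This particular heuristic, the function name `host_transition`, and the example URLs are assumptions made for illustration. They are not taken from the cited report and are not put forward as a heuristic that works well.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_transition(page_links):
    """Aggregate (source_url, target_url) pairs into host-level weights.

    Returns {source_host: {target_host: probability}}, where each row sums
    to 1. Links that stay within a single host are ignored.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for src, dst in page_links:
        h_src = urlsplit(src).hostname
        h_dst = urlsplit(dst).hostname
        if h_src != h_dst:
            counts[h_src][h_dst] += 1
    result = {}
    for h_src, row in counts.items():
        total = sum(row.values())
        result[h_src] = {h_dst: k / total for h_dst, k in row.items()}
    return result


# Hypothetical page-level links between three hosts.
links = [
    ("http://a.example/1", "http://b.example/1"),
    ("http://a.example/2", "http://b.example/2"),
    ("http://a.example/2", "http://c.example/"),
    ("http://a.example/1", "http://a.example/2"),  # intra-host link, dropped
]
blocks = host_transition(links)
```

Here host a.example ends up pointing to b.example with weight 2/3 and to c.example with weight 1/3, regardless of which of its pages carried the links or how often those links are actually followed.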
In view of the foregoing, new formulations of PageRank are needed which address these shortcomings.