1. Field of the Invention
The present invention generally relates to techniques for ranking pages on the web. More specifically, the present invention relates to a method for producing a ranking for pages on the web by computing shortest distances from a set of seed pages to each of the pages to be ranked, wherein the seed pages and the pages to be ranked are interconnected with links.
2. Related Art
The relentless growth of the Internet has been largely fueled by the development of sophisticated search engines, which enable users to comb through billions of web pages looking for specific pages of interest. Because a given query can return millions of search results, it is important to be able to rank these search results to present high-quality results to the user.
A popular search engine developed by Google Inc. of Mountain View, Calif. uses PageRank® as a page-quality metric for efficiently guiding the processes of web crawling, index selection, and web page ranking. Generally, the PageRank technique computes and assigns a PageRank score to each web page it encounters on the web, wherein the PageRank score serves as a measure of the relative quality of a given web page with respect to other web pages. PageRank generally ensures that important and high-quality web pages receive high PageRank scores, which enables a search engine to efficiently rank the search results based on their associated PageRank scores.
PageRank scores are computed based on the web link-graph structure, wherein the web pages are the nodes of the link-graph which are interconnected with hyperlinks. In this model, PageRank R for a given web page p can be computed as:
$$\forall p \in P,\quad R(p) = (1 - d) + d \sum_{q \to p} \frac{R(q)}{|q|_{out}}, \tag{1}$$
wherein P is the set of all the web pages, |q|out is the out-degree of a specific page q in the set P, and 0≦d≦1 is a damping factor.
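As an illustration only, Equation (1) can be evaluated by repeated substitution, as in the following minimal Python sketch. The function name `pagerank`, the dictionary-based link-graph representation, the damping value d=0.85, and the fixed iteration count are assumptions made for this example and are not specified above.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate Equation (1): R(p) = (1 - d) + d * sum over q->p of R(q) / |q|_out.

    links maps each page q to a list of the distinct pages it links to.
    d=0.85 and the fixed iteration count are illustrative assumptions.
    """
    # Collect every page that appears as a link source or a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    rank = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        # Every page starts from the (1 - d) term of Equation (1).
        new_rank = {p: 1.0 - d for p in pages}
        for q, targets in links.items():
            if not targets:
                continue  # a page with no out-links contributes nothing
            share = d * rank[q] / len(targets)  # d * R(q) / |q|_out
            for p in targets:
                new_rank[p] += share
        rank = new_rank
    return rank


example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
example_ranks = pagerank(example_links)
```

In this small graph, page c is linked to by both a and b, so it ends up with a higher score than page b, which is linked to only by a.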
However, the simple formulation of Equation (1) for computing the PageRank is vulnerable to manipulations. Some web pages (called “spam pages”) can be designed to use various techniques to obtain artificially inflated PageRanks, for example, by forming “link farms” or creating “loops.”
One possible variation of PageRank that would reduce the effect of these techniques is to select a few “trusted” pages (also referred to as the seed pages) and discover other pages which are likely to be good by following the links from the trusted pages. For example, the technique can use a set of high-quality seed pages (s1, s2, . . . , sn), and for each seed page si, i=1, 2, . . . , n, the system can iteratively compute the PageRank scores for the set of the web pages P using the formula:
$$\forall\, s_i \neq p \in P,\quad R_i(p) = d \sum_{q \to p} \frac{R_i(q)}{|q|_{out}}\, w(q \to p), \tag{2}$$
where Ri(si)=1, and w(q→p) is an optional weight given to the link q→p based on its properties (with the default weight of 1).
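By way of example, Equation (2) can be sketched for a single seed as follows. The names `seed_scores` and `scores_for_all_seeds`, the optional `weight` callback standing in for w(q→p), the damping value d=0.85, and the fixed iteration count are hypothetical choices made for this illustration only.

```python
def seed_scores(links, seed, d=0.85, iterations=50, weight=None):
    """Iterate Equation (2) for one seed page s_i.

    links maps each page q to a list of the distinct pages it links to.
    weight(q, p) is a hypothetical callback for w(q -> p); it defaults to 1.
    """
    pages = set(links) | {seed}
    for targets in links.values():
        pages.update(targets)

    score = {p: 0.0 for p in pages}
    score[seed] = 1.0  # boundary condition R_i(s_i) = 1
    for _ in range(iterations):
        new_score = {p: 0.0 for p in pages}
        for q, targets in links.items():
            if not targets:
                continue
            for p in targets:
                w = 1.0 if weight is None else weight(q, p)
                # d * R_i(q) / |q|_out * w(q -> p)
                new_score[p] += d * score[q] / len(targets) * w
        new_score[seed] = 1.0  # the seed's score stays fixed at 1
        score = new_score
    return score


def scores_for_all_seeds(links, seeds, **kwargs):
    """Solve the whole system once per seed, one full pass for each seed."""
    return {s: seed_scores(links, s, **kwargs) for s in seeds}


chain = {"s": ["a"], "a": ["b"], "x": ["y"]}
from_s = seed_scores(chain, "s")
```

With the seed s, page a receives a score of d and page b a score of d squared, while pages x and y, which cannot be reached from s, receive a score of 0. Note that `scores_for_all_seeds` repeats the full iteration for each seed in turn.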
Generally, it is desirable to use a large number of seed pages to accommodate the different languages and the wide range of fields contained in the fast-growing web content. Unfortunately, this variation of PageRank requires solving the entire system for each seed separately. Hence, as the number of seed pages increases, the complexity of computation increases linearly, thereby limiting the number of seeds that can be practically used.
Hence, what is needed is a method and an apparatus for producing a ranking for pages on the web using a large number of diversified seed pages without the problems of the above-described techniques.