The proliferation of the World Wide Web has made enormous amounts of information available through the Internet, and numerous search engines are available to help users sort through the information. Typically, a user chooses a search service and enters a query. The search service accepts the query and returns a result list of documents or links that satisfy the query. It is desirable that the list of results be ordered such that the documents and/or links most relevant to the user's query appear first, and search engines typically include one or more algorithms that rank the search results for the user.
Ranking algorithms may be classified as query-dependent (also called dynamic) or query-independent (also called static). Query-dependent ranking algorithms use the terms in the query while query-independent algorithms do not. Query-independent ranking algorithms assign a quality score to each document on the web. Therefore, query-independent ranking algorithms can be run ahead of time and need not be rerun whenever a user submits a query.
Ranking algorithms may also be broadly classified into content-based, usage-based, and link-based ranking algorithms. Content-based ranking algorithms use the words in a document to rank the document among other documents. For example, a query-dependent content-based ranking algorithm could assign higher scores to documents that contain the query terms near the beginning of the document or in a prominent font. Usage-based ranking algorithms assign a score based on an estimate of how often the documents are viewed, for example, by examining web proxy logs or monitoring click-through on the results pages of the search engine. Link-based ranking algorithms use the hyperlinks between web pages to rank web pages. For example, a static link-based ranking algorithm could assign a score to each web page that is proportional to the number of links pointing to that page, based on the notion that each link pointing to a page is an endorsement of that page.
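The static link-based example above, in which a page's score is proportional to the number of links pointing to it, can be sketched as follows. This is a minimal illustration only; the function name `indegree_scores` and the four-page example graph are assumptions introduced here and do not come from any particular search engine.

```python
from collections import Counter


def indegree_scores(links):
    """Score each page by its share of all incoming links.

    `links` is an iterable of (source, target) pairs, one per hyperlink.
    Hypothetical helper for illustration only.
    """
    links = list(links)
    pages = {page for edge in links for page in edge}
    indegree = Counter(target for _, target in links)
    total = len(links)
    # Dividing by the total link count keeps scores proportional to indegree.
    return {page: (indegree[page] / total if total else 0.0) for page in pages}


# Hypothetical example graph: three pages link to "c", one page links to "a".
example_links = [("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
scores = indegree_scores(example_links)

# Highest score first; ties are broken alphabetically by page name.
ranking = sorted(scores, key=lambda page: (-scores[page], page))
print(ranking)
```

Because such a score depends only on the link structure and not on any query, it can be computed ahead of time, as noted above for query-independent algorithms.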
PageRank® is a well-known and commonly used query-independent link-based ranking algorithm. Assume that the set of known web pages and links between them induces a graph with vertex set V, where each vertex corresponds to a web page, and edge set E, where each directed edge (u,v) corresponds to a hyperlink from page u to page v. Let O(u) denote the outdegree of vertex u, i.e., the number of hyperlinks embedded in web page u, and let d be a number between 0 and 1, e.g., 0.15. Note that a page having an outdegree of 0 will need to be handled as a special case. The PageRank vector R is a vector whose values R(v) are normalized to have a total sum of 1 and satisfy the following equation:
$$R(v) = \frac{d}{|V|} + (1 - d) \sum_{(u,v) \in E} \frac{R(u)}{O(u)}$$
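One common way to approximate values R(v) that satisfy this equation is to start from a uniform vector and repeatedly substitute the current scores into the right-hand side. The following is a minimal sketch under stated assumptions, not a definitive implementation: the function name `pagerank`, the fixed iteration count, and the example graph are hypothetical, and the treatment of pages with outdegree 0 (spreading their score uniformly over all pages) is one possible choice for the special case mentioned above, not one prescribed by the text.

```python
def pagerank(pages, links, d=0.15, iterations=100):
    """Approximate R(v) = d/|V| + (1 - d) * sum over (u, v) in E of R(u)/O(u)."""
    pages = list(pages)
    n = len(pages)
    out_links = {page: [] for page in pages}
    for u, v in links:
        out_links[u].append(v)

    rank = {page: 1.0 / n for page in pages}  # uniform start, sums to 1
    for _ in range(iterations):
        # Assumption: a page with outdegree 0 spreads its score over all pages.
        dangling = sum(rank[p] for p in pages if not out_links[p])
        new_rank = {p: d / n + (1 - d) * dangling / n for p in pages}
        for u in pages:
            if out_links[u]:
                share = (1 - d) * rank[u] / len(out_links[u])
                for v in out_links[u]:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph; page "d" has no incoming links.
example_pages = ["a", "b", "c", "d"]
example_links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
scores = pagerank(example_pages, example_links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

In this sketch the scores sum to 1 after every iteration, and a page with no incoming links, such as "d", receives only the d/|V| term.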
The PageRank formula is often explained as follows. Consider a web surfer who is performing a random walk on the web. At every step along the walk, the surfer moves from one web page to another, using the following algorithm. With some probability d, the surfer selects a web page uniformly at random and jumps to it; otherwise, the surfer selects one of the outgoing hyperlinks in the current page uniformly at random and follows it. Because of this metaphor, the number d is sometimes called the “jump probability,” namely the probability that the surfer will jump to a completely random page. If the web surfer jumps with probability d and there are |V| web pages, the probability of jumping to a particular page is d/|V|. Since any page can be reached by jumping, every page is guaranteed a score of at least d/|V|. The PageRank of a particular web page is then the fraction of time that the random surfer will spend at that page.
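The random-surfer explanation can be illustrated with a small simulation that counts how often each page is visited. This is a hedged sketch: the function name `simulate_surfer`, the step count, the fixed seed, and the example graph are assumptions made for illustration, and a page without outgoing links is assumed to always trigger a random jump.

```python
import random


def simulate_surfer(pages, links, d=0.15, steps=200_000, seed=0):
    """Estimate the fraction of time a random surfer spends at each page."""
    rng = random.Random(seed)
    pages = list(pages)
    out_links = {page: [] for page in pages}
    for u, v in links:
        out_links[u].append(v)

    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        # With probability d, jump to a page chosen uniformly at random.
        # Assumption: a page with no outgoing links also forces a jump.
        if rng.random() < d or not out_links[current]:
            current = rng.choice(pages)
        else:
            # Otherwise follow one outgoing hyperlink chosen uniformly at random.
            current = rng.choice(out_links[current])
    return {page: visits[page] / steps for page in pages}


# Hypothetical four-page graph; page "d" can only be reached by jumping.
example_pages = ["a", "b", "c", "d"]
example_links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
fractions = simulate_surfer(example_pages, example_links)
for page in sorted(fractions, key=fractions.get, reverse=True):
    print(page, round(fractions[page], 3))
```

In this example, page "d" has no incoming links, so the surfer reaches it only by jumping, and its visit fraction should settle near d/|V| = 0.15/4.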
PageRank scores may be used to rank query results. A search engine employing PageRank will rank pages with high PageRank scores higher than those with low scores, assuming that everything else is the same. Since most users of search engines tend to examine only the first few results, operators of commercial web sites would certainly prefer that links to their sites appear early in the result listing, that is, that their web pages receive high PageRank scores. Thus, commercial web site operators clearly have an incentive to try to artificially increase the PageRank scores of the pages on their web sites.
One way to increase the PageRank score of a web page v is by having many other pages link to it. If all of the pages that link to web page v have low PageRank scores, each individual page would appear to contribute very little to the PageRank score of page v. However, since every linking page is guaranteed to have a minimum PageRank score of d/|V|, links from many such low quality pages can still inflate the PageRank score.
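The size of this effect can be bounded directly from the PageRank equation. As an illustration, assume that k pages link to page v and that each of them has an outdegree of at most m; the symbols k and m are introduced here for the example and do not appear in the original formulation. Since each linking page u has R(u) ≥ d/|V| and O(u) ≤ m:

```latex
R(v) \;=\; \frac{d}{|V|} + (1 - d) \sum_{(u,v) \in E} \frac{R(u)}{O(u)}
     \;\ge\; \frac{d}{|V|} + (1 - d)\, k \cdot \frac{d/|V|}{m}
     \;=\; \frac{d}{|V|} \left( 1 + \frac{(1 - d)\, k}{m} \right)
```

For example, with d = 0.15, k = 1000, and m = 2, the factor in parentheses is 1 + 0.85 · 1000 / 2 = 426, so page v is guaranteed a score of at least 426 times the minimum score d/|V|, even though every linking page may itself have only the minimum score.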
In practice, the vulnerability of PageRank to artificial inflation of scores is being exploited by web sites that contain a very large set of pages whose only purpose is to “endorse” a main home page. Typically, these endorsing pages contain a link to the page that is to be endorsed, and one or more links to other endorsing pages. Once a web crawler has stumbled across any of the endorsing pages, it continues to download more endorsing pages since the endorsing pages link to other endorsing pages, thereby accumulating a large number of endorsing pages. This large number of endorsing pages, all of them endorsing a single page, artificially inflates the PageRank score of the page that is being endorsed.
This problem was addressed and partially solved in U.S. Patent Publication No. 2005/0060297, entitled “Systems And Methods For Ranking Documents Based Upon Structurally Interrelated Information,” where the PageRank technique was modified to provide resistance to link spam by giving more weight to hosts/domains/servers that contain many web pages.
However, it remains desirable to find improved query-independent link-based ranking techniques. In particular, such techniques should significantly reduce the effects of artificially created endorsement links, and reduce the incentive for creating such links for the purpose of inflating PageRank scores.