Conventionally, most search engines use the link structure among web pages to compute a measure of the importance of each web page, which is generally considered when determining, for example, which web pages will be displayed for a given search query and/or an order of the query results. Typically, the idea is as follows: if a web page has many links from other web pages, then that web page is most likely an important one. By applying this idea iteratively and recursively, one can compute a score for each web page that is representative of the importance of the page. Two of the best-known algorithms for this purpose are the PageRank algorithm and the hubs and authorities algorithm. In the PageRank algorithm, each web page receives a PageRank score equal to the stationary probability of that node or vertex (e.g., web page) in a certain random process: a uniform random walk on the web graph with a restart probability that is uniform over all nodes of the graph. The PageRank of a web page, v, can be viewed as the sum of the individual contributions to v from each of the other web pages in the graph. Specifically, the contribution of a web page u to the PageRank of a web page v is defined to be the value of the page v in the personalized PageRank vector of the page u.
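As an illustration, this decomposition can be checked numerically on a toy graph: with a uniform restart distribution, the global PageRank vector equals the average of the per-page personalized PageRank vectors, so the contribution of u to v can be read off as the value at v in u's personalized PageRank vector. The following Python sketch assumes this standard formulation; the graph, the damping constant ALPHA, and all names are illustrative, not from the source.

```python
# Minimal sketch: PageRank as the average of personalized PageRank vectors.
ALPHA = 0.85  # probability of following a link (1 - restart probability); illustrative

# Tiny directed web graph: node -> list of out-links (every node has an out-link)
graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
nodes = sorted(graph)
n = len(nodes)

def personalized_pagerank(restart, iters=200):
    """Power iteration with restart distribution `restart` (a dict over nodes)."""
    p = dict(restart)
    for _ in range(iters):
        nxt = {v: (1 - ALPHA) * restart[v] for v in nodes}
        for u in nodes:
            share = ALPHA * p[u] / len(graph[u])
            for v in graph[u]:
                nxt[v] += share
        p = nxt
    return p

# Global PageRank: uniform restart over all nodes.
uniform = {v: 1.0 / n for v in nodes}
pr = personalized_pagerank(uniform)

# Contribution of u to v is ppr[u][v]; PageRank(v) is the average of these.
ppr = {u: personalized_pagerank({v: 1.0 if v == u else 0.0 for v in nodes})
       for u in nodes}
for v in nodes:
    assert abs(pr[v] - sum(ppr[u][v] for u in nodes) / n) < 1e-9
```

By linearity of the power iteration in the restart vector, the decomposition holds exactly at the fixed point, which is why the assertion passes on any graph without dangling pages.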
In many settings, it is important to find the set of web pages that contribute the most to the PageRank of a given page. For example, one difficulty that confronts today's search engines is a malicious and/or fraudulent activity known as “link spam” or “web spam”, whereby the rank that a search engine assigns to a web page is increased by manipulating link structure rather than by improving the content of the web page or its appeal to users. For example, many ad hoc yet independent web pages can be created that contain links to one another. As many of these ad hoc web pages can themselves have a large number of other (also potentially ad hoc) web pages linking to them, conventional search engines are prone to rank such pages more highly than is otherwise warranted. Today, the most common way to detect web spam is based on the content of the web page, yet such methods can be costly and inefficient.
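The inflation mechanism can be demonstrated on a toy graph: adding a small "link farm" of mutually linked pages that all point at a target page raises the target's PageRank relative to the same graph without those links. This is a hedged sketch under the standard PageRank formulation; the graph layout, ALPHA, and all names are illustrative assumptions.

```python
ALPHA = 0.85  # damping factor; illustrative choice

def pagerank(graph, iters=200):
    """Standard PageRank by power iteration; dangling pages spread mass uniformly."""
    nodes = sorted(graph)
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1 - ALPHA) / n for v in nodes}
        for u in nodes:
            out = graph[u] if graph[u] else nodes
            share = ALPHA * p[u] / len(out)
            for v in out:
                nxt[v] += share
        p = nxt
    return p

# A small web: target page 0 with a few organic in-links, plus a "farm" of
# ad hoc pages 10..19 that initially link only to one another.
farm = list(range(10, 20))
base = {0: [1], 1: [2], 2: [0], 3: [0]}
for s in farm:
    base[s] = [t for t in farm if t != s]

# Link spam: every farm page now also links to the target.
spam = {u: list(v) for u, v in base.items()}
for s in farm:
    spam[s] = spam[s] + [0]

before = pagerank(base)[0]
after = pagerank(spam)[0]
assert after > before  # the farm inflates the target's PageRank
```

Holding the node set fixed isolates the effect of the manipulated links themselves, which is the behavior link-spam detection aims to catch.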
Efficiently detecting link spam has become increasingly important in maintaining the integrity of search engines. Given one suspicious web page, one needs a method to quickly identify the set of pages that contribute significantly to the PageRank of that suspicious page, as well as the set of pages to whose PageRank the suspicious page contributes significantly. We refer to the former as the contribution set or the supporting set, and to the latter as the influence set of the suspicious page. Given that the web graph (e.g., a directed graph representative of the entire web) is massive and growing at a substantial rate, it can be essential to find these supporting and influence sets while examining as small a fraction of the full graph as possible.
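To make the two definitions concrete, here is a brute-force Python sketch that computes both sets by running a personalized PageRank from every node; the threshold delta, the graph, and all names are illustrative assumptions. Note that this naive approach scans the entire graph for every query, which is exactly the cost that local supporting-set algorithms aim to avoid.

```python
ALPHA = 0.85  # damping factor; illustrative

def personalized_pagerank(graph, source, iters=200):
    """Personalized PageRank with all restart mass on `source` (no dangling pages)."""
    nodes = sorted(graph)
    p = {v: 0.0 for v in nodes}
    p[source] = 1.0
    for _ in range(iters):
        nxt = {v: (1 - ALPHA) * (1.0 if v == source else 0.0) for v in nodes}
        for u in nodes:
            share = ALPHA * p[u] / len(graph[u])
            for v in graph[u]:
                nxt[v] += share
        p = nxt
    return p

def supporting_set(graph, target, delta):
    """Pages u whose contribution ppr_u(target) exceeds delta (brute force)."""
    return {u for u in graph
            if personalized_pagerank(graph, u)[target] > delta}

def influence_set(graph, source, delta):
    """Pages to which `source` contributes more than delta: read off ppr_source."""
    p = personalized_pagerank(graph, source)
    return {v for v, val in p.items() if val > delta}

g = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
assert 0 in supporting_set(g, 0, 0.1)      # a page supports itself
assert 2 in supporting_set(g, 0, 0.1)      # page 2 links directly to page 0
assert 3 not in influence_set(g, 0, 0.1)   # page 0 cannot reach page 3
```

Every page belongs to its own supporting set because the restart alone guarantees a contribution of at least 1 - ALPHA to itself.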