Many search engine services, such as Google and Yahoo, let users search for information that is accessible via the Internet. These services allow users to search for display pages, such as web pages, that may be of interest to them. After a user submits a search request (i.e., a query) that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service identifies web pages that may be related to the search request based on how well the keywords of a web page match the words of the query. The search engine service then displays to the user links to the identified web pages in an order that is based on a ranking that may be determined by their relevance to the query, popularity, importance, and/or some other measure.
One well-known technique for page ranking is PageRank, which is based on the principle that web pages will have links (i.e., “out links”) to important web pages. The importance of a web page is based on the number and importance of other web pages that link to that web page (i.e., “in links”). PageRank is based on a random surfer model of visiting web pages of a web graph (vertices representing web pages and edges representing hyperlinks) and represents the importance of a web page as the stationary probability of visiting that web page. In the random surfer model, a surfer visiting a current page will visit a next page by randomly selecting a link of the current web page. If the current web page has three out links to target web pages, then the probability of visiting each target web page from the current web page is ⅓. PageRank is thus based on a Markov random walk that only depends on the information (e.g., hyperlinks) of the current web page.
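The random surfer's choice among out links can be illustrated with a short sketch. The toy graph and the function name `transition_probabilities` are hypothetical, and the sketch assumes each page lists distinct target pages.

```python
from fractions import Fraction


def transition_probabilities(out_links):
    """Return {page: {target: probability}} for a surfer who picks
    one out link of the current page uniformly at random."""
    probs = {}
    for page, targets in out_links.items():
        if targets:
            p = Fraction(1, len(targets))
            probs[page] = {target: p for target in targets}
        else:
            # A page without out links has no link to follow.
            probs[page] = {}
    return probs


# Hypothetical toy graph: page "a" has three out links.
out_links = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a", "b"], "d": []}
probs = transition_probabilities(out_links)
print(probs["a"])  # each of the three targets has probability 1/3
```

Exact fractions are used here only to keep the ⅓ example free of rounding; floating-point values would serve equally well.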
A web graph may be represented as $G = \langle V, E \rangle$, where $V = \{1, 2, \ldots, n\}$ is the set of vertices and $E = \{\langle i, j \rangle \mid i, j \in V\}$ is the set of edges. The links between web pages can be represented by an adjacency matrix $A$, where $A_{ij}$ is set to one when there is an out link from a source web page $i$ to a target web page $j$. The importance score $w_j$ for web page $j$ can be represented by the following:
$$w_j = \sum_i A_{ij} w_i \tag{1}$$
This equation can be solved by iterative calculations based on the following:
$$A^T w = w \tag{2}$$
where $w$ is the vector of importance scores for the web pages and is the principal eigenvector of $A^T$.
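One way such an iterative calculation might look is a power iteration that repeatedly applies the transpose of the adjacency matrix. The sketch below is illustrative only: the function name, the toy graph, and the choice to rescale the scores to sum to one after every step are assumptions, since the text does not say how the scores are normalized.

```python
def importance_scores(adj, iterations=100):
    """Power iteration for the principal eigenvector of the transposed
    adjacency matrix. Scores are rescaled to sum to 1 at each step
    (an assumption; the text does not specify a normalization)."""
    n = len(adj)
    w = [1.0 / n] * n
    for _ in range(iterations):
        # Component j of (A^T w) is sum_i A_ij * w_i, as in Equation (1).
        nxt = [sum(adj[i][j] * w[i] for i in range(n)) for j in range(n)]
        total = sum(nxt)
        if total == 0:
            raise ValueError("graph has no links to propagate scores along")
        w = [x / total for x in nxt]
    return w


# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
adj = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
w = importance_scores(adj)
print(w)  # page 2, with two in links, receives the largest score
```

Convergence of this plain iteration depends on the graph; the toy graph is strongly connected and aperiodic, which is why a fixed number of iterations suffices here.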
PageRank may also factor in that a surfer may randomly select a web page to visit next that is not linked to by the current web page. Thus, the surfer may next visit a target web page of the current web page with a probability of $\alpha$ and next visit a randomly selected web page with a probability of $1 - \alpha$. To factor in this random selection of web pages, PageRank generates an initial transition matrix $P$ by normalizing each non-zero row of the adjacency matrix with the sum of its elements. PageRank then sets each element of a zero row in matrix $P$ to $1/n$ to generate transition probability matrix $\bar{P}$. The model of representing the random selection of links of target web pages and the random selection of web pages can be represented by the following:
$$\bar{\bar{P}} = \alpha \bar{P} + (1 - \alpha) U \tag{3}$$
where $\bar{\bar{P}}$ is the combined transition probability matrix and $U$ is a uniform probability distribution matrix in which each element is set to $1/n$. PageRank considers the stationary distribution $\pi = (\pi_1, \pi_2, \ldots, \pi_n)^T$ of the transition probability matrix $\bar{\bar{P}}$ to represent the importance of each web page. PageRank may compute the stationary distribution through an iterative process as represented by the following:
$$\pi(t+1) = (\bar{\bar{P}})^T \pi(t) \tag{4}$$
where $\pi(0) = (1, 1, \ldots, 1)_n^T$, $t$ represents the iteration count, and the iterative process continues until $\pi$ converges on a solution.
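The construction of the combined transition matrix and the iteration of Equation (4) can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name, the toy graph, the value 0.85 for the link-following probability, the convergence tolerance, and the starting vector scaled to sum to one (rather than the all-ones vector given above) are all assumptions made for the example.

```python
def pagerank(adj, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate pi(t+1) = P^T pi(t) on the combined transition matrix.
    alpha = 0.85 and tol are illustrative assumptions."""
    n = len(adj)
    # Normalize each non-zero row by its sum; a zero row becomes 1/n everywhere.
    row_stochastic = []
    for row in adj:
        s = sum(row)
        row_stochastic.append([x / s for x in row] if s else [1.0 / n] * n)
    # Combine with the uniform matrix U (every element 1/n), as in Equation (3).
    combined = [
        [alpha * row_stochastic[i][j] + (1 - alpha) / n for j in range(n)]
        for i in range(n)
    ]
    # Assumption: start from the uniform distribution so that pi sums to 1.
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(combined[i][j] * pi[i] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical toy graph; page 3 has no out links (a zero row).
adj = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
pi = pagerank(adj)
print(pi)  # page 2, linked to by pages 0 and 1, ranks highest
```

Because every row of the combined matrix sums to one, starting from a vector that sums to one keeps each iterate a probability distribution, which makes the convergence check straightforward.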
Although PageRank can be very useful, in part because it is a query-independent measure of importance, it is especially susceptible to “link spamming.” “Spamming” in general refers to a deliberate action taken to unjustifiably increase the rank, relevance, popularity, importance, and so on of a web page or a web site. In the case of link spamming, a spammer can manipulate links to unjustifiably increase the importance of a web page. For example, a spammer may provide a web page of useful information with hidden links to spam web pages. When many web pages point to the useful information, the importance of the spam web pages is indirectly increased. As another example, many web sites, such as blogs and web directories, allow visitors to post links. Spammers can post links to their spam web pages to directly or indirectly increase the importance of the spam web pages. As another example, a group of spammers may set up a link exchange mechanism in which their web sites point to each other to increase the importance of the web pages of the spammers' web sites.
Web spam presents problems for various techniques that rely on web data. For example, a search engine service that orders search results in part based on relevance, popularity, or importance of web pages may rank spam web pages unjustifiably high because of the spamming. As another example, a web crawler may spend valuable time crawling the links of spam web sites, which increases the overall cost of web crawling and may reduce its effectiveness.
PageRank is especially susceptible to link spamming because it is based on the concept of a “returning time” of a web page. Returning time of a web page is a measure of the number of transitions needed to return to the web page starting from the web page itself. The importance score of a web page is the reciprocal of the expected returning time. Thus, when a web page has a small returning time, it will have a large importance score. Returning time may be represented by the following:
$$T_i^+ = \min\{t \ge 1 : X_t = i\} \tag{5}$$
where $T_i^+$ represents the returning time for web page $i$ and $X_t$ represents a discrete-time Markov chain of the transition probability matrix for time $t$. The stationary transition probability, and thus importance, can be represented by the following:
$$\pi_i = \frac{1}{E_i T_i^+} \tag{6}$$
where $\pi_i$ represents the stationary probability for web page $i$ and $E_i T_i^+$ represents the expected returning time. Since the Markov chain starts from a target page $i$ itself, the behavior of the random walk is largely affected by the local structure around page $i$. Link spammers can create an arbitrary local structure around a web page to take advantage of this drawback of PageRank. A link spammer can set up a star-structured link farm in which a central web page contains links to many other boosting web pages and those boosting web pages contain links only to the central web page. In such a case, the random walk from the central web page is trapped in this star-like local structure, and the mean returning time can be significantly reduced, and therefore importance is increased. In the random walk model without a possibility of jumping to a random non-linked-to web page ($\alpha = 1$), all possible series of random walk transitions will be between the central web page and one of its boosting web pages, so the mean returning time is only 2. Even when the possibility of transitioning to a random web page is factored in ($\alpha < 1$), such a link farm can significantly reduce the mean returning time and increase importance.
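The effect of a star-structured link farm on returning time can be illustrated with a small Monte Carlo simulation. Everything in the sketch is hypothetical: the function name, the 200-page graph (a central page 0, boosting pages 1 to 20, and the remaining pages arranged in a ring), the value 0.85 for the link-following probability, the trial counts, and the fixed random seed.

```python
import random


def mean_returning_time(out_links, page, alpha, n_pages, trials, rng):
    """Estimate the mean number of transitions a random walk started at
    `page` needs to first come back to `page`."""
    total = 0
    for _ in range(trials):
        current, steps = page, 0
        while True:
            targets = out_links.get(current, [])
            if targets and rng.random() < alpha:
                current = rng.choice(targets)      # follow a random out link
            else:
                current = rng.randrange(n_pages)   # jump to a random page
            steps += 1
            if current == page:
                break
        total += steps
    return total / trials


# Hypothetical graph of 200 pages (all values are illustrative).
n_pages, k = 200, 20
out_links = {0: list(range(1, k + 1))}             # central page -> boosting pages
out_links.update({i: [0] for i in range(1, k + 1)})  # boosting pages -> central page
for i in range(k + 1, n_pages):                    # remaining pages form a ring
    out_links[i] = [i + 1 if i + 1 < n_pages else k + 1]

rng = random.Random(0)
farm_time_no_jump = mean_returning_time(out_links, 0, 1.0, n_pages, 1000, rng)
farm_time = mean_returning_time(out_links, 0, 0.85, n_pages, 5000, rng)
ring_time = mean_returning_time(out_links, 100, 0.85, n_pages, 2000, rng)
print(farm_time_no_jump)  # exactly 2.0: every walk goes out and straight back
print(farm_time, ring_time)
```

With random jumps enabled, the estimate for the central page stays far below the estimate for an ordinary ring page in this toy graph, which is the boost in importance that the link farm is built to obtain.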