Many search engine services, such as Google and Yahoo, provide for searching for information that is accessible via the Internet. These search engine services allow users to search for display pages, such as web pages, that may be of interest to users. After a user submits a search request (i.e., a query) that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service identifies web pages that may be related to the search request based on how well the keywords of a web page match the words of the query. The search engine service then displays to the user links to the identified web pages in an order that is based on a ranking that may be determined by their relevance to the query, popularity, importance, and/or some other measure.
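The keyword-to-page mapping described above can be sketched as a small inverted index. This is only an illustration: the function names, the sample URLs, and the scoring rule (count of matching query words) are assumptions made for the example and are not taken from any particular search engine service.

```python
from collections import defaultdict


def build_keyword_index(pages):
    """Map each keyword to the set of page URLs whose keywords include it."""
    index = defaultdict(set)
    for url, keywords in pages.items():
        for keyword in keywords:
            index[keyword.lower()].add(url)
    return index


def find_related_pages(index, query):
    """Return URLs ordered by how many query words match their keywords."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url in index.get(word, ()):
            scores[url] += 1
    # Highest match count first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical crawl output: URL -> keywords taken from headline and metadata.
pages = {
    "example.com/a": ["web", "search", "ranking"],
    "example.com/b": ["web", "spam"],
    "example.com/c": ["cooking"],
}
index = build_keyword_index(pages)
results = find_related_pages(index, "web ranking")
```

A production service would combine such a match score with a query-independent measure such as importance when ordering results.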
One well-known technique for page ranking is PageRank, which is based on the principle that web pages will have links (i.e., “out links”) to important web pages. The importance of a web page is based on the number and importance of other web pages that link to that web page (i.e., “in links”). PageRank is based on a random surfer model of visiting web pages of a web graph (vertices representing web pages and edges representing hyperlinks) and represents the importance of a web page as the stationary probability of visiting that web page. In the random surfer model, a surfer visiting a current web page will visit a next web page by randomly selecting a link of the current web page or by randomly jumping to any web page. If the current web page has three out links to target web pages, then the transition probability of visiting each target web page from the current web page using a link of the current web page is ⅓. The probability of jumping to any web page is typically set to equal the probability of jumping to any other web page. So, if there are n web pages, then the jumping probability is set to 1/n for each web page; the vector of these jumping probabilities is referred to as a jumping vector. PageRank is thus based on a Markov random walk that depends only on the information (e.g., hyperlinks) of the current web page and the jumping probabilities.
A web graph may be represented as $G = \langle V, E \rangle$, where $V = \{1, 2, \ldots, n\}$ is the set of vertices and $E = \{\langle i, j \rangle \mid i, j \in V\}$ is the set of edges. The links between web pages can be represented by an adjacency matrix $A$, where $A_{ij}$ is set to one when there is an out link from a source web page $i$ to a target web page $j$. The importance score $w_j$ for web page $j$ can be represented by the following:
$$w_j = \sum_i A_{ij} w_i \tag{1}$$
This equation can be solved by iterative calculations based on the following:
$$A^{T} w = w \tag{2}$$
where $w$ is the vector of importance scores for the web pages and is the principal eigenvector of $A^{T}$.
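As a minimal sketch of this representation, the following builds an adjacency matrix from a list of edges and applies equation (1) once to a vector of scores. The function names and the three-page graph are hypothetical, and the code numbers pages from 0 rather than from 1 as the text does.

```python
def adjacency_matrix(n, edges):
    """Return the n x n matrix A with A[i][j] = 1 for each out link i -> j."""
    matrix = [[0] * n for _ in range(n)]
    for source, target in edges:
        matrix[source][target] = 1
    return matrix


def propagate_scores(matrix, scores):
    """Apply equation (1) once: new[j] = sum over i of A[i][j] * scores[i]."""
    n = len(matrix)
    return [sum(matrix[i][j] * scores[i] for i in range(n)) for j in range(n)]


# Hypothetical three-page graph (0-indexed): 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
A = adjacency_matrix(3, edges)
w = propagate_scores(A, [1, 1, 1])  # page 2 has two in links, so w[2] == 2
```

Repeating this step is the basis of the iterative calculation; practical implementations typically rescale the score vector after each step so that its magnitude stays bounded.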
As discussed above, a page ranking algorithm may also factor in that a surfer may randomly select a web page to visit next that is not linked to by the current web page. Thus, the surfer may next visit a target web page of the current web page with a probability of $\alpha$ and next visit a randomly selected web page with a probability of $1-\alpha$. To factor in this random selection of web pages, the page ranking algorithm generates an initial transition probability matrix $P$ by normalizing each non-zero row of the adjacency matrix with the sum of its elements. The page ranking algorithm then sets each element of a zero row in matrix $P$ to $1/n$ to generate transition probability matrix $\bar{P}$. The model of representing the random selection of links of target web pages and the random selection of web pages can be represented by the following:
$$\tilde{P} = \alpha \bar{P} + (1-\alpha) U \tag{3}$$
where $\tilde{P}$ is the combined transition probability matrix and $U$ is a uniform probability distribution matrix in which each element is set to $1/n$. The uniform probability distribution matrix $U$ may be generated by multiplying the jumping vector by the unit vector as represented by the following:
$$\tilde{P} = \alpha \bar{P} + (1-\alpha) e^{T} v$$
where $e$ represents the unit vector (a row vector in which each element is one) and $v$ represents the jumping vector (also a row vector, so that $e^{T} v$ is an $n \times n$ matrix). The page ranking algorithm considers the stationary probability distribution $\pi = (\pi_1, \pi_2, \ldots, \pi_n)^{T}$ of the transition probability matrix $\tilde{P}$ to represent the importance of each web page. The page ranking algorithm may compute the stationary distribution through an iterative process as represented by the following:
$$\pi(t+1) = (\tilde{P})^{T} \pi(t) \tag{4}$$
where $\pi(0) = (1, 1, \ldots, 1)_n^{T}$, $t$ represents the iteration count, and the iterative process continues until $\pi$ converges on a solution. The stationary probability distribution is represented by the principal eigenvector, which may be calculated using a standard power iteration technique.
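The steps above (row normalization, filling zero rows with 1/n, blending with uniform jumps, and power iteration) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: the function names, the value 0.85 for α, the convergence tolerance, the final rescaling so that the scores sum to one, and the four-page graph are all choices made for the example and are not specified in the text.

```python
def transition_matrix(adjacency):
    """Normalize each non-zero row by its sum; fill each zero row with 1/n."""
    n = len(adjacency)
    rows = []
    for row in adjacency:
        total = sum(row)
        if total == 0:
            rows.append([1.0 / n] * n)
        else:
            rows.append([value / total for value in row])
    return rows


def combined_matrix(adjacency, alpha):
    """Blend link following (weight alpha) with uniform jumps (weight 1 - alpha)."""
    n = len(adjacency)
    linked = transition_matrix(adjacency)
    return [[alpha * linked[i][j] + (1.0 - alpha) / n for j in range(n)]
            for i in range(n)]


def stationary_scores(adjacency, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate pi <- P^T pi from an all-ones start until the change is below tol."""
    n = len(adjacency)
    p = combined_matrix(adjacency, alpha)
    pi = [1.0] * n
    for _ in range(max_iter):
        nxt = [sum(p[i][j] * pi[i] for i in range(n)) for j in range(n)]
        change = sum(abs(a - b) for a, b in zip(nxt, pi))
        pi = nxt
        if change < tol:
            break
    total = sum(pi)
    return [value / total for value in pi]  # assumption: rescale to sum to one


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0; page 3 has no links at all.
adjacency = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
scores = stationary_scores(adjacency)
```

In this graph, page 2 receives links from pages 0 and 1 and ends up with the highest score, while page 3, which no page links to, is reached only by random jumps and scores lowest.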
Although a page ranking algorithm can be very useful, in part because it is a query-independent measure of importance, it is especially susceptible to “link spamming.” “Spamming” in general refers to a deliberate action taken to unjustifiably increase the rank, relevance, popularity, importance, and so on of a web page or a web site. In the case of link spamming, a spammer can manipulate links to unjustifiably increase the importance of a web page. For example, a spammer may provide a web page of useful information with hidden links to spam web pages. When many web pages point to the useful information, the importance of the spam web pages is indirectly increased. As another example, many web sites, such as blogging sites and web directories, allow visitors to post links. Spammers can post links to their spam web pages to directly or indirectly increase the importance of the spam web pages. As another example, a group of spammers may set up a link exchange mechanism in which their web sites point to each other to increase the importance of the web pages of the spammers' web sites.
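The effect of a link exchange can be illustrated with a small experiment that scores the same set of pages before and after three spam pages begin linking to one another. The sketch below is hypothetical throughout: the page names, the graph, the value 0.85 for the link-following probability, and the fixed iteration count are assumptions, and the function assumes that every page has at least one out link.

```python
def pagerank(out_links, alpha=0.85, iterations=200):
    """Power iteration over a dict mapping each page to its list of targets.

    Assumes every page has at least one out link, so no zero rows arise.
    """
    pages = sorted(out_links)
    n = len(pages)
    score = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page first receives its share of the random jumps.
        nxt = {page: (1.0 - alpha) / n for page in pages}
        for page in pages:
            share = alpha * score[page] / len(out_links[page])
            for target in out_links[page]:
                nxt[target] += share
        score = nxt
    return score


# Before: the spam pages link out, but no page links to them.
before = {
    "a": ["b"], "b": ["c"], "c": ["a"],
    "spam1": ["a"], "spam2": ["a"], "spam3": ["a"],
}
# After: the spam pages exchange links with one another.
after = {
    "a": ["b"], "b": ["c"], "c": ["a"],
    "spam1": ["a", "spam2", "spam3"],
    "spam2": ["a", "spam1"],
    "spam3": ["a", "spam1"],
}
score_before = pagerank(before)
score_after = pagerank(after)
```

Before the exchange, each spam page is reached only by random jumps; afterward, each also receives score through the in links from the other spam pages, so its importance rises even though no legitimate page links to it.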
Web spam presents problems for various techniques that rely on web data. For example, a search engine service that orders search results in part based on relevance, popularity, or importance of web pages may rank spam web pages unjustifiably high because of the spamming. Users of such search engine services may be dissatisfied when spam pages are ranked unjustifiably high and may stop using that search engine service. As another example, a web crawler may spend valuable time crawling the links of spam web sites, which increases the overall cost of web crawling and may reduce its effectiveness.