In general, when searching for information with a search engine, the number of web pages that can reasonably be returned as relevant to a given search is far too large for a human user to digest. To provide effective search under these conditions, methods are needed to filter, from a huge collection of relevant pages, a small set of the most authoritative or definitive ones. To facilitate this filtering, search engines use the link structure of the web graph to rank the importance of web pages and their relevance to a particular subject. Two of the best-known algorithms for this purpose are the page-rank algorithm and the hubs and authorities algorithm. The page-rank algorithm is used by the Google search engine and was originally formulated by Sergey Brin and Larry Page in their paper “The Anatomy of a Large-Scale Hypertextual Web Search Engine.” It is based on the premise, prevalent in academia, that the importance of a research paper can be judged by the number of citations it receives from other important research papers. Brin and Page transferred this premise to its web equivalent: the importance of a web page can be judged by the number of hyperlinks pointing to it from other web pages.
The page rank of a web page is calculated as a linear combination of two terms: (i) the sum, over each page linking to it, of that page's page rank divided by the number of links on that page, and (ii) a constant term, referred to as random restart. From a search engine marketer's point of view, this implies that there are two ways in which page rank can affect the position of a page. The first is the number of incoming links: the more incoming links a page has, the better the ranking it can receive. The algorithm also implies that no incoming link can have a negative effect on the page rank of the page it points at; at worst, it has no effect at all. The second is the number of outgoing links on each page that points at a given page: the ranking of a page increases if the pages pointing to it have fewer outgoing links. For example, given two pages of equal page rank linking to the same page, one with 5 outgoing links and the other with 10, the target page receives twice the increase in page rank from the page with only 5 outgoing links. Web spammers, however, can exploit these known aspects of the algorithm to artificially increase the popularity of their web pages.
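The two-term calculation described above can be illustrated with a minimal Python sketch. It is not the implementation used by any search engine. The function names, the damping weight of 0.85, the fixed iteration count, and the even redistribution of rank from pages without outgoing links are all assumptions made for illustration.

```python
def link_contribution(rank, out_links, damping=0.85):
    """Rank passed along one link: the linking page's rank is split
    evenly across its outgoing links, then weighted by the damping factor."""
    return damping * rank / out_links


def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute page rank.

    `links` maps each page to the list of pages it links to
    (assumed to contain no duplicate targets).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Term (ii): the constant "random restart" share every page receives.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Term (i): each target receives rank / number of links.
                for t in targets:
                    new_rank[t] += link_contribution(rank[page], len(targets), damping)
            else:
                # Assumption: a page with no outgoing links spreads its
                # rank evenly over all pages so that no rank is lost.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page graph: "a" and "b" both link to "c"; "c" links to "a".
example_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
example_rank = pagerank(example_links)
```

Because every contribution is non-negative, an incoming link can never lower the rank of its target, and `link_contribution(r, 5)` is twice `link_contribution(r, 10)` for any rank `r`, matching the example of 5 versus 10 outgoing links.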
Similarly, the hubs and authorities algorithm can be exploited. In general, hyperlinks encode a considerable amount of latent human judgment. By creating a link to another page, the creator of that link has “conferred authority” on the target page. Links thus afford the opportunity to find potential authorities purely through the pages that point to them. The algorithm is based on the relationship between the authorities for a topic and the pages that link to many related authorities, where pages of the latter type are referred to as hubs.
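The mutual relationship between hubs and authorities can be sketched in Python as follows. This is an illustrative sketch only: the source describes the relationship but not an update rule, so the iterative scheme here (authority as the sum of the hub scores of linking pages, hub as the sum of the authority scores of linked pages, followed by normalization) is the commonly described formulation and is an assumption, as are the function name, the iteration count, and the example graph.

```python
import math


def hubs_and_authorities(links, iterations=50):
    """Return (hub, authority) score dictionaries.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: three pages linking to two candidate authorities.
example_links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
example_hub, example_auth = hubs_and_authorities(example_links)
```

In this example, "a1" ends up with the highest authority score because the most pages point to it, and "h1" and "h2" score higher as hubs than "h3" because they link to both authorities. The same property suggests how the algorithm can be exploited: a spammer who controls many linking pages can inflate the authority of a target.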
Web spammers have learned how to exploit the link structure employed by ranking algorithms to improve their rank in search engines. The main method of detecting web spam is based on the content of the web pages, but this is very costly in terms of processing time. Moreover, if web pages are ranked in order to prioritize them during the crawling stage, some information about web spam must be extracted before complete information on the content of the pages is available. Hence, methods are needed to detect web spam efficiently, based on the link structure.