As more and more people rely on the wealth of online information, increased exposure on the Web may yield significant financial gains for individuals or organizations. Most frequently, search engines are the entryways to the Web. Often, when a user searches the Web using a search engine, only top-ranked pages receive the attention of the user. In general, the higher a page ranks, the greater its chance of receiving the user's attention. While search engine ranking aims to provide the most relevant information to users, owners of webpages all desire a higher ranking by the search engine in order to gain an advantage over others. For this reason, some people try to mislead search engines so that their pages rank artificially high in search results and thus capture undeserved user attention. Web spamming refers to such actions intended to mislead search engines into ranking some webpages higher than they deserve.
Web spamming is a major problem for search engines. It can significantly degrade the quality of search engine results, and it imposes substantial costs on search engines, which must crawl, index, and store the spam pages. Web spamming is also a serious problem for Web users, because users are typically unaware of the spamming practice and tend to trust search results based on the general reputation of the search engine used.
There is a variety of Web spamming techniques, all specifically targeting search engine ranking techniques. One practice is to introduce artificial text into webpages, and another is to introduce page links, to affect the result of searches. The latter is called link spamming, which is one of the popular web spam techniques, as further discussed below.
Web spamming techniques have also evolved over time. First-generation spam involved keyword stuffing, when ranking was dependent on document similarity. Second-generation spam involved link farms, when ranking was largely dependent on site popularity. Third-generation spam uses mutual link exchange through “mutual admiration societies”, now that ranking is largely dependent on page reputation. In general, third-generation Web spamming is harder to detect than the previous generations.
Link spamming techniques, which include buying/selling links, exchanging links, and constructing link farms, are a major category of the commonly used spam techniques. Link spamming refers to cases where spammers set up structures of interconnected pages in order to boost their rankings in link structure-based ranking systems such as PageRank. Since link analysis is a crucial factor for commercial search engines, link spam is among the most popular and harmful spam techniques that search engines face today.
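The effect of a link farm on a link structure-based ranking can be illustrated with a minimal sketch. The code below is a plain, textbook-style PageRank iteration, not any search engine's actual implementation; the toy graph, the page names, the damping factor of 0.85, and the iteration count are all assumptions made for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Plain PageRank by power iteration over a dict: page -> list of linked pages."""
    pages = sorted(set(out_links) | {q for ts in out_links.values() for q in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not out_links.get(p))
        nxt = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = out_links.get(p, [])
            for q in targets:
                # Each page divides its rank evenly among the pages it links to.
                nxt[q] += damping * rank[p] / len(targets)
        rank = nxt
    return rank


# Hypothetical toy web: "target" has a single legitimate in-link (from "blog").
honest_web = {
    "home": ["news", "blog"],
    "news": ["home"],
    "blog": ["home", "target"],
    "target": ["home"],
}
# Hypothetical link farm: 20 throwaway pages whose only purpose is to link to "target".
link_farm = {f"farm{i}": ["target"] for i in range(20)}
spammed_web = {**honest_web, **link_farm}

before = pagerank(honest_web)
after = pagerank(spammed_web)
print("target vs. news before farm:", before["target"], before["news"])
print("target vs. news after farm: ", after["target"], after["news"])
```

In this toy graph, "target" ranks below "news" without the farm and above it once the farm pages are added, even though no legitimate page changed its links.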
Many anti-link spam methods, such as TrustRank, BadRank, and SpamRank, have been proposed to tackle the problem. These include methods for automatically finding and then penalizing link spam. Automatic detection is important because, while human experts may be able to identify spam, it is too expensive to manually evaluate a large number of pages.
For example, TrustRank is a link analysis technique used for semi-automatically separating useful webpages from spam. TrustRank combats web spam by propagating trust among web pages. The method selects a small set of seed pages to be evaluated by an expert. Once the reputable seed pages are manually identified, a crawl extending outward from the seed set seeks out similarly reliable and trustworthy pages. TrustRank's reliability diminishes as documents become further removed from the seed set. This type of propagation may be suited for propagating authority, but it is not optimal for calculating trust scores for demoting spam sites.
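The propagation step described above can be sketched as a PageRank-style iteration whose random jumps land only on the seed set. This is a simplified illustration under stated assumptions, not the published TrustRank procedure in full: seed selection and expert evaluation are omitted, and the graph, the page names, the damping factor, and the iteration count are hypothetical.

```python
def trust_scores(out_links, seeds, damping=0.85, iterations=50):
    """Propagate trust from hand-picked seed pages along outgoing links."""
    pages = set(out_links) | {q for ts in out_links.values() for q in ts}
    # All initial trust sits on the seed set, split evenly among the seeds.
    seed_share = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_share)
    for _ in range(iterations):
        # Only seed pages receive the "teleport" share of trust.
        nxt = {p: (1.0 - damping) * seed_share[p] for p in pages}
        for p in pages:
            targets = out_links.get(p, [])
            for q in targets:
                # A page splits its trust evenly among the pages it links to.
                nxt[q] += damping * trust[p] / len(targets)
        trust = nxt
    return trust


# Hypothetical graph: a chain leading away from one trusted seed,
# plus two spam pages that link only to each other.
graph = {
    "seed": ["a"],
    "a": ["b"],
    "b": [],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
scores = trust_scores(graph, seeds={"seed"})
for page in ["seed", "a", "b", "spam1", "spam2"]:
    print(page, scores[page])
```

In this sketch the score shrinks with each hop away from the seed, and pages that no trusted page reaches keep a score of zero, which mirrors the observation that reliability diminishes with distance from the seed set.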
In comparison, BadRank is an anti-spamming technique that downgrades pages found within a linking network that fits the characteristics of spam. BadRank has been used by search engines against link farms. BadRank is practically an inverse PageRank, in which a page gets a high score if it points to many pages with a high BadRank score. BadRank thus resembles an “opposite TrustRank”. One advantage of BadRank over TrustRank is that good pages cannot be marked as spam.
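The idea of an inverse PageRank, where a page inherits a bad score from the pages it points to, can be sketched as follows. No official BadRank formula has been published, so the update rule here (each bad page shares its score equally among the pages linking to it) is an assumption made for illustration, as are the graph, the page names, and the parameter values.

```python
def bad_scores(out_links, known_spam, damping=0.85, iterations=50):
    """Propagate a bad score backwards: linking to bad pages makes a page bad."""
    pages = set(out_links) | {q for ts in out_links.values() for q in ts}
    # Count inbound links so a bad page's score is shared among its linkers.
    in_degree = {p: 0 for p in pages}
    for targets in out_links.values():
        for q in targets:
            in_degree[q] += 1
    # Pages already known to be spam carry a fixed base score (assumed to be 1.0).
    base = {p: (1.0 if p in known_spam else 0.0) for p in pages}
    bad = dict(base)
    for _ in range(iterations):
        nxt = {}
        for p in pages:
            inherited = sum(bad[q] / in_degree[q] for q in out_links.get(p, []))
            nxt[p] = (1.0 - damping) * base[p] + damping * inherited
        bad = nxt
    return bad


# Hypothetical graph: two pages boost a known spam page; two honest pages do not.
graph = {
    "booster1": ["farm"],
    "booster2": ["farm"],
    "farm": [],
    "honest": ["news"],
    "news": [],
}
scores = bad_scores(graph, known_spam={"farm"})
for page in sorted(scores):
    print(page, scores[page])
```

Here the two boosting pages pick up a positive bad score because they link to the known spam page, while the honest pages, which link only to each other's neighborhood, stay at zero.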
Furthermore, the concept of spam mass, a measure of the impact of link spamming on a page's ranking, has also been introduced. There have been discussions of how to estimate spam mass and how the estimates can help identify pages that benefit significantly from link spamming. Other proposed techniques target a different type of noisy link structure, residing at the site level. These techniques investigate and try to eliminate or frustrate site-level mutual reinforcement relationships, abnormal support coming from one site toward another, and complex alliances between websites.
All of the above methods are based on heuristics or statistical properties, and they cannot effectively resist spam in certain situations. With the existing anti-spamming techniques, the link spam problem has yet to be solved. Given the importance of search engine anti-spamming, it is desirable to develop new anti-spamming techniques to protect the integrity of search engine ranking.