Web search engines are currently in wide use. They return a ranked list of web sites in response to a search query input by a user. It can be very valuable to have a web page returned high in the ranked list of web pages for a wide variety of different queries, because this may increase the likelihood that a user will view that web page.
Therefore, in order to increase web traffic to a given site, the authors of certain sites have tried to artificially manipulate the ranked list returned by search engines such that the web sites authored by those authors are ranked higher than they would normally be ranked. The particular manipulation techniques used by such authors depend on how a given web search engine ranks the pages for a given query. Any of the different manipulation techniques used by such authors are referred to as “spamming” techniques.
Some search engines use link analysis algorithms in order to generate the ranked list of web pages returned in response to a query. In general, link analysis algorithms identify an importance of a given web page, based upon the number of links that point to that web page. It is assumed that related web pages (those that have related content) have links to one another. Therefore, the more links that point to a web page, the more important the web page may be regarded by the search engine.
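The link-counting idea described above can be illustrated with a minimal sketch. The function name `inlink_counts` and the three-page toy graph are hypothetical and are used only for illustration; an actual search engine would combine link counts with many other signals.

```python
from collections import Counter


def inlink_counts(links):
    """Return {page: number of inlinks} for (source, target) link pairs."""
    counts = Counter(target for _, target in links)
    # Pages that only appear as a source still get an entry, with zero inlinks.
    for source, _ in links:
        counts.setdefault(source, 0)
    return dict(counts)


# Hypothetical toy graph: pages "a" and "b" both link to "c"; "c" links to "a".
links = [("a", "c"), ("b", "c"), ("c", "a")]
counts = inlink_counts(links)

# Rank by inlink count, highest first; ties are broken alphabetically.
ranked = sorted(counts, key=lambda page: (-counts[page], page))
print(ranked)
```

Because the score depends only on how many links point to a page, any technique that adds links to a page raises its position in this ordering.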
In order to manipulate this type of search engine, web spammers (those employing spamming techniques) sometimes attempt to create a large number of links to their web pages by having unrelated web pages (web pages with unrelated content) linked to their web pages. This can be done using automated techniques to post links to their web sites onto other web pages, or simply by creating a large number of their own web pages and web sites, and then placing links in those web pages and web sites to all the other web pages and web sites which they created. This increases the number of links to any given web page or web site created by the author, regardless of whether it has related content. Similarly, some web sites reciprocally exchange links. When two unrelated web sites exchange links, at least one, and possibly both, of them is very likely to be spam (a web site that receives the benefit of spamming techniques).
It can be seen that spamming techniques can produce spam that misleads a search engine into returning low quality, or even entirely irrelevant, information to a user in response to a query. Therefore, a number of techniques have been developed in order to identify spam so that it can be removed from the ranked search results returned by a search engine. For instance, human experts can generally identify web spam in a very effective manner. However, it is quite easy for a spammer to create a large number of spam pages and to manipulate their link structure. It is thus impractical to detect web spam using only human judges. Therefore, some automatic approaches have been developed to identify spam. One category of such approaches is referred to as a supervised approach, in which some known examples of spam are provided to the system, and the system learns to recognize spam from those examples.
One such technique builds a ranking measure for web pages modeled on a user randomly following hyperlinks through the web pages. This ranking measure is well known as PageRank, which is used by the Google search engine. At each web page, the modeled user either selects an outlink uniformly at random to follow with a certain probability, or jumps to a new web page selected from the whole web uniformly at random with the remaining probability. The stationary probability of a web page in this “random walk” is regarded as the ranking score of the web page. The basic assumption behind such a technique is that a hyperlink from one page to another is a recommendation of the second page by the author of the first page. If this assumption is recursively applied, then a web page is considered to be important if many important web pages point to it.
By using random jumps to uniformly selected pages, this system accommodates the problem that some high quality pages have no outlinks, although they are pointed to by many other web pages.
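The random walk described above can be sketched with a simple power iteration. This is a minimal illustration and not the implementation used by any search engine. The function name `pagerank`, the follow probability of 0.85, the iteration count, and the toy graph are assumptions, as is the convention that a page with no outlinks always jumps uniformly at random.

```python
def pagerank(out_links, follow_prob=0.85, iterations=100):
    """Approximate the stationary probabilities of the random walk.

    out_links maps every page to the list of pages it links to; every
    link target must also appear as a key (assumption of this sketch).
    """
    pages = sorted(out_links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Mass arriving through uniform random jumps.
        new_rank = {page: (1.0 - follow_prob) / n for page in pages}
        for page in pages:
            targets = out_links[page]
            if targets:
                # Follow one outlink chosen uniformly at random.
                share = follow_prob * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumed convention: a page with no outlinks jumps uniformly.
                share = follow_prob * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: "c" has no outlinks but is pointed to by a, b and d.
graph = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["c"]}
rank = pagerank(graph)
print(max(rank, key=rank.get))
```

In this toy graph the page without outlinks still receives the highest score, and the scores remain a probability distribution because its mass is redistributed through the random jump.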
This concept of random jumps has also been adopted, in another way, to address the problem of web spam. Basically, the random user described above is allowed to jump to a set of pages (seed pages) which have been judged as being high quality, normal pages, by human experts. Assuming this choice for the random jumps, the stationary probability of a web page is regarded as its trust score, and a web page with a trust score smaller than a given threshold value is considered to be spam.
This type of system can also be understood as follows: initially, only the selected good seed pages have trust scores equal to one, and the trust scores of other web pages are zero. Each seed page then iteratively propagates its trust score to its neighbors, and its neighbors further propagate their received scores to their neighbors. The underlying assumption in this algorithm is that web pages of high quality seldom point to spam pages.
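The trust propagation described above can be sketched as a random walk whose jumps land only on seed pages. This is a minimal illustration under stated assumptions: the function name `trust_scores`, the follow probability, the threshold of 0.01, and the toy graph are hypothetical, and the seed scores are normalized to sum to one so that the result stays a probability distribution (the description above starts each seed at one).

```python
def trust_scores(out_links, seeds, follow_prob=0.85, iterations=100):
    """Random walk along outlinks whose jumps land only on seed pages."""
    pages = sorted(out_links)
    # Jump distribution: uniform over the seed pages, zero elsewhere.
    jump = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    score = dict(jump)
    for _ in range(iterations):
        new_score = {p: (1.0 - follow_prob) * jump[p] for p in pages}
        for page in pages:
            targets = out_links[page]
            if targets:
                share = follow_prob * score[page] / len(targets)
                for target in targets:
                    new_score[target] += share
            else:
                # Assumed convention: a page with no outlinks returns to the seeds.
                for target in pages:
                    new_score[target] += follow_prob * score[page] * jump[target]
        score = new_score
    return score


# Hypothetical toy graph: g1..g3 are normal pages, s1 and s2 are spam.
# No normal page links to a spam page, matching the stated assumption.
graph = {
    "g1": ["g2"], "g2": ["g3"], "g3": ["g1"],
    "s1": ["s2", "g1"], "s2": ["s1"],
}
scores = trust_scores(graph, seeds={"g1"})
threshold = 0.01  # illustrative value only
spam = {page for page, s in scores.items() if s < threshold}
print(sorted(spam))
```

Trust never reaches s1 or s2 here because no page reachable from the seed links to them, so both fall below the threshold.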
A counterpart to this algorithm allows the random web user to either select an inlink uniformly at random to follow, in reverse, with a certain probability, or jump to a new web page randomly selected from a web page set which has been judged as spam by human experts with the remaining probability. The stationary probability of a web page is, in this system, referred to as its antitrust rank, or antitrust score. A web page will be classified as spam if its score is larger than a chosen threshold value. In terms of the propagation understanding, the scores in this system are propagated in the reverse direction along the inlinks. The basic underlying assumption of this type of system is that a web page pointing to spam pages is likely to be spam, itself.
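The reverse propagation described above can be sketched by running the same kind of walk on the reversed link graph, with jumps landing on known spam pages. The names `reverse_graph` and `anti_trust_scores`, the follow probability, the threshold of 0.01, and the toy graph are assumptions made for illustration.

```python
def reverse_graph(out_links):
    """Return {page: pages that link to it} for an outlink mapping."""
    in_links = {page: [] for page in out_links}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)
    return in_links


def anti_trust_scores(out_links, spam_seeds, follow_prob=0.85, iterations=100):
    """Random walk that follows inlinks in reverse and jumps to spam seeds."""
    in_links = reverse_graph(out_links)
    pages = sorted(in_links)
    jump = {p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0) for p in pages}
    score = dict(jump)
    for _ in range(iterations):
        new_score = {p: (1.0 - follow_prob) * jump[p] for p in pages}
        for page in pages:
            sources = in_links[page]
            if sources:
                # Step backwards to a page that links to the current page.
                share = follow_prob * score[page] / len(sources)
                for source in sources:
                    new_score[source] += share
            else:
                # Assumed convention: a page with no inlinks returns to the seeds.
                for target in pages:
                    new_score[target] += follow_prob * score[page] * jump[target]
        score = new_score
    return score


# Hypothetical toy graph: page "x" is not a seed, but it links to spam page s1.
graph = {
    "g1": ["g2"], "g2": ["g3"], "g3": ["g1"],
    "s1": ["s2", "g1"], "s2": ["s1"], "x": ["s1"],
}
scores = anti_trust_scores(graph, spam_seeds={"s2"})
threshold = 0.01  # illustrative value only
flagged = {page for page, s in scores.items() if s > threshold}
print(sorted(flagged))
```

Page x is flagged even though it was never labeled, because it points to a page that in turn points to the spam seed. The normal pages receive no score, since none of them links to spam.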
Another system is referred to as a functional ranking system. It considers a general ranking function that depends on incoming paths of various lengths, weighted by a chosen damping function that decreases with distance. In other words, links from pages that are a greater distance from the subject web page are damped more heavily, and thus weighted less, than links from closer web pages. A spam page may gain an artificially high score under a system that simply ranks pages based on the number of links to them, because a spamming technique can give the spam page many incoming links from its immediate neighbor pages. However, spam pages of this type can be demoted using this system by choosing a damping function that ignores the direct contribution of links from pages directly adjacent to the given page, and that values only links that start at least one link away from the subject page.
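One simplified reading of such a ranking function is sketched below: the score of a page is the sum, over path lengths, of a damping weight times the probability mass that reaches the page along paths of that length. The names `functional_rank`, `geometric`, and `skip_direct`, the maximum path length, the constants, and the toy link farm are all assumptions made for illustration, not a definitive formulation.

```python
def functional_rank(out_links, damping, max_len=20):
    """Score pages by incoming path mass, weighted by damping(path_length)."""
    pages = sorted(out_links)
    n = len(pages)
    mass = {p: 1.0 / n for p in pages}            # paths of length 0
    score = {p: damping(0) * mass[p] for p in pages}
    for length in range(1, max_len + 1):
        next_mass = {p: 0.0 for p in pages}
        for page in pages:
            targets = out_links[page]
            for target in targets:
                # Extend every path ending at `page` by one link.
                next_mass[target] += mass[page] / len(targets)
        mass = next_mass
        for p in pages:
            score[p] += damping(length) * mass[p]
    return score


def geometric(length):
    """Damping that decreases steadily with path length."""
    return 0.15 * 0.85 ** length


def skip_direct(length):
    """Same damping, but links from directly adjacent pages count for nothing."""
    return 0.0 if length == 1 else geometric(length)


# Hypothetical toy graph: ten farm pages link straight to "spam", while
# "good" sits on a small cycle and is supported by longer paths.
graph = {f"farm{i}": ["spam"] for i in range(10)}
graph.update({"spam": [], "a": ["b"], "b": ["good"], "good": ["a"]})

plain = functional_rank(graph, geometric)
demoted = functional_rank(graph, skip_direct)
print(plain["spam"] > plain["good"], demoted["spam"] < demoted["good"])
```

With the geometric damping the link farm lifts the spam page above the good page. Once length-one paths are ignored, the spam page keeps only its length-zero contribution and drops below the good page.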
Yet another technology to be considered is general machine learning technology. In this technology, features must be selected that are useful in detecting spam, and each web page is then represented as a vector in which each element corresponds to one type of spam feature. The features can be the number of inlinks, the number of outlinks, scores under any of the above-mentioned algorithms, etc. Then, a classifier is chosen, such as a neural network, a decision tree, a support vector machine (SVM), etc., and it is trained with a set of examples of normal and spam web pages which have been judged by human experts. The trained classifier is then used to classify a given web page as spam or not spam (i.e., as spam or a content page). One difficulty with this methodology is that the effectiveness of a spam feature is generally validated only on web pages that are not sampled uniformly at random from the entire web, but are instead drawn from large websites and highly ranked web pages. Consequently, the trained classifier is biased toward those selected pages, and it does not generalize well to the entire web.
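The feature-vector approach described above can be sketched with a deliberately small classifier. A one-level decision tree (a decision stump) stands in here for the neural network, decision tree, or SVM mentioned above. The function name `train_stump`, the choice of three features, and every number in the training set are invented for illustration and do not come from real measurements.

```python
def train_stump(vectors, labels):
    """Fit a one-level decision tree: one feature, one cut, one direction."""
    best = None
    for f in range(len(vectors[0])):
        values = sorted({v[f] for v in vectors})
        # Candidate cuts lie midway between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            for spam_above in (True, False):
                errors = sum(
                    ((v[f] > cut) == spam_above) != label
                    for v, label in zip(vectors, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, cut, spam_above)
    _, f, cut, spam_above = best
    return lambda v: (v[f] > cut) == spam_above


# Invented toy data. Each vector is [inlinks, outlinks, trust score];
# a label of True means a human judge marked the page as spam.
vectors = [
    [12, 8, 0.30], [40, 15, 0.45], [7, 3, 0.22],          # normal pages
    [300, 290, 0.01], [150, 160, 0.00], [9, 200, 0.02],   # spam pages
]
labels = [False, False, False, True, True, True]

classify = train_stump(vectors, labels)
print(classify([20, 250, 0.05]), classify([25, 10, 0.40]))
```

The stump learns only what separates these six examples, which mirrors the difficulty noted above: a classifier trained on a non-representative sample may not generalize to pages that look different from its training set.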
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.