Search engines that enable computer users to obtain references to web pages that contain one or more specified words are now commonplace. Typically, a user can access a search engine by directing a web browser to a search engine “portal” web page. The portal page usually contains a text entry field and a button control. The user can initiate a search for web pages that contain specified query terms by typing those query terms into the text entry field and then activating the button control. When the button control is activated, the query terms are sent to the search engine, which typically returns, to the user's web browser, a dynamically generated web page that contains a list of references to other web pages that contain or are related to the query terms.
Usually, such a list of references will be ranked and sorted based on some criteria prior to being returned to the user's web browser. Web page authors are often aware of the criteria that a search engine will use to rank and sort references to web pages. Because web page authors want references to their web pages to be presented to users earlier and higher than other references in lists of search results, some web page authors are tempted to artificially manipulate their web pages, or some other aspect of the network in which their web pages occur, in order to artificially inflate the rankings of references to their web pages within lists of search results.
For example, if a search engine ranks a web page based on the value of some attribute of the web page, then the web page's author may seek to alter the value of that attribute of the web page manually so that the value becomes unnaturally inflated. For instance, a web page author might fill his web page with hidden metadata that contains words that are often searched for, but which have little or nothing to do with the actual visible content of the web page. For another example, a web page author might create many domains and generate links from those domains to his web page in order to artificially boost the number of links to his web page so that it appears to the search engine that his web page is popular. Such techniques are referred to as “spamming,” and websites that contain web pages created for such a purpose are referred to as “web spam.”
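The link-based manipulation described above can be illustrated with a minimal sketch. It assumes a deliberately naive ranking criterion (the number of distinct domains that link to a page), which is chosen only for illustration and is not asserted to be the criterion of any actual search engine. The function name `rank_by_inlinks` and all domain names are hypothetical.

```python
from collections import Counter


def rank_by_inlinks(links):
    """Return pages sorted by the number of distinct linking domains, highest first.

    `links` is an iterable of (source_domain, target_page) pairs. This is an
    assumed, simplified ranking criterion used only for illustration.
    """
    counts = Counter()
    seen = set()
    for source_domain, target_page in links:
        # Count each (domain, page) pair once, so repeated links from the
        # same domain do not add to the score.
        if (source_domain, target_page) not in seen:
            seen.add((source_domain, target_page))
            counts[target_page] += 1
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(counts, key=lambda page: (-counts[page], page))


# Links that arose without manipulation (hypothetical data).
organic_links = [
    ("news.example", "a.example/page"),
    ("blog.example", "a.example/page"),
    ("forum.example", "a.example/page"),
    ("news.example", "b.example/page"),
]

# The author of b.example/page creates many domains that all link to that page.
spam_links = [(f"farm{i}.example", "b.example/page") for i in range(10)]

before = rank_by_inlinks(organic_links)
after = rank_by_inlinks(organic_links + spam_links)
```

In this sketch, `a.example/page` ranks first when only the organic links are considered, while adding the ten manufactured domains moves `b.example/page` to the top even though nothing about its content has changed.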
When web page authors engage in these tactics, the perceived effectiveness of the search engine is reduced. References to web pages that have little or no actual “earned” merit (i.e., web spam) are sometimes pushed above references to web pages that users have previously found interesting or valuable for legitimate reasons. Thus, it is in the interests of those who maintain the search engine to “weed out,” from search results, references to web pages that are known to have been artificially manipulated in the manner discussed above. However, because there are so many web pages accessible through the Internet, and because the Internet is a dynamic entity, always in flux, manually examining and investigating every existing web page is a daunting and expensive, if not downright futile, task. Furthermore, web spam authors often “move” their web pages to a different domain in order to evade detection and continue their work, and thus it is difficult to track such authors.
What is needed is an automated way of identifying web pages that are likely to have been manipulated in a manner that artificially inflates rankings of references to those web pages within lists of search results.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.