Unsolicited content, often referred to as “spam,” is problematic in that large amounts of undesirable data are sent to and received by users over various electronic media including the World Wide Web (“web”). Spam can be delivered using e-mail or other electronic content delivery mechanisms, including messaging, the Internet, the web, or other electronic communication media. In the context of search engines, crawlers, bots, and other content discovery mechanisms, undesirable content on the web (“web spam”) is a growing problem, and a mechanism for its detection is needed. Search engines, therefore, have an incentive to weed out spam web pages, so as to improve the search experience of their customers.
For example, when a search is performed, all web pages that match the query may be listed in a results page. The results may include web pages whose content is of no value to a user and that were generated specifically to increase the visibility of a particular web site. Further, search engines rank pages using various parameters of the pages. One conventional ranking technique considers the inbound links of a page: search engines typically rank a page that has more inbound links higher than a page that has fewer. Some web sites, however, attempt to artificially boost their rankings in a search engine by creating spurious web pages that link to their home page, thereby generating significant amounts of unusable or uninteresting data for users. A further problem associated with web spam is that it can slow search engines and reduce the accuracy of their results.
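The effect of spurious inbound links can be illustrated with a minimal sketch. It assumes a deliberately simplified ranking that orders pages by raw inbound link count alone; real search engines combine many parameters. The function names and the `example` host names are hypothetical and serve only as illustration.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count inbound links per target page from (source, target) pairs."""
    counts = Counter()
    for source, target in links:
        if source != target:  # a page linking to itself is not an endorsement
            counts[target] += 1
    return counts


def rank_by_inbound_links(links):
    """Order pages by descending inbound link count (ties broken by name)."""
    counts = inbound_link_counts(links)
    return sorted(counts, key=lambda page: (-counts[page], page))


# Hypothetical link graph: two organic links to a shop, one to a spam site.
organic_links = [
    ("news.example/a", "shop.example"),
    ("blog.example/b", "shop.example"),
    ("blog.example/b", "spam.example"),
]

# Hypothetical spurious pages created only to link to the spam site.
farm_links = [(f"farm{i}.spam.example", "spam.example") for i in range(5)]

print(rank_by_inbound_links(organic_links))
print(rank_by_inbound_links(organic_links + farm_links))
```

With only the organic links, the shop page ranks first; once the five spurious pages are added, the spam site overtakes it even though no additional human-authored page endorses it.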
Search engines have taken pivotal roles in web surfers' lives: most users have stopped maintaining lists of bookmarks, and are instead relying on search engines such as Google, Yahoo! or MSN Search to locate the content they seek. Consequently, commercial web sites are more dependent than ever on being placed prominently within the result pages returned by a search engine. In fact, high placement in a search engine is one of the strongest contributors to a commercial web site's success.
For these reasons, a new industry of “search engine optimizers” (SEOs) has sprung up. Search engine optimizers promise to help commercial web sites achieve a high ranking in the result pages to queries relevant to a site's business, and thus experience higher traffic by web surfers.
In the best case, search engine optimizers help web site designers generate content that is well-structured, topical, and rich in relevant keywords or query terms. Unfortunately, some search engine optimizers go well beyond producing relevant pages: they try to boost the rankings of a web site by loading pages with a wide variety of popular query terms, whether relevant or not. In fact, some SEOs go one step further: instead of manually creating pages that include unrelated but popular query terms, they machine-generate many such pages, each of which contains some monetizable keywords (i.e., keywords that have a high advertising value, such as the names of pharmaceuticals, credit cards, or mortgages). Many small endorsements from these machine-generated pages result in a sizable page rank for the target page. In a further escalation, SEOs have started to set up DNS servers that will resolve any host name within their domain, and typically map it to a single IP address.
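The DNS behavior described above can be sketched with a toy resolver. This is an illustration under stated assumptions, not a description of any particular DNS server: the tables `WILDCARD_ZONES` and `EXACT_RECORDS`, the functions `resolve` and `hosts_per_ip`, and all host names and addresses (taken from ranges reserved for documentation) are hypothetical.

```python
from collections import defaultdict

# Hypothetical zone that answers for ANY host name within the domain.
WILDCARD_ZONES = {"spam.example": "203.0.113.7"}

# Hypothetical ordinary zone with one explicitly configured host.
EXACT_RECORDS = {"www.shop.example": "198.51.100.20"}


def resolve(host):
    """Return the IP address for a host name, or None if it does not resolve."""
    if host in EXACT_RECORDS:
        return EXACT_RECORDS[host]
    for domain, ip in WILDCARD_ZONES.items():
        # Any name at or below the wildcard domain maps to the same address.
        if host == domain or host.endswith("." + domain):
            return ip
    return None


def hosts_per_ip(hosts):
    """Group the host names that resolve by the IP address they map to."""
    groups = defaultdict(set)
    for host in hosts:
        ip = resolve(host)
        if ip is not None:
            groups[ip].add(host)
    return dict(groups)


observed = [
    "cheap-mortgages.spam.example",
    "credit-cards.spam.example",
    "any.made.up.name.spam.example",
    "www.shop.example",
    "missing.shop.example",
]
print(hosts_per_ip(observed))
```

In this sketch every invented name under the wildcard domain resolves to one address, so an unbounded number of distinct host names can be backed by a single machine, while an unconfigured name in the ordinary zone does not resolve at all.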
Most if not all of the SEO-generated pages exist solely to mislead a search engine into directing traffic towards the “optimized” site; in other words, the SEO-generated pages are intended only for the search engine, and are completely useless to human visitors.
In view of the foregoing, there is a need for systems and methods that overcome such deficiencies.