The Internet has vast amounts of information distributed over a multitude of computers, hence providing users with large amounts of information on various topics. This is also true for a number of other communication networks, such as intranets and extranets. Although large amounts of information may be available on a network, finding the desired information is usually not easy or fast.
Search engines have been developed to address the problem of finding desired information on a network. Typically, a user who has an idea of the type of information desired enters one or more search terms into a search engine. The search engine then returns a list of network locations (e.g., uniform resource locators (URLs)) that the search engine has determined to include an electronic document relating to the user-specified search terms. Many search engines also provide a relevance ranking. A typical relevance ranking is a relative estimate of the likelihood that an electronic document at a given network location is related to the user-specified search terms in comparison to other electronic documents. For example, a conventional search engine may provide a relevance ranking based on the number of times a particular search term appears in an electronic document or on its placement in the electronic document (e.g., a term appearing in the title is often deemed more important than one appearing at the end of the electronic document). In addition, link analysis has become a powerful technique in ranking web pages and other hyperlinked documents. Anchor-text analysis, web page structure analysis, the use of a key term listing, and the URL text are other techniques used to provide a relevance ranking.
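By way of illustration only, the term-count and term-placement ranking described above may be sketched as follows. This is a minimal sketch and not the ranking function of any actual search engine. The function names, the document fields ("url", "title", "body"), and the title_weight multiplier are assumptions introduced for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def relevance_score(title, body, search_terms, title_weight=3.0):
    """Score one document by counting exact search-term occurrences.

    A hit in the title counts title_weight times as much as a hit in the
    body. The value 3.0 is an assumed, illustrative weight.
    """
    title_tokens = tokenize(title)
    body_tokens = tokenize(body)
    score = 0.0
    for term in search_terms:
        term = term.lower()
        score += title_weight * title_tokens.count(term)
        score += body_tokens.count(term)
    return score

def rank(documents, search_terms):
    """Return document URLs ordered from highest to lowest score."""
    scored = [
        (relevance_score(doc["title"], doc["body"], search_terms), doc["url"])
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored]
```

Because such a score depends only on content that the creator of the electronic document controls, it is straightforward to inflate by repeating a term.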
Creators of electronic documents often complicate the problem of relevance ranking through deliberate efforts to present their electronic documents to a user. For example, some creators attempt to induce a search engine to generate higher rank figures for their electronic documents than may otherwise be warranted. Deliberate manipulation of an electronic document by its creator in an attempt to achieve an undeservedly high rank from a search engine is generally referred to as search engine spamming. The goal of a search engine spam is to deceitfully induce a user to visit a manipulated electronic document. One form of manipulation includes putting hundreds of key terms in an electronic document (e.g., in meta tags of the electronic document) or utilizing other techniques to confuse a search engine into overestimating (or even incorrectly identifying) the relevance of the electronic document with respect to one or more search terms. For example, a creator of a classified advertising web page for automobiles may fill the “key term” section with repetitions of the term “car.” The creator does this so that a search engine will identify that web page as being more relevant whenever a user searches for the term “car.” But a “key term” section that more accurately represents the subject matter of the web page may include the terms “automobile,” “car,” “classified,” and “for sale.”
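The contrast between the two “key term” sections in the example above may be made concrete by measuring how much of a key term listing is taken up by a single repeated term. The following is a hypothetical heuristic offered only as a sketch; the function name and the idea of using the repetition fraction as a signal are assumptions of this example and are not attributed to any particular search engine.

```python
from collections import Counter

def key_term_repetition(key_terms):
    """Return (most_repeated_term, fraction_of_listing) for a key term listing.

    A fraction near 1.0 means that the listing is dominated by one term.
    An empty listing yields (None, 0.0).
    """
    normalized = [term.strip().lower() for term in key_terms if term.strip()]
    if not normalized:
        return None, 0.0
    term, count = Counter(normalized).most_common(1)[0]
    return term, count / len(normalized)
```

Under this sketch, a listing made up solely of repetitions of “car” yields a fraction of 1.0, whereas the four-term listing “automobile,” “car,” “classified,” and “for sale” yields 0.25.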
Some other techniques to create search engine spam include returning a different electronic document to a search engine than to an actual user (i.e., a cloaking technique), targeting a key term unrelated to an electronic document, putting a key term in an area where a user will not see it to increase key term count, putting a link in an area where a user will not see it to increase link popularity, producing a low-quality doorway web page, deceitfully redirecting a user from a highly ranked electronic document to an irrelevant electronic document to present the irrelevant electronic document to the user, and so on. The result is that a search engine provides a user who runs a query with a highly ranked electronic document that may not be truly relevant. Thus, the search engine does not protect the user against such deliberate ranking manipulation.
Existing search engines attempt to prevent search engine spam by separately analyzing each spam technique to identify a pattern of a manipulated electronic document. When such search engines detect an electronic document that has the identified pattern, the search engines label the electronic document as spam to avoid presenting the electronic document to a user in a search result or to demote the result. For example, a particular search engine may label an electronic document that is primarily built for the search engine rather than for an end-user as a search engine spam. Similarly, a search engine may detect hidden text and/or a hidden link in an electronic document and label this electronic document as a search engine spam. Some search engines may also detect a web site that has numerous unnecessary host names (e.g., poker.foo.com, blackjack.foo.com, etc.) or that has excessive cross-links used to artificially inflate the web site's apparent popularity and label this web site as a search engine spam. Moreover, existing search engines may detect a web site that employs a cloaking technique or link farming, by which the web site exchanges a reciprocal link with another web site to improve its search engine ranking.
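As one illustration of such a per-technique pattern, the host name check mentioned above may be sketched by counting the distinct host names observed under each web site. This is a simplified sketch under stated assumptions: the function names and the max_hosts threshold are hypothetical, and the web site is approximated by the last two labels of the host name, which is not accurate for domains such as those ending in co.uk.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def host_names_per_site(urls):
    """Map each site (last two host labels) to its count of distinct host names."""
    sites = defaultdict(set)
    for url in urls:
        host = urlsplit(url).hostname
        if not host:
            continue  # skip URLs that carry no host name
        host = host.lower()
        # Assumption: the site is the last two labels, e.g. "foo.com".
        site = ".".join(host.split(".")[-2:])
        sites[site].add(host)
    return {site: len(hosts) for site, hosts in sites.items()}

def sites_with_many_hosts(urls, max_hosts=2):
    """Return, in sorted order, the sites seen with more than max_hosts host names."""
    counts = host_names_per_site(urls)
    return sorted(site for site, n in counts.items() if n > max_hosts)
```

A check of this kind addresses only the single pattern for which it was written, which is consistent with the separate, technique-by-technique analysis described above.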
In contrast to a search engine spam, an electronic mail (or e-mail) spam is an unsolicited e-mail message usually sent to many recipients at a time. An e-mail spam is the electronic equivalent of junk mail. In most cases, the content of an e-mail spam message is not relevant to the interests of the recipient. Thus, creating an e-mail spam is an abuse of the Internet to distribute a message to a huge number of people at a minimal cost.
An e-mail spam is distinguished from a search engine spam in a number of ways. For example, a program may automatically generate an e-mail message for sending an e-mail spam to a large number of recipients. In contrast, a search engine spam does not involve an e-mail address, a sender, or a recipient. But a search engine spam nonetheless shares certain characteristics with an e-mail spam. For example, both search engine spam and e-mail spam are undesirable in that they are both created to deceitfully draw a user to a particular product or service. Accordingly, a creator of an e-mail spam often also generates a search engine spam to increase the exposure of one or more electronic documents relating to a product or service. That is, spammers often rely on both e-mail spam and search engine spam to market a product or service. As such, there is generally a strong correlation between e-mail spam and search engine spam. Nevertheless, prior art systems and methods have overlooked this correlation between the possible sources of e-mail spam and search engine spam. Specifically, the prior art treats e-mail spam and search engine spam as separate problems requiring entirely different solutions.
Accordingly, a solution that effectively identifies and prevents search engine spam is desired.