This specification relates to providing search results.
Search engines—and, in particular, Internet search engines—aim to identify resources (e.g., web pages, images, text documents, processes, multimedia content) that are relevant to a user's needs and to present information about the resources in a manner that is most useful to the user. In response to a query submitted by a user, search engines return search results referring to resources identified as relevant to or matching the query.
Spammers typically generate gibberish content such that the search engine returns an identification of resources associated with the gibberish content as relevant to the submitted query. Gibberish content refers to resource content that is likely to represent spam content. For example, gibberish content can include text sequences that are unlikely, based on specified criteria, to represent natural language text strings (e.g., conversational syntax) or to represent text strings that, while not structured in conversational syntax, typically occur in resources (e.g., in web documents). As a further example, a spammer can generate, as gibberish content, a web page that includes a number of high-value keywords such that the search engine will identify the web page as highly relevant. Gibberish resources can be generated in a number of ways, for example, by using low-cost untrained labor, by scraping content and then modifying and splicing it randomly, or by translating content from a different language.
The spammer can generate revenue from the traffic to the gibberish web page by including, for example, advertisements, pay-per-click links, and affiliate programs. Moreover, because the gibberish web page is generated using high-value keywords without context, the web page typically does not provide any useful information to a user.