The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
A search engine is a combination of integrated software components (including data) and an allocation of computational resources, such as memory, a node, and processes on a computer or multiple computers for executing the integrated software components, where the combination of the software and computational resources is dedicated to searching a set of information resources. Search engines generate search results for queries submitted to them. Search engines are widely used on the Internet, the World Wide Web (www, Web, WWW, etc.), and other large internetworks and information resource webs. Often, search engines are publicly accessible as web sites, such as those made available by Yahoo™ and Google™, which are respectively accessible at (http://search.yahoo.com/) and (http://www.google.com/).
The information resources searched by search engines are referred to herein as documents. A document is any unit of information that may be indexed in search engine indexes, which are described below. Often a document is a file that may contain plain or formatted text, inline graphics, other multimedia data, and hyperlinks to other documents. A document may conform to XML (Extensible Markup Language, as promulgated by the World Wide Web Consortium), HTML (Hypertext Markup Language), or another public or private standard (e.g., PDF, Portable Document Format by Adobe™, or MS Word by Microsoft™). Documents may be static or dynamically generated.
Search engines use a search engine index (or more than one index), also referred to herein simply as an index, to search for documents. Search engine indexes can be directories, in which content is indexed more or less manually to reflect human observation. More typically, search engine indexes are created and maintained automatically by processes referred to herein as crawlers. Crawlers explore information over the Internet, essentially continuously, looking for as many documents as they can find at the locations they are configured to search. Crawlers may follow links from one document to another, index the content of the documents they find (e.g., semantically, conceptually, etc.) in a search index, and summarize the documents in databases, typically of significant size. It is these indexes and databases that are actually searched in response to a search query.
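The crawl-and-index behavior described above can be illustrated with a minimal sketch. It is not any particular search engine's implementation. The function name `crawl_and_index`, the in-memory `pages` mapping that stands in for the Web, and the simple term-to-document index are all assumptions made for illustration.

```python
from collections import defaultdict, deque
import re


def crawl_and_index(pages, seeds):
    """Follow links breadth-first from the seed locations and index each document.

    `pages` is a hypothetical in-memory stand-in for the Web:
    it maps a URL to a (text, list_of_linked_urls) pair.
    Returns an index mapping each term to the set of URLs containing it.
    """
    index = defaultdict(set)
    seen = set(seeds)
    frontier = deque(seeds)
    while frontier:
        url = frontier.popleft()
        if url not in pages:
            continue  # link points to a document that cannot be fetched
        text, links = pages[url]
        # Index the document's content by its lowercase alphanumeric terms.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(url)
        # Follow links from this document to documents not yet visited.
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index


pages = {
    "p1": ("Camera reviews", ["p2"]),
    "p2": ("Camera store", ["p1", "p3"]),
    "p4": ("Orphan page", []),
}
index = crawl_and_index(pages, ["p1"])
```

In this sketch, a document that no crawled document links to (here `p4`) never enters the index, which reflects the fact that it is the index, not the Web itself, that is searched when a query arrives.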
The search result generated by a search engine comprises a list of documents and may contain summary information about each document. The list of documents may be ordered. To order a list of documents, a search engine may assign a rank to each document in the list. When the list is sorted by rank, a document with a relatively higher rank may be placed closer to the head of the list than a document with a relatively lower rank. A search engine may rank the documents according to relevance to the search query. Relevance is a measure of how closely the subject matter of a document matches a search query's terms. The inclusion of a document within the search results generated by a search engine for a query is referred to herein as document recall.
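Recall and ranking as described above can be sketched as follows. The relevance measure used here, the fraction of distinct query terms that appear in a document, is an assumption chosen for brevity and is not a measure attributed to any actual search engine. The names `tokenize`, `relevance`, and `search` are likewise hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(query, document_text):
    """Toy relevance: fraction of distinct query terms found in the document."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    document_terms = set(tokenize(document_text))
    return len(query_terms & document_terms) / len(query_terms)


def search(query, documents):
    """Recall documents with nonzero relevance, highest rank first.

    `documents` maps a document identifier to its text.
    Ties are broken by identifier so the ordering is deterministic.
    """
    scored = [(relevance(query, text), doc_id) for doc_id, text in documents.items()]
    recalled = [(score, doc_id) for score, doc_id in scored if score > 0]
    recalled.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in recalled]


documents = {
    "a": "Digital camera reviews and sample images",
    "b": "Buy a camera here",
    "c": "Gardening tips",
}
results = search("digital camera reviews", documents)
```

A measure this simple rewards any document that merely contains the query terms, whatever the document is actually about, so it should be read only as an illustration of how rank orders a result list.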
Various nefarious techniques, referred to as search engine spamming, are used to trick search engines into recalling documents and inflating their rank. The techniques generally trick search engine ranking algorithms into recalling and highly ranking documents that contain, for example, sponsored links to a web merchant. The higher ranking increases exposure of such documents to search engine users and can ultimately lead to more revenue for search engine spammers. As a result, some of the most highly ranked results for search engine queries are documents whose content is largely irrelevant to the queries and to what search engine users want. Such results are referred to herein as search engine spam.
A typical example of search engine spam arises when a user searches for the terms “digital camera reviews” and expects to find pages that review various models of digital cameras, detailing performance specifications, sample images, and reviewers' lists of pros and cons. With this expectation, the user clicks on a link for one of the results and is instead led to a page that contains nothing but a plethora of keywords and links to stores where the user can buy a camera. This trickery translates to a poor user experience and leads to an adverse judgment of search engine performance.
Many webmasters may legitimately wish that some content of a page not be indexed by search engines because the content has no relation to the intended focus of the page. A solution that could address this need is to allow a webmaster to designate which sections of the page should not be indexed. However, this opens the door to nefarious techniques for hiding search engine spam. Clearly, there is a need for mechanisms that prevent the hiding of search engine spam yet allow webmasters to designate page content that should not be indexed.