A search engine or search engine program is a widely used mechanism for allowing users to search vast numbers of documents for information. Automated search engines locate websites by matching terms from a user-entered search query to an indexed corpus of web pages. A conventional network search engine, such as the Google™ search engine, returns a result set in response to the search query submitted by the user. The search engine performs the search based on a conventional search method. For example, one known method, described in an article entitled “The Anatomy of a Large-Scale Hypertextual Search Engine,” by Sergey Brin and Lawrence Page, assigns a degree of importance to a document, such as a web page, based on the link structure of the web page. The search engine ranks or sorts the individual articles or documents in the result set based on a variety of measures, such as the number of times the search terms appear in the document and the number of documents that contain a link to the document. A search result set comprising a ranked list of documents with a link to each document can be returned to the user.
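By way of illustration only, the two ranking measures named above (the number of times the search terms appear in a document and the number of documents that link to it) can be combined in a minimal sketch such as the following. The function name rank_documents, the constant LINK_WEIGHT, and the simple additive score are assumptions made for this example; they are not taken from the cited article and do not represent the method of any particular search engine.

```python
from collections import Counter

# Hypothetical weight given to each inbound link relative to one term match.
LINK_WEIGHT = 1.0


def rank_documents(query, documents, links):
    """Rank documents by term occurrences plus a weighted inbound-link count.

    documents: dict mapping a document id to its text.
    links: iterable of (source_id, target_id) pairs, one per hyperlink.
    Returns the ids of matching documents, highest score first.
    """
    terms = query.lower().split()

    # Count how many distinct documents link to each target.
    inbound = Counter()
    for source, target in set(links):
        if source != target:  # ignore self-links
            inbound[target] += 1

    scored = []
    for doc_id, text in documents.items():
        words = Counter(text.lower().split())
        term_hits = sum(words[t] for t in terms)
        if term_hits == 0:
            continue  # a document with no query terms is not a match
        scored.append((term_hits + LINK_WEIGHT * inbound[doc_id], doc_id))

    # Sort by descending score; break ties by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```

Because both measures in this sketch depend only on document text and on links that any publisher can create, a score of this kind can be raised artificially.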
Publishers or authors of a document, such as a web page or a dynamically generated web page, can use a variety of techniques to manipulate the document to increase the ranking of the document by a search engine. When a manipulated document receives a high ranking, a user is more likely to click on it in the search results. Examples of manipulation techniques include: using the domain name of a once-legitimate document; filling the text of the document, or the anchor text associated with links in the document, with certain popular query terms; automatically creating links from other documents to the manipulated document; and presenting a different document to the web crawler than to the users. These manipulated documents can be referred to as spam. When a user receives a manipulated document in the search results and clicks on the link to go to the manipulated document, the document is very often an advertisement for goods or services unrelated to the search query or a pornography website, or the manipulated document automatically forwards the user to a website unrelated to the user's query. A user's search experience can be degraded if the search engine returns a search result set containing manipulated documents.
Manually determining whether documents are manipulated is extremely time-intensive. Conventional methods exist for automatically identifying signals indicating a manipulated document. These conventional methods are performed on a document-by-document basis, and the signals for an individual document are often too weak to indicate reliably that the document is manipulated.
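As a hedged illustration of what a document-by-document signal might look like, the sketch below computes the fraction of a document's words that are popular query terms, which relates to the term-filling technique described above. The names query_term_density and looks_manipulated, the list of popular terms, and the threshold value are hypothetical choices for this example and do not describe any specific conventional method.

```python
def query_term_density(text, popular_terms):
    """Return the fraction of words in text that are popular query terms."""
    words = text.lower().split()
    if not words:
        return 0.0
    popular = {term.lower() for term in popular_terms}
    hits = sum(1 for word in words if word in popular)
    return hits / len(words)


def looks_manipulated(text, popular_terms, threshold=0.5):
    """Flag a single document when its density reaches a hypothetical threshold.

    The decision uses only this one document, so a document that is
    manipulated in a less obvious way can fall below the threshold.
    """
    return query_term_density(text, popular_terms) >= threshold
```

A document whose density falls just below the chosen threshold is not flagged, even if it is manipulated, which is one way a signal evaluated for a single document can be too weak to be conclusive.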
Thus, a need exists to identify manipulated documents so that they can be excluded from search results or given a lower ranking in search results.