The present invention relates generally to search systems and more particularly to search systems that rank search hits in a result set.
Searching is useful where an entire corpus cannot be absorbed and an exact pointer to the desired items is unavailable or cannot be provided. In general, searching is the process of formulating or accepting a search query, determining a set of matching documents from a corpus of documents, and returning the set, or some subset of it if the set is too large. As a specific example, to which this disclosure is not limited, consider searching the set of hyperlinked documents referred to as the “Web”. The corpus contains many searchable items, referred to herein as pages or, more generically, documents. A search engine identifies documents from the corpus that match a search query, typically using an index generated in advance of receipt of the search query. A “match” can mean many things, and a search query can take various forms. Commonly, a search query is a string comprising one or more words or terms, and a match occurs when a document includes one or more (or all) of the words or terms from the search query string. Each matching document is referred to as a hit, and the set of hits is referred to as the result set or the search results. The corpus can be a database, another data structure, or unstructured data. The documents are often Web pages.
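By way of illustration only, the term-matching behavior described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the index or matching logic of any particular search engine: the names build_index, search, and require_all, the whitespace tokenization, and the three-document corpus are all hypothetical.

```python
from collections import defaultdict


def build_index(corpus):
    """Map each lowercase term to the set of document ids that contain it.

    The index is built in advance, before any query is received.
    """
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query, require_all=False):
    """Return the result set (the set of hits) for a query string.

    With require_all=False a document matches if it contains at least one
    query term; with require_all=True it must contain every query term.
    """
    terms = query.lower().split()
    if not terms:
        return set()
    # index.get avoids creating empty entries for unknown terms.
    postings = [index.get(term, set()) for term in terms]
    if require_all:
        return set.intersection(*postings)
    return set.union(*postings)


# Hypothetical toy corpus, for illustration only.
corpus = {
    "doc1": "cheap flights to paris",
    "doc2": "paris travel guide",
    "doc3": "gardening tips",
}
index = build_index(corpus)
any_hits = search(index, "paris flights")
all_hits = search(index, "paris flights", require_all=True)
```

In this sketch, the "any term" query yields a larger result set than the "all terms" query, which reflects the two senses of "match" mentioned above.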
A typical index of Web pages contains billions of entries, so a common search might have a result set comprising millions of pages. In such situations, the search engine might have to constrain the result set further so that what is returned to the querier (typically a human computer user, although it need not be) is of a reasonable size. One approach to constraining the set is to present the search results in an order, on the assumption that the user will read or use only a small number of hits that appear high in the ordered search results.
Because of this assumption, many Web page authors desire that their pages appear high in the ordered search results. A search engine relies on various features of the relevant pages to select and return only the highest quality ones. Since top positions (high ranking) in a query result list may confer business advantages, authors of certain Web pages attempt to maliciously boost the ranking of their pages. Such pages with artificially boosted ranking are called “web spam” pages and are collectively known as “web spam.”
There are a variety of techniques associated with web spam. One is to make a Web page artificially appropriate for being selected by many queries. This can be achieved by augmenting a page with massive numbers of terms that are unrelated to its essential content and are rendered in small or invisible fonts. Such augmentation makes a page more exposed (i.e., potentially relevant to more queries), but does not truly improve its relevance for any particular query. To raise apparent relevance, authors of spam use another technique: they add to a page many incoming (hyper)links, also called inlinks, based on the observation that pages more frequently referenced by others are generally considered by search engines to be preferable (of higher relevance). It is difficult to distinguish between genuinely high-quality pages that are referenced by many others because of their superior value and web spam with many inlinks.
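The effect of added inlinks can be illustrated with a deliberately simplified sketch. It assumes, purely for illustration, that hits are ordered by a raw count of distinct inlinks; actual search engines use more elaborate link-based measures. The names inlink_counts and rank_by_inlinks, the page identifiers, and the link data are hypothetical.

```python
from collections import Counter


def inlink_counts(links):
    """Count distinct incoming links per page from (source, target) pairs.

    Duplicate pairs and self-links are ignored.
    """
    counts = Counter()
    for source, target in set(links):
        if source != target:
            counts[target] += 1
    return counts


def rank_by_inlinks(hits, counts):
    """Order hits by descending inlink count, breaking ties by page name."""
    return sorted(hits, key=lambda page: (-counts[page], page))


# Hypothetical link data: three independent pages reference "good".
organic_links = [("a", "good"), ("b", "good"), ("c", "good")]
# A spam author creates five pages whose only purpose is to link to "spam".
farm_links = [(f"farm{i}", "spam") for i in range(5)]

hits = ["good", "spam", "other"]
before = rank_by_inlinks(hits, inlink_counts(organic_links))
after = rank_by_inlinks(hits, inlink_counts(organic_links + farm_links))
```

Under this assumed measure, the fabricated inlinks alone move the spam page ahead of the legitimately referenced page, and the counts themselves give no indication of which links were earned.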
Identification of web spam pages and their subsequent demotion in a search result list is important for maintaining or improving the quality of answers produced by a search engine. Thus, web spam detection is a useful task for a search engine. Human editors are frequently employed to identify web spam by verifying large numbers of pages present in the search engine index, but such manual review is often impractical.
Therefore, there is a need for improved search processing that overcomes web spam and provides search results that are more in line with what users want than with the manipulations of document authors.