Where information is stored in highly structured forms, searching follows well-defined rules. For example, if information about customers is stored in an orders database with each unique customer assigned a unique customer number and each unique item assigned a part number, all of the customers who ordered a particular item can be identified by issuing a command to a database manager of the form “table=orders with item-ID=item1 output customer-ID”. However, where information is less structured, searching is still possible but more complex. Searching is essential where the user cannot be expected to review the entire set of information looking for what is of interest.
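The structured lookup described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table layout, the column names `customer_id` and `item_id`, the helper `customers_who_ordered` and the sample rows are all assumptions made for the example, not part of any particular database manager.

```python
import sqlite3

def customers_who_ordered(conn, item_id):
    """Return the distinct customer IDs that have an order for item_id, sorted."""
    rows = conn.execute(
        "SELECT DISTINCT customer_id FROM orders "
        "WHERE item_id = ? ORDER BY customer_id",
        (item_id,),
    ).fetchall()
    return [row[0] for row in rows]

# Hypothetical in-memory orders table with a few sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id TEXT, item_id TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("c1", "item1"), ("c2", "item2"), ("c3", "item1"), ("c1", "item1")],
)

print(customers_who_ordered(conn, "item1"))
```

Because every row has the same well-defined fields, the query either matches a row or it does not; no judgment about relevance is needed.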
For example, the information might be in the form of unstructured documents. There are many well-known techniques for searching a corpus of documents, where a corpus is some defined set of units of information each referred to as a “document”. A common approach is to index the corpus to arrive at a reverse index (also called an inverted index) in which each word (with “stop words” often omitted) is stored with a list of the documents (and possibly the locations in those documents) that contain the word. A search engine then accepts queries from users (which can be human users using an input device or might be a computer or automated process supplying the queries), consults the index and returns a result set comprising one or more “hits”, wherein a hit is a document in the corpus that is deemed responsive to the query. The result set might comprise the documents themselves, summaries of the documents, and/or references or pointers (such as URLs) to the documents.
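A minimal sketch of such an index follows. The tokenizer, the short stop-word list, the function names and the two sample documents are assumptions chosen for illustration; a production indexer would differ in many respects.

```python
import re
from collections import defaultdict

# Illustrative stop-word list only; real lists are longer and language-specific.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(corpus):
    """Map each non-stop word to {doc_id: [word positions in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for position, word in enumerate(tokenize(text)):
            if word in STOP_WORDS:
                continue
            index[word].setdefault(doc_id, []).append(position)
    return index

# Hypothetical two-document corpus.
corpus = {
    "doc1": "World facts and figures",
    "doc2": "The weather of the world",
}
index = build_index(corpus)
print(index["world"])
```

Positions are counted over all words, including stop words, so that a stored location still corresponds to the word's place in the original document.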
Of course, an ideal search engine only returns documents that are in fact responsive to the query, but a search engine cannot always be perfect and thus may return hits that the search engine deems responsive to the query (i.e., that match the request represented by the query), but that are not, in the user's opinion, responsive. In some instances, the search engine returns a result set that is exactly responsive. For example, where the query is a Boolean expression “world AND facts BUT NOT weather” and the index is fully up-to-date, a search engine can return exactly the result set of all documents having the words “world” and “facts” that do not also have the word “weather” in them. Unfortunately, search engines that are limited to strict Boolean queries are not that useful where there are large numbers of documents, created in an uncontrolled fashion without a “clean up” process in advance of indexing. Furthermore, users often prefer to provide less structured queries, leaving the search engine to compute possible intents and alter search results accordingly. As just one example, if there were a document labeled “world fact” that did not mention weather, the above-mentioned search engine would miss that document, as it would be looking only for the exact string “facts”.
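The strict Boolean behavior, including the missed “world fact” document, can be illustrated as follows. For brevity this sketch scans each document directly rather than consulting an index, and the function names and three sample documents are hypothetical.

```python
import re

def words_in(text):
    """Return the set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def boolean_search(corpus, required, excluded):
    """Return IDs of documents containing every required word and no excluded word."""
    hits = []
    for doc_id, text in corpus.items():
        words = words_in(text)
        has_all = all(word in words for word in required)
        has_excluded = any(word in words for word in excluded)
        if has_all and not has_excluded:
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical corpus: doc3 is about the same topic but uses the singular "fact".
corpus = {
    "doc1": "World facts for travelers",
    "doc2": "World facts and weather reports",
    "doc3": "World fact",
}

# Equivalent of the query: world AND facts BUT NOT weather
hits = boolean_search(corpus, required=["world", "facts"], excluded=["weather"])
print(hits)
```

Only doc1 is returned: doc2 is excluded because it mentions weather, and doc3 is missed because exact matching treats “fact” and “facts” as different words.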
In the general case, searching involves receiving a search query, which might be a string of text or a more structured data object; possibly adding modifiers from context, such as user demographics, time of day or previous queries; determining from that query object a set of documents from a corpus of documents that are deemed to match the query; and returning a result set (or part of a result set if the set is too large).
One heavily searched corpus is the collection of documents stored on servers accessible through the Internet (a global internetwork of networks) and/or similar networks. The documents might be accessible via a variety of protocols (HTTP, FTP, etc.) in a variety of forms (images, HTML, text, structured documents, etc.). This particular corpus presents several difficulties. For one, the number of documents is not known, as there is no central authority that tracks the number of servers and the documents stored on those servers. For another, those documents are constantly changing. Yet another difficulty is the sheer number of documents. With so many documents available, a typical result set is larger than would be of interest to a typical user. For example, in response to the query “recent sports scores”, a search engine with a respectable index would return a result set of several hundred thousand hits. Thus, a typical result set can be assumed to be too large for its entirety to be of use to the user.
A user cannot be expected to review hundreds of thousands of documents in response to a search query. As a result, the typical search engine will return only a small set (e.g., four, ten, a page full, one hundred, etc.) of results and provide the user the ability to examine more hits than just the initial set, such as by pressing a button or scrolling down. Since users may expect to find an acceptable hit in the first page of search results, it is good for a search engine to be able to rank the hits and present the hits deemed most relevant first. The result set might also be filtered using filter criteria so that some documents that would otherwise be part of the result set are eliminated.
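Filtering, ranking and paging of a result set can be sketched as below. The relevance scores are simply assumed to have been computed elsewhere, and the function names, the page size and the sample hits are hypothetical.

```python
def filter_hits(scored_hits, keep):
    """Drop hits whose document ID fails the filter criterion `keep`."""
    return [(doc_id, score) for doc_id, score in scored_hits if keep(doc_id)]

def rank_hits(scored_hits):
    """Order document IDs by descending score, breaking ties by document ID."""
    ordered = sorted(scored_hits, key=lambda hit: (-hit[1], hit[0]))
    return [doc_id for doc_id, _ in ordered]

def page_of(ranked, page, page_size=10):
    """Return one zero-based page of the ranked list (empty if past the end)."""
    start = page * page_size
    return ranked[start:start + page_size]

# Hypothetical hits as (document ID, relevance score) pairs.
scored = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.9), ("e", 0.1)]

ranked = rank_hits(scored)
first_page = page_of(ranked, page=0, page_size=2)
print(first_page)

# Same ranking with document "d" eliminated by a filter criterion.
ranked_filtered = rank_hits(filter_hits(scored, keep=lambda doc_id: doc_id != "d"))
print(ranked_filtered)
```

Only the first page is presented initially; later pages are produced on request by calling `page_of` with a higher page number.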
Ranking a result set before it is displayed ensures that the higher rated documents are presented first. This leads to the question of what constitutes a high ranking. Naturally, if someone has set up pages for an e-commerce site and hopes to bring in large amounts of traffic, they would consider that their pages are the highest rated or should be the highest rated, regardless of searcher intent. By contrast, searchers who are not searching in order to make a purchase transaction would actually consider such pages to be quite irrelevant and would want those pages ranked lower. Thus, different entities have different views of how documents in a result set should be ranked.
Some businesses known as “search engine optimizers” or SEOs offer a service wherein they advise their customers how to increase the rankings of the customer's web pages, to increase visibility of those pages among searchers in hopes of increasing traffic and therefore sales. Some less than honorable SEOs might advise the use of web spam techniques, wherein false or misleading information is placed in the path of a search engine's crawler that would fool the search engine into thinking that the customer's web pages are more important than they really are, in hopes of being ranked higher. One approach to up-ranking pages is to add irrelevant words to invisible portions of a web page to ensnare more search queries. Another approach is to create a large number of dummy pages (often collectively referred to as a “web spam farm”) that all mention a target page in hopes that a search engine, noting all of those mentions, will up rank the target page.
In the face of these techniques, and since the typical patron of a search engine wants results unbiased by the efforts of SEOs and those who would artificially increase their rankings, search engine operators try to counter those efforts. Some have set up automated systems to detect this artificial inflation of rankings (sometimes referred to as “web spam”). Search engine operators do intervene manually, for example, if someone complains that someone is generating web spam or that their own pages are being unfairly down-ranked, but the operators have limited capacity and often are not focused on these requests.
The corpus used in these examples is the set of documents available over the Internet, a subset of which are hyperlinked documents collectively referred to as the “World Wide Web” or just the “Web”. Where the documents are pages of the Web, typically formatted as HTML documents, they might also be referred to as pages or Web pages.
Matching, such as to bring a page into a result set, is according to operating rules of the search engine. For example, where the search engine allows for fuzzy searches, a query for pages containing “world” and “soccer” and “scores” and “2006” might return pages that do not strictly contain all of those words. Other search engines might only return pages that have all of those words or synonyms of those words.
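The difference between these matching rules can be illustrated with a small sketch. The fraction-of-terms threshold is only one assumed way of making a search “fuzzy”, and the function names, the synonym table and the sample pages are hypothetical.

```python
import re

def words_in(text):
    """Return the set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches(text, terms, min_fraction=1.0, synonyms=None):
    """Decide whether a page matches a non-empty list of query terms.

    A term counts as found if the page contains it or one of its synonyms.
    The page matches when the fraction of found terms is at least
    min_fraction (1.0 means every term is required).
    """
    synonyms = synonyms or {}
    words = words_in(text)
    found = 0
    for term in terms:
        candidates = {term} | set(synonyms.get(term, ()))
        if words & candidates:
            found += 1
    return found / len(terms) >= min_fraction

terms = ["world", "soccer", "scores", "2006"]
page_a = "World soccer scores from last season"  # lacks "2006"
page_b = "2006 world football scores"            # uses a synonym of "soccer"

print(matches(page_a, terms))                     # strict rule
print(matches(page_a, terms, min_fraction=0.75))  # fuzzy rule
print(matches(page_b, terms, synonyms={"soccer": ["football"]}))
```

Under the strict rule page_a is rejected because it lacks “2006”; under the fuzzy rule it is accepted because three of the four terms are present. page_b is accepted by the synonym-aware strict rule only because “football” is listed as a synonym of “soccer”.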
Some limited attempts to solve this problem have been mentioned in the prior art. For instance, in a blog found at www(dot)nicklewis(dot)org/node/335, titled Nick Lewis: The Blog, the author speculates that certain content was added intentionally in a posting by a third party for the purpose of causing a search engine (Google) to punish the rank rating for the page. An online article entitled “Companies subvert search results to squelch criticism”, available at www(dot)ojr(dot)org/ojr/stories/050601glaser/, contains a similar description of such behavior, including instances where positive pages are created to try to boost rankings. In the first example, the author, because he had direct control over the content of the blog, was able to directly remove the offending materials and avoid the search engine “downgrading.”
Nonetheless there is a need to better overcome the shortcomings of the prior art.