As the quantity of information accessible on the internet has grown and continues to grow, the most widely used web sites have become those of internet search providers, which are devoted primarily to finding information elsewhere on the internet. Search providers operate what are known as search engines, which provide a user interface to a large database that associates an index of terms with addresses of information determined to be relevant to those terms. Such addresses typically consist of a Uniform Resource Locator (URL) corresponding to a web page accessible via the Hypertext Transfer Protocol (HTTP).
Search providers typically build a search database by analyzing the content of information accessible on the internet. In the context of web pages, search providers utilize automated programs—robots, or spiders—which “crawl” web sites by following links, retrieving published content, and then indexing the published content in accordance with proprietary weighting algorithms. Such content may also be indexed in conjunction with other information available from public sources, such as telephone or business directories.
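By way of illustration only, the crawl-and-index process described above can be sketched in a few lines of Python. The sketch assumes a hypothetical in-memory set of pages (the `PAGES` dictionary and the `example.com` URLs are invented for the example) in place of real HTTP retrieval, and it omits any weighting; the proprietary algorithms actually used by search providers are not public and are not represented here.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would retrieve each page over HTTP instead.
PAGES = {
    "http://example.com/": ("barber shop haircuts", ["http://example.com/hours"]),
    "http://example.com/hours": ("shop hours and location", ["http://example.com/"]),
}


def crawl_and_index(start_url, pages):
    """Follow links breadth-first from start_url and build term -> set of URLs."""
    index = defaultdict(set)
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url not in pages:
            continue  # link to a page outside the known set
        text, links = pages[url]
        # Index every term on the page against the page's address.
        for term in text.lower().split():
            index[term].add(url)
        # Queue each link that has not been visited yet.
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl_and_index("http://example.com/", PAGES)
```

Looking up `index["shop"]` in this sketch returns both addresses, while `index["haircuts"]` returns only the first, which is the term-to-address association a search engine's user interface would query.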
Search providers struggle to provide results that are relevant to the queries entered by users. The problem of relevance has several aspects. One of the intriguing features of the internet is that it is borderless: a web page from a business a block away from the user is as accessible as one on the opposite side of the globe. A user needing a haircut, however, is likely to consider a local barber shop to be more relevant to his or her problem than information about the world's largest barber shop half a continent away. In that context, geographic location is a major component of actual relevance. In other contexts, other particular information about the publisher of internet content may be a more significant component of actual relevance than what can be gleaned by crawling and indexing the publisher's content. For example, knowing the type of business in which a particular entity engages may be more relevant and more meaningful to a searcher than the name of the business.
Complicating the problem of providing relevant information are deliberate attempts by publishers to artificially increase the apparent relevance of their content in connection with particular search queries. The Hypertext Markup Language (HTML) itself includes definitions of meta-tags, which were originally intended to provide meta information, such as keywords and content summaries, apart from page content, for the purpose of indexing web content. However, in view of the value of high search engine rankings in connection with various search terms, internet publishers soon began to engage in “keyword spamming,” the repetitive use of terms within a web page for the sole purpose of increasing apparent search engine relevance in connection with those terms. Search providers have found that meta-tags and simple word-counting measures are useless for determining the content and relevant index terms of a web page, and in fact now de-rank sites which appear to be engaged in artificial relevance-boosting techniques. The exact ranking mechanisms used by search providers have become trade secrets, because to publish those mechanisms is to provide a road map to abuse. The result is a cat-and-mouse game between search providers and unethical publishers, in which each seeks to discover the techniques by which the other attempts to defeat its plans.
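The weakness of simple word counting can be shown with a minimal Python sketch. The scoring function `naive_score` and both sample pages are hypothetical and chosen only to illustrate the point; they do not represent any search provider's actual ranking mechanism.

```python
def naive_score(page_text, term):
    """Score a page by how many times the query term appears in it."""
    words = page_text.lower().split()
    return words.count(term.lower())


# A page that is actually about a barber shop.
honest = "We are a barber shop offering haircuts and shaves"

# A page about something else that repeats the term to inflate its score.
spam = "barber barber barber barber cheap watches barber barber"

honest_score = naive_score(honest, "barber")
spam_score = naive_score(spam, "barber")
```

Under this measure the spamming page outscores the page that is actually relevant to the query, which is why a provider relying on such counts alone would rank the wrong result first.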
The arms race between search providers and publishers arises primarily from the manner in which search providers collect information, namely by crawling published content on the internet. To provide, for example, geographically relevant information, search engine providers utilize algorithms for detecting the presence of postal addresses on a web page. However, so long as information about a publisher is obtained through the same anonymous channel as the published information itself, it will continue to be the subject of abuse and/or ambiguity. Of course, not all failures of relevance are the result of abuse; some merely reflect the limits of language. Someone selling “Jefferson Airplane Tickets” may be selling souvenirs of 1960s rock music, or travel to the state of Missouri. In the absence of such information as whether the publisher operates a travel agency or a memorabilia store, there is no search indexing algorithm which will detect the difference.