In recent years, there has been a dramatic increase in the amount of content that is available on the World Wide Web (the “Web”). Typically, the content is organized as HTML Web pages. The number of pages accessible through the Web is estimated to be in the hundreds of millions. To help users locate pages of interest, a large number of public search engines are currently in operation, for example, Alta Vista, Infoseek, HotBot, Excite, and many others.
A typical search engine will periodically scan the Web with a “spider” or “web crawler” to locate new or changed Web pages. The pages are parsed into an index of words maintained by the search engine. The index correlates words to page locations. Then, using a query interface, users can rapidly locate pages having specific content by combining keywords with logical operators in queries. Usually, the search engine will return a rank-ordered list of pages which satisfy a query. The pages are identified by their Uniform Resource Locators (URLs) and a short excerpt. The user can then use a standard Web browser to download interesting pages by specifying their URLs, most often using “hot” links.
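The word-to-location index and keyword queries described above can be illustrated with a minimal sketch. It is not how any particular search engine is implemented. The page URLs and text, and the names `build_index`, `query_and`, and `query_or`, are hypothetical and chosen only for illustration. Ranking and excerpts are omitted.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def query_and(index, *words):
    """Return the URLs of pages that contain every given word (logical AND)."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()


def query_or(index, *words):
    """Return the URLs of pages that contain at least one given word (logical OR)."""
    result = set()
    for w in words:
        result |= index.get(w.lower(), set())
    return result


# Hypothetical crawled pages: URL -> page text.
pages = {
    "http://example.com/a": "Web crawler basics",
    "http://example.com/b": "Search engine index and crawler design",
    "http://example.com/c": "Cooking with a search for flavor",
}
index = build_index(pages)
both = query_and(index, "crawler", "search")
either = query_or(index, "crawler", "flavor")
```

Here `both` holds only the page that contains both keywords, while `either` holds every page that contains at least one of them.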
Another type of search engine, called a meta-search engine (e.g., “http://www.metacrawler.com”), accepts a query from a user and passes the query to a number of conventional search engines. Meta-search engines may well be useful if the amount of overlap between indexes of popular search engines is low.
Therefore, users and designers of search engines are often interested in knowing how good the coverage of different search engines is. Here, coverage means the relative sizes of the indexes, i.e., the number of pages indexed, and the relative amount of overlap between indexes, i.e., the number of pages of one search engine that are also indexed by another.
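To make the two coverage quantities concrete, the following sketch computes them for two small, fully known sets of page identifiers. This is purely illustrative: it assumes direct access to each index, which public search engines do not provide. The function names and the sample sets are hypothetical.

```python
def relative_size(index_a, index_b):
    """Size of index_a relative to index_b (ratio of page counts)."""
    return len(index_a) / len(index_b)


def overlap_fraction(index_a, index_b):
    """Fraction of the pages in index_a that are also in index_b."""
    if not index_a:
        return 0.0
    return len(index_a & index_b) / len(index_a)


# Hypothetical indexes, each a set of page identifiers.
engine_a = {"p1", "p2", "p3", "p4"}
engine_b = {"p3", "p4", "p5", "p6", "p7", "p8"}

size_ratio = relative_size(engine_a, engine_b)   # 4 pages vs. 6 pages
a_in_b = overlap_fraction(engine_a, engine_b)    # 2 of A's 4 pages are in B
b_in_a = overlap_fraction(engine_b, engine_a)    # 2 of B's 6 pages are in A
```

Note that overlap is not symmetric: the same two shared pages are half of the smaller index but only a third of the larger one.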
However, currently there is no good way to measure relative coverage of public search engines. Although many studies have tried to measure coverage, the studies often reach contradictory conclusions since no standardized test has been defined. A large bibliography of such studies is maintained at: http://www.ub2.1u.se/desire/radar/lit-about-search-services.html.
Most comparisons are highly subjective since they tend to rely on information such as spider-access logs obtained from a few sites. Often, they make size estimates by sampling with a few arbitrarily chosen queries, which are subject to various biases, and/or by using estimates provided by the search engines themselves. In either case, the resulting estimates are unreliable.
For example, if a search engine claims a search result of about 10,000 pages, then the result may well include duplicate pages, aliased URLs, and pages which have since been deleted. In fact, the search engine itself may only scan a small part of its index, say 10%, and return the first couple of hundred pages. The total number of qualifying pages that it thinks it has indexed and could have returned is just an extrapolation.
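The extrapolation described above can be sketched as follows. This is an assumed, simplified model, not the behavior of any specific search engine: the engine counts matches in the scanned fraction of its index and scales that count up. The function name `extrapolate_total` and the sample index are hypothetical.

```python
def extrapolate_total(matches_found, fraction_scanned):
    """Scale a match count from a partial scan up to the whole index."""
    if not 0 < fraction_scanned <= 1:
        raise ValueError("fraction_scanned must be in (0, 1]")
    return round(matches_found / fraction_scanned)


# Hypothetical index of 1000 pages in which the 50 matching pages
# happen to be clustered at the front of the index.
index_size = 1000
matching = [i < 50 for i in range(index_size)]

# The engine scans only the first 10% of its index.
scanned = matching[: index_size // 10]
reported = extrapolate_total(sum(scanned), 0.10)
actual = sum(matching)
```

In this contrived case the engine would report 500 matching pages although only 50 exist, which shows why a reported result count is a weak basis for size estimates.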
Therefore, it is desired to provide a standardized method for measuring the relative coverage of search engines. It should be possible to work the method without having privileged access to the internals of the search engines. That is, it should be possible to estimate the coverage from public access points.