This invention relates generally to search engines used on the World Wide Web, and more particularly to estimating the relative sizes and overlap of indexes maintained by these search engines.
In recent years, there has been a dramatic increase in the amount of content that is available on the World Wide Web (the "Web"). Typically, the content is organized as HTML Web pages. The total number of pages accessible through the Web is estimated to be in the hundreds of millions. In order to locate pages of interest, a large number of public search engines are currently in operation, for example, AltaVista, Infoseek, HotBot, Excite, and many others.
A typical search engine will periodically scan the Web with a "spider" or "web crawler" to locate new or changed Web pages. The pages are parsed into an index of words maintained by the search engine. The index correlates words to page locations. Then, using a query interface, users can rapidly locate pages having specific content by combining keywords with logical operators in queries. Usually, the search engine will return a rank-ordered list of pages which satisfy a query. The pages are identified by their Uniform Resource Locators (URLs) and a short excerpt. The user can then use a standard Web browser to download interesting pages by specifying their URLs, most often using "hot" links.
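By way of illustration only, an index that correlates words to page locations, and the evaluation of conjunctive and disjunctive queries against it, might be sketched as follows. The function names, the whitespace tokenization, and the example.com URLs are assumptions made for this sketch and do not describe any particular search engine.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of URLs of the pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():  # naive tokenization (assumption)
            index[word].add(url)
    return index

def search_and(index, words):
    """Conjunctive query: URLs of pages containing every word."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

def search_or(index, words):
    """Disjunctive query: URLs of pages containing at least one word."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.union(*sets) if sets else set()

# Hypothetical two-page collection used only to exercise the sketch.
pages = {
    "http://example.com/a": "web search engines index pages",
    "http://example.com/b": "a spider scans web pages",
}
index = build_index(pages)
both = search_and(index, ["web", "pages"])      # pages containing both words
either = search_or(index, ["spider", "index"])  # pages containing either word
```

A production index would additionally store ranking information and excerpts; the sketch keeps only the word-to-location mapping described above.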
Another type of search engine, called a meta-search engine (e.g., "http://www.metacrawler.com"), accepts a query from a user and passes the query to a number of conventional search engines. Meta-search engines may well be useful if the amount of overlap between indexes of popular search engines is low.
Therefore, users and designers of search engines are often interested in knowing how good the coverage of different search engines is. Here, coverage means the relative sizes of the indexes, i.e., the number of pages indexed, and the relative amount of overlap between indexes, i.e., the number of pages of one search engine that are also indexed by another.
However, currently there is no good way to measure relative coverage of public search engines. Although many studies have tried to measure coverage, the studies often reach contradictory conclusions since no standardized test has been defined. A large bibliography of such studies is maintained at: http://www.ub2.1u.se/desire/radar/lit-about-search-services.html.
Most comparisons are highly subjective since they tend to rely on information such as spider-access logs obtained from a few sites. Often, they make size estimates by sampling with a few arbitrarily chosen queries, which are subject to various biases, and/or by using estimates provided by the search engines themselves. In either case, this makes the estimates unreliable.
For example, if a search engine claims a search result of about 10,000 pages, then the result may well include duplicate pages, aliased URLs, and pages which have since been deleted. In fact, the search engine itself may only scan a small part of its index, say 10%, and return the first couple of hundred pages. The total number of qualifying pages that it thinks it has indexed and could have returned is just an extrapolation.
Therefore, it is desired to provide a standardized method for measuring the relative coverage of search engines. It should be possible to work the method without having privileged access to the internals of the search engines. That is, it should be possible to estimate the coverage from public access points.
A method is provided for estimating coverage of search engines used with the World Wide Web. Each search engine maintains an index of words of pages located at specific addresses of a network. A random query is generated. The random query is a logical combination of words found in a subset of Web pages, referred to as the training set 311. Preferably, the training set 311 of pages is representative of the pages on the Web in general, or possibly of a particular domain.
The random query is submitted to a first search engine. The first search engine returns a set of addresses in response. The set of addresses identifies pages indexed by the first search engine. A particular address identifying a sample page is randomly selected from this set, and a strong query is generated for the sample page. The strong query is highly dependent on the content of the sample page. The strong query is submitted to other search engines.
The results received from the other search engines are compared to information about the sample page to determine if the other search engines have indexed the sample page. In other words, random queries are used to extract random pages from one search engine, and strong queries derived from the random pages are used to test if other search engines have indexed the page. Thus, the relative size and overlap between the first and other search engines can be estimated.
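By way of illustration only, the estimate might be computed as sketched below. In this sketch each search engine index is modeled as a plain set of addresses; drawing a page uniformly from that set stands in for the random query and random selection, and a set-membership test stands in for the strong-query check. These simplifications, and all names used, are assumptions of the sketch. The size ratio relies on the identity |A| / |B| = (|A∩B| / |B|) / (|A∩B| / |A|).

```python
import random

def overlap_fraction(source_index, other_index, samples, rng):
    """Estimate the fraction of source_index's pages also in other_index."""
    addresses = sorted(source_index)
    hits = 0
    for _ in range(samples):
        sample_page = rng.choice(addresses)  # stand-in: random query + random pick
        if sample_page in other_index:       # stand-in: strong-query check
            hits += 1
    return hits / samples

def relative_size(frac_a_in_b, frac_b_in_a):
    """Estimate |A| / |B| from the two measured overlap fractions."""
    if frac_a_in_b == 0:
        raise ValueError("no overlap observed; size ratio is undefined")
    return frac_b_in_a / frac_a_in_b

# Hypothetical indexes: |A| = 1000, |B| = 500, and 400 pages in common.
engine_a = set(range(0, 1000))
engine_b = set(range(600, 1100))

rng = random.Random(0)
a_in_b = overlap_fraction(engine_a, engine_b, 5000, rng)  # true value 0.4
b_in_a = overlap_fraction(engine_b, engine_a, 5000, rng)  # true value 0.8
size_ratio = relative_size(a_in_b, b_in_a)                # true value 2.0
```

Against real search engines the sampled pages are not uniformly distributed, so any estimate obtained this way inherits whatever bias the random queries introduce.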
In one aspect of the invention, a lexicon of words is constructed from a training set of pages, and the frequencies of unique words in the lexicon are determined. The lexicon and word frequencies can be used to select words that are combined into the random query. The random query can be disjunctive or conjunctive. In another aspect of the invention, the strong query is a disjunction of two conjunctive queries.
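By way of illustration only, the lexicon, the random query, and the strong query might be sketched as follows. Several details are assumptions of the sketch rather than statements of the method: word frequency is counted as the number of training pages containing a word, random-query words are drawn uniformly from a frequency band, the strong query is built from the sample page's least frequent words, and the "AND"/"OR" query syntax is hypothetical.

```python
import random
from collections import Counter

def build_lexicon(training_pages):
    """Count, for each unique word, how many training pages contain it."""
    lexicon = Counter()
    for text in training_pages:
        lexicon.update(set(text.lower().split()))
    return lexicon

def random_query(lexicon, k, rng, operator="OR", min_freq=1, max_freq=None):
    """Combine k words drawn from a frequency band of the lexicon."""
    candidates = sorted(
        w for w, f in lexicon.items()
        if f >= min_freq and (max_freq is None or f <= max_freq)
    )
    if len(candidates) < k:
        raise ValueError("not enough candidate words in the lexicon")
    return f" {operator} ".join(rng.sample(candidates, k))

def strong_query(page_text, lexicon, k=4):
    """Disjunction of two conjunctive queries over the page's rarest words."""
    words = sorted(set(page_text.lower().split()),
                   key=lambda w: (lexicon.get(w, 0), w))
    if len(words) < 2 * k:
        raise ValueError("page has too few distinct words")
    first, second = words[:k], words[k:2 * k]
    return "(" + " AND ".join(first) + ") OR (" + " AND ".join(second) + ")"

# Hypothetical training set used only to exercise the sketch.
training_pages = ["alpha beta gamma", "alpha beta delta", "alpha epsilon zeta"]
lexicon = build_lexicon(training_pages)
query = random_query(lexicon, 2, random.Random(0), operator="AND", max_freq=2)
strong = strong_query("alpha beta gamma delta", lexicon, k=2)
```

Preferring rare words makes the strong query unlikely to match pages other than the sample page, which is one way to make it highly dependent on that page's content.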