A search engine system is essentially a distributed database system that provides a powerful source of indexed documents from the Internet (or an intranet). On the front end of a search engine, a user submits a search query, usually comprising two or three keywords that suggest the user's search interest. In response, the search engine returns a series of links to documents that match the search query in accordance with a set of predefined search criteria. On the back end, the search engine employs web crawlers that retrieve the content of hundreds of millions of web pages stored on various web hosts.
A web crawler, also known as a spider or a wanderer, is a special software program that automatically traverses the Internet. The capacity of a web crawler, i.e., the number of documents it crawls per unit time, is limited by the computer hardware resources (for example, disk space and network bandwidth) available to the crawler. To reduce the time interval between two consecutive visits to a web page by a web crawler, and thereby improve the freshness of search results, a search engine often fragments the Internet into multiple sub-spaces and dispatches multiple web crawlers to crawl the Internet simultaneously, with each web crawler responsible for accessing one of the sub-spaces.
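The effect of fragmentation on the revisit interval can be illustrated with a back-of-the-envelope sketch. The function name `revisit_interval` and all numbers below are illustrative assumptions, not figures from the text; the sketch also assumes that the sub-spaces are of equal size and that every crawler sustains the same capacity.

```python
def revisit_interval(total_pages: int, pages_per_hour: float, num_crawlers: int = 1) -> float:
    """Hours between two consecutive visits to the same page.

    Assumes (hypothetically) that pages are split evenly across crawlers
    and that each crawler sustains the same capacity.
    """
    if num_crawlers < 1 or pages_per_hour <= 0:
        raise ValueError("need at least one crawler with positive capacity")
    pages_per_crawler = total_pages / num_crawlers
    return pages_per_crawler / pages_per_hour


# Illustrative numbers only: 100 million pages, 1 million pages/hour per crawler.
single = revisit_interval(100_000_000, 1_000_000)        # one crawler
parallel = revisit_interval(100_000_000, 1_000_000, 10)  # ten crawlers, ten sub-spaces
print(single, parallel)
```

Under these assumptions, ten crawlers working on ten equal sub-spaces cut the interval between visits to one tenth of the single-crawler value.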
One approach to fragmenting the Internet is to group web pages at different web hosts into different categories according to their contents. For example, a “news” category may include web pages whose contents are closely related to news, and an “education” category may contain web pages that are deemed to be related to education. Accordingly, one or more news web crawlers are directed to crawl web pages in the “news” category, and a certain number of education web crawlers are devoted to the web pages in the “education” category.
It is quite common for a web host to store web pages belonging to multiple categories. For instance, a web portal like www.yahoo.com often has one subdomain like education.yahoo.com storing education-related information and another subdomain like news.yahoo.com storing news-oriented information. As a result, multiple web crawlers associated with different categories may submit document retrieval requests to a web host simultaneously, each request competing against the others for the resources (load capacity) of the web host.
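The way category-based fragmentation leads several crawlers to the same web host can be sketched as follows. This is a minimal illustration, not a description of any actual crawler: the `PAGE_CATEGORIES` table, the URLs in it, and the helper names are hypothetical, and the sketch assumes that all names under one registered domain are served by a single web host.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical page-to-category assignment; a real system would classify by content.
PAGE_CATEGORIES = {
    "http://news.yahoo.com/world": "news",
    "http://education.yahoo.com/colleges": "education",
    "http://www.example.edu/courses": "education",
}


def host_of(url: str) -> str:
    """Map a URL to the web host assumed to serve it.

    Assumption for illustration: every name under the same registered
    domain (its last two labels) is served by one web host.
    """
    labels = urlsplit(url).hostname.split(".")
    return ".".join(labels[-2:])


def crawlers_per_host(page_categories: dict[str, str]) -> dict[str, set[str]]:
    """Return, for each host, the set of category crawlers that will contact it."""
    result = defaultdict(set)
    for url, category in page_categories.items():
        result[host_of(url)].add(category)
    return dict(result)


contention = crawlers_per_host(PAGE_CATEGORIES)
print(contention)
```

Any host whose set contains more than one category is a host where crawlers from different categories may compete for load capacity at the same time.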
The load capacity of a web host, however, is often limited by the web host's hardware setup. When the simultaneous requests for load capacity from various web crawlers exceed the maximum load capacity a web host can provide, it is almost certain that some of the competing web crawlers will receive slow service from the web host, and some requests may even fail. This phenomenon is sometimes referred to as “load capacity starvation” for a web crawler. Load capacity starvation prevents web crawlers from retrieving documents from a web host and passing them to an indexer in a timely fashion, which adversely affects both the web host and the freshness of search results generated by the search engine.
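Load capacity starvation can be illustrated with a toy model. The sketch below assumes a host that serves requests first come, first served up to a fixed capacity per time step; the `serve` function, the capacity of 100, and the demand figures are hypothetical and chosen only to make the effect visible.

```python
def serve(requests_by_crawler: dict[str, int], host_capacity: int) -> dict[str, int]:
    """Serve requests in arrival order until the host's capacity is used up.

    Returns the number of requests served per crawler. First-come,
    first-served ordering is an assumption of this sketch.
    """
    remaining = host_capacity
    served = {}
    for crawler, wanted in requests_by_crawler.items():
        granted = min(wanted, remaining)
        served[crawler] = granted
        remaining -= granted
    return served


# Hypothetical demand in one time step: together the crawlers ask for 130 requests.
demand = {"news": 80, "education": 50}
served = serve(demand, host_capacity=100)

# Requests that the host could not serve in this time step.
starved = {c: demand[c] - served[c] for c in demand if served[c] < demand[c]}
print(served, starved)
```

In this toy run the crawler that arrives second receives only part of what it requested; its remaining requests are delayed or fail, which corresponds to the starvation described above.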
Another problem with the fragmentation strategy discussed above is that the interaction between web crawlers and web hosts is a highly dynamic process. The amount of load capacity that a web crawler requires of a particular web host may vary significantly over time, even from minute to minute. Therefore, a permanent or long-term grant of load capacity for a particular web host to a particular web crawler is likely to be inefficient and may prevent other web crawlers from obtaining needed load capacity.
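The inefficiency of a fixed grant under fluctuating demand can be shown with simple arithmetic. The sketch below is only an illustration under stated assumptions: the function `static_grant_outcome`, the even 50/50 split of a host capacity of 100, and the per-minute demand figures are all hypothetical.

```python
def static_grant_outcome(grants: dict[str, int], demand: dict[str, list[int]]) -> tuple[int, int]:
    """Return (unused_capacity, unmet_demand) summed over all time steps.

    Each crawler may use at most its fixed grant in every time step,
    regardless of what the other crawlers need.
    """
    unused = unmet = 0
    for crawler, grant in grants.items():
        for wanted in demand[crawler]:
            served = min(wanted, grant)
            unused += grant - served   # granted capacity that sat idle
            unmet += wanted - served   # requests the grant could not cover
    return unused, unmet


# Hypothetical fixed grants on a host with capacity 100, and demand per minute.
grants = {"news": 50, "education": 50}
demand = {"news": [80, 20, 10], "education": [20, 70, 90]}
unused, unmet = static_grant_outcome(grants, demand)
print(unused, unmet)
```

With these numbers, capacity reserved for one crawler sits idle in the same minutes in which the other crawler cannot obtain what it needs.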