When crawling URLs on a network, a URL crawler imposes a limit on the crawling throughput for each host, measured as the number of URL crawls allowed per unit of time. For example, the URL crawl capacity for a particular host may be 10 URLs per second. This limit is motivated by the need to avoid putting excessive load on the host, as well as the desire to respect the host's explicitly stated preferences regarding the crawl capacity it makes available to URL crawlers.
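The per-host limit described above can be illustrated with a minimal sketch. It assumes a token-bucket scheme, which the text does not specify; the class name `PerHostCrawlLimiter`, its methods, and the caller-supplied `now` timestamp are all hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit


class PerHostCrawlLimiter:
    """Hypothetical token-bucket limiter keyed by host (an assumption, not the source's method)."""

    def __init__(self, default_rate=10.0):
        self.default_rate = default_rate  # crawls per second, e.g. 10 URLs/s
        self.rates = {}                   # host -> rate override (stated host preference)
        self.tokens = {}                  # host -> tokens currently available
        self.last_seen = {}               # host -> timestamp of the last refill

    def set_rate(self, host, rate):
        """Record a host-specific crawl capacity in crawls per second."""
        self.rates[host] = rate

    def try_acquire(self, url, now):
        """Return True if a crawl of `url` is allowed at time `now` (in seconds)."""
        host = urlsplit(url).hostname
        rate = self.rates.get(host, self.default_rate)
        burst = max(1.0, rate)            # bucket holds at most one second of capacity
        tokens = self.tokens.get(host, burst)
        last = self.last_seen.get(host, now)
        # Refill in proportion to elapsed time, capped at the bucket size.
        tokens = min(burst, tokens + (now - last) * rate)
        self.last_seen[host] = now
        if tokens >= 1.0:
            self.tokens[host] = tokens - 1.0
            return True
        self.tokens[host] = tokens
        return False
```

Passing the time in explicitly keeps the sketch deterministic; a real crawler would more likely read a monotonic clock. Each host has its own bucket, so exhausting the capacity of one host does not affect crawls directed at another.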
In the event that this crawl capacity limit is shared among several applications, any individual application faces not only limited crawl capacity (which may change with time) but also competition from other applications for the same limited crawl capacity at the same host at any given time. In this situation, the pending URL crawls from the competing applications wait in a queue, and the URL crawler takes pending URL crawls at a rate not exceeding the available crawling capacity. If an application's total crawling needs do not exceed the capacity available to it, all of its crawling demands will eventually be satisfied. However, if the application's crawling demands exceed the capacity available to it, the application uses up all available crawling capacity and pending URL crawls at the tail of the queue remain uncrawled, regardless of the relative importance of the various pending URL crawls. It would be desirable to perform the more important URL crawls requested by applications with low latency.
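The shortcoming described above can be made concrete with a small simulation. This is a sketch under stated assumptions: the function `drain_fifo`, the tuple layout `(application, url, importance)`, the host name, and the discrete "tick" model of capacity are all hypothetical and are not taken from the text.

```python
from collections import deque


def drain_fifo(pending, capacity_per_tick, ticks):
    """Serve pending crawls in arrival order, at most `capacity_per_tick` per tick.

    Returns (crawled, uncrawled). Importance is deliberately ignored,
    mirroring the first-come, first-served behavior described above.
    """
    queue = deque(pending)
    crawled = []
    for _ in range(ticks):
        for _ in range(capacity_per_tick):
            if not queue:
                return crawled, []
            crawled.append(queue.popleft())
    return crawled, list(queue)


# Hypothetical pending crawls from two applications sharing one host.
pending = [
    ("app_a", "http://host.example/a1", 1),
    ("app_a", "http://host.example/a2", 1),
    ("app_b", "http://host.example/b1", 1),
    ("app_a", "http://host.example/a3", 1),
    ("app_b", "http://host.example/b2", 9),  # most important, but queued late
    ("app_a", "http://host.example/a4", 1),
]

crawled, uncrawled = drain_fifo(pending, capacity_per_tick=2, ticks=2)
# Four crawls fit in the available capacity; the importance-9 URL is left at the tail.
```

With a capacity of two crawls per tick over two ticks, only the first four entries are crawled, and the entry with importance 9 remains pending even though it matters most. With three ticks, demand no longer exceeds capacity and every entry is eventually crawled.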