The Internet contains billions of documents (e.g., web pages) that are identified by respective uniform resource locators (URLs). Internet search engines index these documents, rank them, and perform queries against them. Web crawlers are applications that download web pages and index them (together with their respective URLs) according to a particular categorization scheme. Web crawlers are often utilized to populate the document indices upon which search engines rely.
Web pages can be classified into categories such as academic papers, commercial products, customer reviews, news, and blogs. Each of these categories represents only a portion of all documents available on the Internet. Consequently, using a general web crawl (e.g., one that employs a random or non-targeted approach) to find and index documents in a particular category is computationally expensive, because such documents occur relatively infrequently among the billions of available Internet documents. For example, commercial product pages are estimated to constitute only 0.5 to 4.0 percent of all web pages on the Internet. A non-targeted crawl would therefore encounter, on average, only one product page in every 25 to 200 pages downloaded (1/0.04 to 1/0.005), so indexing a given number of product pages with a general web crawl would require roughly 25 to 200 times the resources, hardware, and/or computer processing power needed to index the same number of general (e.g., uncategorized) web pages.
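The 25 to 200 times figure follows from taking the reciprocal of the estimated category frequency. The short sketch below illustrates that arithmetic only. It assumes that a general crawl samples pages approximately uniformly and that every page costs about the same to download and process; the function name `pages_crawled_per_hit` is hypothetical and is introduced solely for this illustration.

```python
def pages_crawled_per_hit(category_fraction: float) -> float:
    """Expected number of pages a non-targeted crawl downloads per
    in-category page, assuming pages are sampled uniformly at random."""
    if not 0.0 < category_fraction <= 1.0:
        raise ValueError("category_fraction must be in (0, 1]")
    return 1.0 / category_fraction


# Estimated share of commercial product pages, taken from the text above.
LOW_ESTIMATE = 0.005   # 0.5 percent
HIGH_ESTIMATE = 0.04   # 4.0 percent

best_case = pages_crawled_per_hit(HIGH_ESTIMATE)   # about 25 pages per product page
worst_case = pages_crawled_per_hit(LOW_ESTIMATE)   # about 200 pages per product page

print(f"Relative cost of a general crawl: {best_case:.0f}x to {worst_case:.0f}x")
```

Under these assumptions, the relative cost grows inversely with the category's share of the web, so a rarer category is proportionally more expensive to collect with a non-targeted crawl.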