Search engines may include web crawlers that automatically visit web pages on the World Wide Web (the Web) to create an index of the content of the web pages. For example, a web crawler may start with an initial set of web pages having known URLs. The web crawler downloads the web pages and extracts text and metadata reflecting the content of the web pages. The web crawler also extracts any new URLs (e.g., hyperlinks) contained in the downloaded web pages, and adds the new URLs to a list of URLs to be scanned. As the web crawler retrieves the new URLs from the list and scans the new web pages corresponding to the new URLs, more text and metadata are extracted, and more URLs are added to the list. The text and metadata collected from the scanned web pages may be used to generate a searchable index for providing search services.
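The crawl loop described above can be sketched in Python. This is a minimal illustration, not any particular search engine's implementation: the `fetch` callable and the in-memory `PAGES` mapping are assumptions standing in for real HTTP retrieval, and the "index" here simply stores raw page content where a real crawler would extract text and metadata.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags as a page is parsed."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, fetch):
    """Breadth-first crawl: scan each URL, record its content,
    and add any newly discovered URLs to the list to be scanned."""
    frontier = deque(seed_urls)   # URLs still to be scanned
    seen = set(seed_urls)         # avoid re-scanning the same URL
    index = {}                    # URL -> page content (stand-in for text/metadata)
    while frontier:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:          # unreachable page; skip it
            continue
        index[url] = page
        extractor = LinkExtractor()
        extractor.feed(page)
        for link in extractor.links:
            if link not in seen:  # only new URLs join the frontier
                seen.add(link)
                frontier.append(link)
    return index

# Tiny in-memory "web" standing in for real HTTP fetches (hypothetical URLs).
PAGES = {
    "http://a.example": '<a href="http://b.example">next</a> page A text',
    "http://b.example": '<a href="http://a.example">back</a> page B text',
}
index = crawl(["http://a.example"], PAGES.get)
```

Starting from the single seed URL, the loop discovers and scans both pages, and the `seen` set keeps the mutual links from causing an endless cycle.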
The Web has become very large and is estimated to contain over one trillion unique URLs. Further, crawling, storing, and indexing web pages are resource-intensive processes, consuming large amounts of both computing and storage resources. Thus, not all web pages on the Web are crawled. In addition, of the web pages that are crawled, search engines typically cannot index all of the collected information due to resource limitations. Thus, search engines may select only some of the crawled web pages for indexing.