Searching networks and file systems for content has been provided in many forms, most commonly by a variant of a search engine. A search engine is a program that searches documents on a network for specified keywords and returns a list of the documents in which the keywords were found. Often, the documents on the network are first identified by “crawling” the network.
Crawling the network refers to using a network crawling program, or crawler, to identify the documents present on the network. A crawler is a computer program that automatically discovers and collects documents from one or more network locations while conducting a network crawl. The crawl begins by providing the crawler with a set of document addresses that act as seeds for the crawl and a set of crawl restriction rules that define the scope of the crawl. The crawler recursively gathers the network addresses of linked documents referenced in the documents retrieved during the crawl. The crawler retrieves each document from its network location (for example, a Web site), processes the received document data, and prepares the data for subsequent processing by other programs. For example, a crawler may use the retrieved data to create an index of documents available over the Internet or an intranet. A search engine can later use the index to locate documents that satisfy specified criteria.
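The crawl described above can be illustrated with a minimal sketch. It is not a definitive implementation: the in-memory `NETWORK` table, the example addresses, the prefix-based restriction rule, and the function names `in_scope`, `crawl`, and `search` are all assumptions made for illustration. The sketch also uses an iterative queue rather than literal recursion to follow links.

```python
from collections import deque

# Hypothetical in-memory "network": address -> (document text, linked addresses).
# A real crawler would fetch documents over the network instead.
NETWORK = {
    "http://a.example/index": (
        "welcome to the crawler demo",
        ["http://a.example/docs", "http://b.example/home"],
    ),
    "http://a.example/docs": (
        "crawler documentation and index notes",
        ["http://a.example/index"],
    ),
    "http://b.example/home": ("unrelated external page", []),
}


def in_scope(address, allowed_prefixes):
    """Assumed crawl restriction rule: the address must start with an allowed prefix."""
    return any(address.startswith(prefix) for prefix in allowed_prefixes)


def crawl(seeds, allowed_prefixes, fetch):
    """Crawl outward from the seeds and return an index: keyword -> set of addresses."""
    index = {}
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen and in_scope(seed, allowed_prefixes):
            seen.add(seed)
            queue.append(seed)
    while queue:
        address = queue.popleft()
        result = fetch(address)  # one roundtrip per document
        if result is None:
            continue  # document could not be retrieved
        text, links = result
        # Populate the index with a record for each keyword in the document.
        for word in text.lower().split():
            index.setdefault(word, set()).add(address)
        # Gather the addresses of linked documents that fall within the crawl scope.
        for link in links:
            if link not in seen and in_scope(link, allowed_prefixes):
                seen.add(link)
                queue.append(link)
    return index


def search(index, keyword):
    """Return the sorted addresses of documents that contain the keyword."""
    return sorted(index.get(keyword.lower(), set()))


index = crawl(["http://a.example/index"], ["http://a.example/"], NETWORK.get)
print(search(index, "crawler"))
```

In this sketch the document at `http://b.example/home` is never fetched, because the assumed restriction rule limits the crawl to addresses under `http://a.example/`.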
To retrieve documents in a crawl, an operation is executed for each document on the network to get the document and populate the index with a record for it. These roundtrip requests for documents can consume a great deal of bandwidth and processor time. In addition, for the search engine to provide accurate results, the index must accurately reflect the documents on the network. If the documents on the network change, through documents being altered, added, deleted, or otherwise modified, the index needs to be updated to reflect these changes. However, crawls of the network can be expensive operations, and repeated roundtrips to the network can overtax the bandwidth available between the indexer and the network.
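The kinds of changes the index must track can be illustrated by comparing two snapshots of the network. This is only a sketch under stated assumptions: the snapshot dictionaries, the document names, and the helpers `fingerprint` and `diff_snapshots` are hypothetical. Building the second snapshot this way still requires retrieving every document, which is the roundtrip cost noted above.

```python
import hashlib


def fingerprint(text):
    """Content hash used to detect whether a document was altered."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_snapshots(previous, current):
    """Compare two snapshots (address -> fingerprint) and classify the changes."""
    added = sorted(a for a in current if a not in previous)
    deleted = sorted(a for a in previous if a not in current)
    changed = sorted(
        a for a in current if a in previous and current[a] != previous[a]
    )
    return added, changed, deleted


# Hypothetical snapshots taken by two successive crawls.
previous = {
    "doc1": fingerprint("alpha"),
    "doc2": fingerprint("beta"),
    "doc3": fingerprint("gamma"),
}
current = {
    "doc1": fingerprint("alpha"),
    "doc2": fingerprint("beta v2"),
    "doc4": fingerprint("delta"),
}

added, changed, deleted = diff_snapshots(previous, current)
print(added, changed, deleted)
```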