A search engine answers user queries using an index of the documents in its documents database. It follows that, for a search engine to provide results that are relevant to a query, the search engine's index must cover content, or documents, that are relevant to that query.
Typically, the index and documents databases are populated using one or more crawlers and an indexer. A crawler follows links and fetches content, such as web pages or other documents, from a network such as the internet or web. An indexer creates an index of the fetched documents. To fetch documents, a crawler usually operates on a prioritized list of links corresponding to the documents. The crawler can run continuously or less frequently. An indexer typically runs periodically, such as once a week, and builds an index database. After the indexer builds the index database, a search engine can use it to identify documents relevant to a search request.
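The crawl-then-index flow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the in-memory FAKE_WEB dictionary stands in for a real network, and the names crawl, build_index, search, priority, and budget are hypothetical and not taken from any particular system.

```python
import heapq
from collections import defaultdict

# Hypothetical in-memory "network": URL -> (document text, outgoing links).
FAKE_WEB = {
    "a.example/home": ("search engine index", ["a.example/docs", "b.example/news"]),
    "a.example/docs": ("crawler fetches documents", []),
    "b.example/news": ("index database built weekly", []),
}

def crawl(seeds, priority, budget):
    """Fetch up to `budget` documents, lowest priority value first."""
    frontier = [(priority(url), url) for url in seeds]  # prioritized list of links
    heapq.heapify(frontier)
    seen = set(seeds)
    fetched = {}
    while frontier and len(fetched) < budget:
        _, url = heapq.heappop(frontier)
        if url not in FAKE_WEB:
            continue  # unreachable link; nothing to fetch
        text, links = FAKE_WEB[url]
        fetched[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (priority(link), link))
    return fetched

def build_index(documents):
    """Build an inverted index: term -> set of URLs containing the term."""
    index = defaultdict(set)
    for url, text in documents.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every term of the query."""
    term_sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

# Crawl with a budget large enough to fetch everything, then index and search.
documents = crawl(["a.example/home"], priority=len, budget=3)
index = build_index(documents)
results = search(index, "index")
```

In this sketch the search can only return documents that the crawler fetched before the index was built, which is the dependency the preceding paragraphs describe.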
A crawler has a limited capacity, which may delay or limit the links crawled and/or the documents fetched by the crawler. While a crawler can prioritize its operation by prioritizing the links that it crawls, this may not be sufficient to ensure that relevant documents are not missed. The crawler may not prioritize the documents properly, which can result in less relevant documents being fetched while more relevant documents are missed or fetched later. Additionally, some sites have policies that block fetchers or limit the maximum number of connections per fetcher, which can delay or prevent the fetching of discovered pages. These problems delay, or prevent, documents from becoming available for searching, and result in the search engine providing less than satisfactory performance, e.g., results that lack the most relevant documents.
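The effect of these limits can be shown with a small sketch of one crawl round. It assumes a hypothetical scheduler, schedule_round, with an overall capacity and a per-host connection cap (max_per_host); the names, the URLs, and the simple host parsing are illustrative assumptions rather than a description of any real crawler.

```python
from collections import Counter

def schedule_round(prioritized_links, capacity, max_per_host):
    """Pick links for one crawl round; return (fetched, deferred).

    `prioritized_links` is assumed to be sorted, highest priority first.
    """
    per_host = Counter()
    fetched, deferred = [], []
    for url in prioritized_links:
        host = url.split("/", 1)[0]  # simplistic host extraction for the sketch
        if len(fetched) < capacity and per_host[host] < max_per_host:
            per_host[host] += 1
            fetched.append(url)
        else:
            deferred.append(url)  # delayed to a later round, or never fetched
    return fetched, deferred

links = ["a.example/1", "a.example/2", "a.example/3", "b.example/1", "c.example/1"]
fetched, deferred = schedule_round(links, capacity=3, max_per_host=2)
```

Here a.example/3 is deferred by the per-host cap and c.example/1 by the overall capacity, regardless of how relevant either document might be to later queries.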