A search engine is a tool that identifies documents, typically stored on hosts distributed over a network, satisfying search queries specified by users. Web search engines work by storing information about a large number of web pages (hereinafter also referred to as “pages” or “documents”), which they retrieve from the World Wide Web (WWW). These documents are retrieved by a web crawler. The web crawler follows links found in crawled documents so as to discover additional documents to download. The contents of the downloaded documents are indexed, mapping the terms in the documents to identifiers of the documents. The resulting index is configured to enable a search to identify documents matching the terms in search queries. Some search engines also store all or part of the document itself, in addition to the index entries. When a user submits a search query having one or more terms, the search engine searches the index for documents that satisfy the query, and provides a listing of matching documents, typically including for each listed document the URL, the title of the document, and, in some search engines, a portion of the document's text deemed relevant to the query.
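The crawl, index, and query steps described above can be illustrated with a minimal sketch. It is not taken from any particular search engine: the in-memory `TOY_WEB` dictionary stands in for the network, and the names `crawl`, `build_index`, and `search`, the example URLs, and the whitespace tokenization are all assumptions made for illustration. Queries are treated as a conjunction of terms, which is one possible reading of "documents that satisfy the query".

```python
from collections import defaultdict, deque

# Hypothetical stand-in for the web: URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": ("search engines index web pages",
                          ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("a crawler follows links between pages",
                          ["http://c.example/"]),
    "http://c.example/": ("an index maps terms to documents",
                          ["http://a.example/"]),
}


def crawl(seed, web):
    """Breadth-first crawl: download a page, then queue any unseen links."""
    seen = {seed}
    queue = deque([seed])
    docs = {}
    while queue:
        url = queue.popleft()
        if url not in web:  # unreachable or missing page (assumed behavior)
            continue
        text, links = web[url]
        docs[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return docs


def build_index(docs):
    """Inverted index: map each term to the set of document identifiers."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return the documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

For example, `search(build_index(crawl("http://a.example/", TOY_WEB)), "index documents")` returns only the URL of the third toy page, since it is the only one containing both terms. A production index would also store positions, titles, and snippets, which this sketch omits.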
To keep within the capacity limits of the crawler, automated selection mechanisms are needed to determine not only which web pages to crawl, but also which web pages to avoid crawling. For instance, as of the end of 2003, the WWW is believed to include well in excess of 10 billion distinct documents or web pages, while a search engine may have a crawling capacity of less than half that many documents.
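One simple way to picture such a selection mechanism is a capacity-bounded choice over scored candidates. The text above does not say how pages are ranked, so the per-URL priority scores, the function name `select_for_crawl`, and the top-k rule below are purely hypothetical, offered only as a sketch of the idea that a fixed capacity splits candidates into pages to crawl and pages to avoid.

```python
import heapq


def select_for_crawl(scored_urls, capacity):
    """Split candidate URLs into (to_crawl, to_skip) under a crawl capacity.

    scored_urls: dict mapping URL -> priority score (hypothetical values;
                 how such scores are computed is outside this sketch).
    capacity:    maximum number of pages the crawler can download.
    """
    if capacity < 0:
        raise ValueError("capacity must be non-negative")
    # Keep the highest-scoring URLs, up to the capacity limit.
    to_crawl = heapq.nlargest(capacity, scored_urls, key=scored_urls.get)
    chosen = set(to_crawl)
    # Everything else is deliberately left uncrawled.
    to_skip = [url for url in scored_urls if url not in chosen]
    return to_crawl, to_skip
```

With four candidates and a capacity of two, half of the candidates are skipped, mirroring the situation in which crawling capacity is less than half the number of known documents.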