A search engine is a software program designed to help a user access files stored on a computer, for example on the World Wide Web (WWW), by allowing the user to ask for documents meeting certain criteria (e.g., those containing a given word, a set of words, or a phrase) and retrieving files that match those criteria. Web search engines work by storing information about a large number of web pages (hereinafter also referred to as “pages” or “documents”), which they retrieve from the WWW. These documents are retrieved by a web crawler or spider, which is an automated web browser that follows every link it encounters in a crawled document. The contents of each document are indexed, thereby adding data concerning the words or terms in the document to an index database for use in responding to queries. Some search engines also store all or part of the document itself, in addition to the index entries. When a user makes a search query having one or more terms, the search engine searches the index for documents that satisfy the query, and provides a listing of matching documents, typically including for each listed document the URL, the title of the document, and in some search engines a portion of the document's text deemed relevant to the query.
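By way of illustration only, the indexing and query steps described above can be sketched in Python as a simple inverted index. The function names build_index and search, the example URLs, and the whitespace-based tokenization are assumptions made for this sketch and do not describe any particular search engine.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase term to the set of document URLs that contain it.

    `documents` is a hypothetical dict of URL -> page text.
    """
    index = defaultdict(set)
    for url, text in documents.items():
        # Naive tokenization: split on whitespace (an assumption of this sketch).
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Start from the documents matching the first term, then intersect.
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return sorted(matches)


# Hypothetical crawled documents.
documents = {
    "http://example.com/a": "web crawler follows links",
    "http://example.com/b": "search engine index of web pages",
}
index = build_index(documents)
```

In this sketch a multi-term query is treated as a conjunction, so search(index, "web index") returns only the documents that contain both terms. A practical engine would additionally store titles and text snippets for the result listing.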
While web pages can be manually selected for crawling, such manual selection becomes impracticable as the number of web pages grows. Moreover, to keep within the capacity limits of the crawler, web pages should be added to or removed from crawl cycles to ensure acceptable crawler performance. For instance, as of the end of 2003, the WWW is believed to include well in excess of 10 billion distinct documents or web pages, while a search engine may have a crawling capacity of less than half that number of documents.
Therefore, what is needed is a system and method of automatically selecting and scheduling documents for crawling based on one or more selection criteria. Such a system and method should be able to assess the stature (e.g., page rank) of a web page and schedule the web page for crawling as appropriate based on its stature.
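As a non-limiting sketch of the kind of selection contemplated above, the following Python example chooses which pages to crawl when the crawler's capacity is smaller than the number of known pages. The function name select_for_crawl, the page_ranks mapping, and the policy of simply taking the highest-ranked pages are assumptions made for illustration, not a description of any specific scheduler.

```python
import heapq


def select_for_crawl(page_ranks, capacity):
    """Return up to `capacity` URLs, highest page rank first.

    `page_ranks` is a hypothetical dict of URL -> stature score.
    """
    # Iterating a dict yields its keys; each URL is ordered by its score.
    return heapq.nlargest(capacity, page_ranks, key=page_ranks.get)


# Hypothetical stature scores for three known pages.
page_ranks = {
    "http://example.com/a": 0.5,
    "http://example.com/b": 0.9,
    "http://example.com/c": 0.1,
}
selected = select_for_crawl(page_ranks, 2)
```

With a capacity of two, the sketch selects the two highest-scoring pages and leaves the third out of the crawl cycle.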