Search engines provide a powerful source of indexed documents from a network, such as the Internet, that can be rapidly searched. To maintain the freshness of the documents in the search engine's index, at least some of the indexed documents need to be rescanned frequently, while all or many of the remaining indexed documents need to be rescanned periodically, though less frequently. Scanning also results in the discovery of new documents, because revised documents contain links to them; frequent rescanning is therefore required to bring new documents into a search engine's index on a timely basis. If the number of indexed documents is large (e.g., billions of documents), accomplishing such scanning in a timely manner requires the use of multiple network crawlers (or web crawlers) operating in parallel.
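The scheduling and partitioning idea described above can be illustrated with a minimal sketch. This is not an implementation from the source; the crawler count, interval values, and function names are invented here purely for illustration. It shows two ingredients: a per-document rescan interval (frequent for high-priority documents, periodic for the rest) and a deterministic hash-based assignment of URLs to parallel crawlers so that no two crawlers duplicate work.

```python
# Hypothetical sketch of rescan scheduling and crawler partitioning.
# All constants and names below are illustrative assumptions, not from the source.
import hashlib

NUM_CRAWLERS = 4              # assumed size of the parallel crawler pool
FREQUENT_INTERVAL_HOURS = 1   # high-priority documents, rescanned frequently
PERIODIC_INTERVAL_HOURS = 24  # remaining documents, rescanned periodically

def assign_crawler(url: str) -> int:
    """Deterministically map a URL to one of the parallel crawlers."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return digest[0] % NUM_CRAWLERS

def rescan_interval_hours(is_high_priority: bool) -> int:
    """Hours to wait before a document should be rescanned."""
    return FREQUENT_INTERVAL_HOURS if is_high_priority else PERIODIC_INTERVAL_HOURS

# Example: every crawler can compute the same assignment independently,
# so the URL space is partitioned without central coordination.
for url in ["http://example.com/a", "http://example.com/b"]:
    print(url, "-> crawler", assign_crawler(url))
```

Because the assignment is a pure function of the URL, each crawler can decide locally which URLs belong to it, which is one common way to partition work among parallel crawlers.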
The host servers of many web sites require a requester to possess one or more cookies in order to gain access to some or all of the documents on those web sites. Cookies are typically implemented as files stored on the requester's computer that indicate the requester's identity or carry other information required by the web site. The terms “cookie” and “cookie file” may be used interchangeably. Cookies may include information such as login or registration identification, user preferences, or any other information that a web host sends to a user's web browser for the web browser to return to the web host at a later time. The many uses of cookies, and the mechanisms for creating, using, invalidating and replacing cookies, are well known to those skilled in the art and are beyond the scope of this document.
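The round trip described above, in which a web host sends a cookie to the browser and the browser returns it on later requests, can be sketched as follows. This is a minimal illustration using Python's standard `http.cookies` module; the header value and cookie name are invented for the example and do not come from the source.

```python
# Minimal sketch of the cookie round trip: parse a Set-Cookie header as a
# client would, store it, and reconstruct the Cookie header returned to the
# host on a later request. The cookie name and value here are illustrative.
from http.cookies import SimpleCookie

# A header as a web host might send it with an HTTP response.
set_cookie_header = "session_id=abc123; Path=/; Max-Age=3600"

# The client parses and stores the cookie...
jar = SimpleCookie()
jar.load(set_cookie_header)

# ...and later returns the stored name/value pairs to the same host.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in jar.items()
)
print(cookie_header)  # session_id=abc123
```

A crawler that must access cookie-protected documents would need equivalent machinery: accept cookies from `Set-Cookie` response headers, persist them, and attach them to subsequent requests for the same host.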
Conventional network crawlers have no facility for obtaining such cookies, or for handling various cookie error conditions. As a result, conventional web crawlers are unable to crawl the full set of pages or documents in web sites that require cookies, thereby reducing the amount of information available through such search engines. In addition, conventional network crawlers have no facility for coordinating a parallel set of network crawlers so that, together, they crawl the full set of pages or documents in web sites that require cookies. There is a need, therefore, for an improved search engine that uses multiple crawlers to access web sites that require cookies.