A search engine is a tool that identifies documents, typically stored on hosts distributed over a network, that satisfy search queries specified by users. Web-type search engines work by storing information about a large number of web pages or documents. These documents are retrieved by a web crawler, which then follows links found in crawled documents so as to discover additional documents to download. The contents of the downloaded documents are indexed, mapping the terms in the documents to identifiers of the documents and the resulting index is configured to enable a search to identify documents matching the terms in search queries. Some search engines also store all or part of the document itself, in addition to the index entries.
In such web-type search engines, web pages can be manually selected for crawling, or automated selection mechanisms can be used to determine which web pages to crawl and which web pages to avoid. A search engine crawler typically includes a set of schedulers that are associated with one or more segments of document identifiers (e.g., URLs) corresponding to documents on a network (e.g., WWW). Each scheduler handles the scheduling of document identifiers for crawling for a subset of the known document identifiers. Using a starting set of document identifiers, such as the document identifiers crawled or scheduled for crawling during the most recent completed crawl, the scheduler removes from the starting set those document identifiers that have been unreachable in one or more previous crawls. Other filtering and scheduling mechanisms may also be used to filter out some of the document identifiers in the starting set, and schedule the appropriate times for crawling others. As such, any number of factors may play a role in filtering and scheduling mechanisms.
Accordingly, a need exists for a generic scheduling process that addresses these variables and allows for customized scheduling of such sources, including gathering content related to a specific source.