Web search engines operate by storing information about web pages in a search engine index and using the indexed information to return search results in response to user search queries. The web page indexing process includes employing web crawlers to retrieve web pages and index information about the crawled web pages. Determining which web pages to crawl and how often to crawl them is a complex problem that can have a significant impact on a search engine. If a search engine crawls web pages too frequently, it places a significant load, in terms of bandwidth and CPU resources, on the content servers serving those web pages. As a result, webmasters may not welcome the search engine's crawlers, especially if competing search engines crawl the same web sites more intelligently and therefore with fewer crawls. Inefficient crawling also increases the operating costs of the search engine. On the other hand, if a search engine does not crawl web pages frequently enough, it will be slow to capture changes to web pages, and its search results will suffer in relevance and freshness. If the search engine also fails to discover new web pages effectively, the relevance and freshness of its search results will be further degraded, which in turn will result in a poorer search experience for end users.
For each web page, the crawl frequency (which determines when the content will be crawled or re-crawled) is often determined primarily by an importance score computed for the web page. Important web pages are therefore scheduled to be crawled often (e.g., daily), while less important web pages are scheduled to be crawled less often (e.g., monthly). In some cases, additional information gathered by the crawler, such as how often the content has been observed to change, may also be taken into account when determining crawl frequency.
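As a rough illustration of importance-driven scheduling, the following minimal sketch maps an importance score to a re-crawl interval between one day and thirty days, and optionally adjusts it by an observed change rate. The function name `crawl_interval`, the score range of 0.0 to 1.0, the linear mapping, and the adjustment factor are all assumptions made for illustration; they are not taken from any particular search engine.

```python
from datetime import timedelta
from typing import Optional

# Hypothetical bounds matching the "daily" and "monthly" examples in the text.
MIN_DAYS = 1.0
MAX_DAYS = 30.0


def crawl_interval(importance: float,
                   change_rate: Optional[float] = None) -> timedelta:
    """Return an illustrative re-crawl interval for one page.

    importance  -- assumed score in [0.0, 1.0]; 1.0 is most important.
    change_rate -- assumed fraction of past crawls that found a change,
                   in [0.0, 1.0], or None if no history is available.
    """
    if not 0.0 <= importance <= 1.0:
        raise ValueError("importance must be in [0.0, 1.0]")
    if change_rate is not None and not 0.0 <= change_rate <= 1.0:
        raise ValueError("change_rate must be in [0.0, 1.0]")

    # Linear mapping: importance 1.0 -> MIN_DAYS, importance 0.0 -> MAX_DAYS.
    days = MAX_DAYS - importance * (MAX_DAYS - MIN_DAYS)

    if change_rate is not None:
        # Shorten the interval for pages that change often and lengthen it
        # for pages that rarely change; the factor lies in [0.5, 1.5].
        factor = 1.5 - change_rate
        days = min(MAX_DAYS, max(MIN_DAYS, days * factor))

    return timedelta(days=days)
```

Under these assumptions, a page with importance 1.0 is scheduled daily and a page with importance 0.0 monthly, while a page of middling importance that changed on every past crawl is scheduled sooner than one with no change history.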
Maintaining a relevant search engine index requires not only crawling new relevant web pages to index new content, but also re-crawling existing web pages to account for content changes, such as in-page content changes or link changes (the most drastic of these being new pages and old, empty, or deleted pages). As such, the crawler is really performing two tasks at once: (1) detecting whether a page has changed, i.e., whether it has new or updated content and links; and (2) gathering the new or updated content and links.
Unfortunately, the current architecture of typical search engines creates a catch-22. To determine whether a web page has any changes that need to be captured, the search engine must first crawl the web page to look for changes. This means that very important pages must be re-crawled constantly, even if they do not change frequently. Each re-crawl that finds no changes wastes crawl bandwidth on both the search engine's servers and the website's servers.
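The catch-22 can be made concrete with a small sketch in which change detection is only possible after the page has already been downloaded. The helper names `recrawl` and `count_wasted_crawls`, the use of a SHA-256 digest to compare page versions, and the injected `fetch` callable are assumptions chosen for illustration, not a description of how any specific crawler works.

```python
import hashlib
from typing import Callable, Tuple


def recrawl(fetch: Callable[[str], bytes], url: str,
            last_digest: str) -> Tuple[bool, str]:
    """Fetch a page and report whether it differs from the last crawl.

    The full download happens before the comparison, so the cost of the
    fetch is paid whether or not the page changed.
    """
    body = fetch(url)
    digest = hashlib.sha256(body).hexdigest()
    return digest != last_digest, digest


def count_wasted_crawls(fetch: Callable[[str], bytes], url: str,
                        rounds: int) -> int:
    """Re-crawl a page `rounds` times and count fetches that found no change."""
    digest = ""  # no previous crawl, so the first fetch always counts as new
    wasted = 0
    for _ in range(rounds):
        changed, digest = recrawl(fetch, url, digest)
        if not changed:
            wasted += 1
    return wasted


if __name__ == "__main__":
    # A stand-in fetcher for a page whose content never changes.
    static_page = lambda url: b"<html>unchanged</html>"
    print(count_wasted_crawls(static_page, "http://example.com/", 5))
```

In this sketch, an important but static page that is re-crawled five times yields one useful fetch (the initial one) and four fetches that consume bandwidth on both sides without capturing anything new.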