Web crawling is a well-studied problem. The crawling problem has three main aspects: discovery of new URLs, acquisition of the content associated with a subset of the discovered URLs, and periodic synchronization of previously acquired pages to maintain freshness. Prior work on content acquisition has focused on ordering pages according to a query-independent notion of page importance. See, for example, S. Abiteboul, M. Preda, and G. Cobena, Adaptive On-line Page Importance Computation, In Proceedings of WWW, 2003; J. Cho, H. Garcia-Molina, and L. Page, Efficient Crawling Through URL Ordering, Computer Networks and ISDN Systems, 30(1-7):161-172, 1998; and M. Najork and J. L. Wiener, Breadth-First Search Crawling Yields High-Quality Pages, In Proceedings of WWW, 2001. In particular, web page fetching has been prioritized by query-independent features such as link-based importance measures, for example PageRank. Unfortunately, query-independent importance measures do not provide the best prioritization policy for a search engine crawler.
The problem with using a query-independent importance measure for crawl prioritization is that the crawler accumulates content only on well-established topics whose pages have many links. Topics with few inbound links are under-represented, even when users search for them. Yet the number of tail queries, that is, queries that lie in the tail of the query frequency distribution, seen by search engines today is too large to ignore. Other approaches to crawl prioritization include focused crawling. See, for example, S. Chakrabarti, M. Van den Berg, and B. Dom, Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery, In Proceedings of WWW, 1999. However, focused crawling scours the Web in search of pages relevant to a particular topic or a small set of topics. Such crawling is guided by topic classification rather than by relevance to the queries that users actually issue.
What is needed is a way to bias web crawling toward fetching web pages on any topic that users request and for which the search engine does not currently have enough relevant, high-quality content.