Refreshing web pages is a common procedure that web crawlers perform to update the content indexed for use by search engines responding to search queries. Modern search engines typically rely on incremental web crawlers to feed content into various indexing and analysis layers, which in turn may provide content to a ranking layer that handles user search queries. In general, the crawling layer of a web crawler may download new web pages and refresh web pages that have changing content. Refreshing web pages very frequently may keep the indexed content of the web pages up to date, but may place an unacceptable burden on the web crawler and may leave few resources available for discovering and downloading new web pages with content not yet indexed.
Although functional, existing refreshing techniques may not efficiently ensure adequate freshness of indexed web page content. First, current web page refresh techniques may fail to be selective and may not target important and persistent information. Without a focus on important and long-lasting content, web pages may be refreshed merely to capture unimportant and ephemeral content such as advertisements or the “quote of the day”, resulting in a waste of web crawler resources. Second, current web page refresh techniques may fail to be adaptive and may not react to shifting web page change behavior. Refresh techniques may assume static web page change behavior, which may result in under-refreshing or over-refreshing a web page over time. Third, current web page refresh techniques may employ global coordination to schedule resources for refreshing web pages and may therefore fail to ensure scalability with minimal overhead. Modern web crawlers may apply a high degree of parallel processing by deploying hundreds or thousands of nodes, and such global coordination for resource allocation and/or scheduling may be inefficient.
The web page refreshing problem has been studied in the past, starting with simple page change models (e.g., Poisson update process), objective functions (e.g., binary freshness), and adaptivity. See, for example, J. Cho and H. Garcia-Molina, Synchronizing a Database to Improve Freshness, In Proceedings of ACM SIGMOD, 2000; E. Coffman, Z. Liu, and R. R. Weber, Optimal Robot Scheduling for Web Search Engines, Journal of Scheduling, 1, 1998; and J. Edwards, K. S. McCurley, and J. A. Tomlin, An Adaptive Model for Optimizing Performance of an Incremental Web Crawler, In Proceedings of the World Wide Web, 2001. Others have studied time-dependent change models and objective functions that take into account search result ranking. See, for example, S. Pandey and C. Olston, User-centric Web Crawling, In Proceedings of the World Wide Web, 2005; and J. Wolf, M. Squillante, P. S. Yu, J. Sethuraman, and L. Ozsen, Optimal Crawling Strategies for Web Search Engines, In Proceedings of the World Wide Web, 2002. Unfortunately, each of these prior models fails to take into account the longevity of information, and almost all prior work formulates a global optimization problem and proposes a solution based on some kind of offline optimization procedure.
What is needed is a way to adaptively refresh a web page. Such a system and method should be able to apply a web page refresh strategy that may be selective, adaptive and local, with minimal cross-node communication among processing nodes executing web page refresh scheduling in a distributed system.