The Internet has grown to allow the user to access a plethora of information and services: from reading the latest news and watching movies on-line, to checking a bank account balance through on-line banking, to ordering an airline ticket or a meal from a corner Chinese take-out place. In some situations, the user knows the particular web site that she wishes to access. For example, when the user wishes to do her on-line banking with the Royal Bank of Canada, the user knows to access the web site www.rbc.com. In other circumstances, the user may not be aware of a particular web site that addresses her needs and she may need to perform what is known as a web search using one of the search engines, such as YANDEX, GOOGLE, YAHOO! or the like. As is known, the user enters a search query and the search engine provides a list of web resources that are responsive to the search query in what is known as a Search Engine Results Page or SERP, for short.
As is also known in the art, in order to be able to include a particular web resource in the SERP, the search engine needs to “visit” the web resource and to index the information contained therein. This process is generally known in the art as “crawling” and the module associated with the search engine server responsible for the indexing is generally known as a “crawler” or a “robot”.
Naturally, new web resources appear every day in ever-increasing numbers. It is a well-established fact that none of the commercially available search engines is able to crawl every web resource as soon as it appears. This is due to the limited resources available at each of the search engines. After all, the search engine is typically a business venture and needs to operate its business in a prudent and cost-effective manner; hence, there is no such thing as an unlimited supply of computational power or equipment at any given search engine.
What tends to exacerbate the problem is that the content of web resources changes from time to time. The frequency of this change may vary from one web resource to another: it may be relatively fast (for example, a news portal may update its content several times in a given day) or relatively slow (for example, a home page of a major bank may rarely be updated and, even when updated, the changes are mostly cosmetic in nature), but the content does change nevertheless.
Therefore, it is known in the art to create a crawling schedule, which crawling schedule is followed by the crawler when crawling new web resources or re-crawling previously crawled web resources for updated content. Generally speaking, the crawling schedule is a strategy of a crawler to choose URLs to visit (or revisit) from a crawling queue. As such, the crawling schedule is known to prescribe to the crawler: (i) when to download newly discovered web pages not represented in the search engine index and (ii) when to refresh copies of pages that are likely to have important updates and, therefore, to differ from the content saved in the search engine index.
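By way of illustration only, the notion of a crawling schedule that selects URLs from a crawling queue can be sketched as a priority queue keyed by the time at which each URL becomes due. The class and method names (`CrawlSchedule`, `add_new`, `add_recrawl`, `next_url`), the use of the discovery time as the priority of a new page, and the use of an expected change interval for re-crawls are all assumptions made for this sketch; they are not taken from any particular search engine.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class QueueEntry:
    due_time: float                       # earlier due time is served first
    url: str = field(compare=False)
    is_new: bool = field(compare=False)   # True: never crawled; False: re-crawl


class CrawlSchedule:
    """Hypothetical crawling queue ordered by due time."""

    def __init__(self):
        self._heap = []

    def add_new(self, url, discovered_at):
        # Assumption: a newly discovered page is due as soon as it is found.
        heapq.heappush(self._heap, QueueEntry(discovered_at, url, True))

    def add_recrawl(self, url, last_crawled_at, expected_change_interval):
        # Assumption: a known page is due once it is expected to have changed.
        due = last_crawled_at + expected_change_interval
        heapq.heappush(self._heap, QueueEntry(due, url, False))

    def next_url(self):
        # Return the entry with the earliest due time, or None if the queue is empty.
        return heapq.heappop(self._heap) if self._heap else None
```

In this sketch a single queue serves both purposes (i) and (ii); a practical scheduler would likely weigh new pages against re-crawls using some measure of importance rather than due time alone.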
U.S. Pat. No. 7,899,807 published on Mar. 1, 2011 to Olsten et al. discloses an improved system and method for crawl ordering of a web crawler by impact upon search results of a search engine. Content-independent features of uncrawled web pages may be obtained, and the impact of uncrawled web pages may be estimated for queries of a workload using the content-independent features. The impact of uncrawled web pages may be estimated for queries by computing an expected impact score for uncrawled web pages that match needy queries. Query sketches may be created for a subset of the queries by computing an expected impact score for crawled web pages and uncrawled web pages matching the queries. Web pages may then be selected to fetch using a combined query-based estimate and query-independent estimate of the impact of fetching the web pages on search query results.
U.S. Pat. No. 7,672,943 published on Mar. 2, 2010 to Wong et al. teaches a web crawler system that utilizes a targeted approach to increase the likelihood of downloading web pages of a desired type or category. The system employs a plurality of URL scoring metrics that generate individual scores for outlinked URLs contained in a downloaded web page. For each outlinked URL, the individual scores are combined using an appropriate algorithm or formula to generate an overall score that represents a downloading priority for the outlinked URL. The web crawler application can then download subsequent web pages in an order that is influenced by the downloading priorities.
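As a non-limiting illustration of combining individual URL scores into an overall downloading priority, the following sketch uses a weighted sum. The weighted sum, the two example metrics (`url_depth_score`, `keyword_score`) and the weights are assumptions chosen for this sketch only; the referenced patent speaks more generally of "an appropriate algorithm or formula".

```python
def url_depth_score(url):
    # Hypothetical metric: shallower paths score higher (1.0 for the site root).
    path = url.split("://", 1)[-1]
    depth = path.rstrip("/").count("/")
    return 1.0 / (1 + depth)


def keyword_score(url, keyword="news"):
    # Hypothetical metric: 1.0 if the URL mentions the desired category keyword.
    return 1.0 if keyword in url.lower() else 0.0


def overall_score(url, weighted_metrics):
    # Assumed combining formula: weighted sum of the individual metric scores.
    return sum(weight * metric(url) for metric, weight in weighted_metrics)


def order_outlinks(outlinks, weighted_metrics):
    # Highest overall score first, i.e. highest downloading priority first.
    return sorted(outlinks,
                  key=lambda u: overall_score(u, weighted_metrics),
                  reverse=True)


if __name__ == "__main__":
    metrics = [(url_depth_score, 0.5), (keyword_score, 0.5)]
    links = ["http://example.com/a/b/c",
             "http://example.com/news",
             "http://example.com/"]
    for link in order_outlinks(links, metrics):
        print(link, overall_score(link, metrics))
```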
US patent application 2012/0303606 published on Nov. 29, 2012 to Cai et al. discloses web crawling policies that are generated based on user web browsing statistics. User browsing statistics are aggregated at the granularity of resource identifier patterns (such as URL patterns) that denote groups of resources within a particular domain or website that share syntax at a certain level of granularity. The web crawl policies rank the resource identifier patterns according to their associated aggregated user browsing statistics. A crawl ordering defined by the web crawl policies is used to download and discover new resources within a domain or website.
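Purely as an illustrative sketch of aggregating browsing statistics at the granularity of URL patterns, the following groups visited URLs by host and first path segment and ranks the resulting patterns by total visit count. The pattern definition, the function names (`url_pattern`, `rank_patterns`) and the use of raw visit counts as the browsing statistic are assumptions made for this sketch, not details taken from the referenced application.

```python
from collections import Counter
from urllib.parse import urlsplit


def url_pattern(url):
    # Assumed pattern granularity: host plus first path segment, then a wildcard.
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if segments:
        return f"{parts.netloc}/{segments[0]}/*"
    return f"{parts.netloc}/*"


def rank_patterns(visits):
    # visits: iterable of (url, visit_count) pairs from hypothetical browsing logs.
    counts = Counter()
    for url, n in visits:
        counts[url_pattern(url)] += n
    # Patterns with the most aggregated visits come first in the crawl ordering.
    return [pattern for pattern, _ in counts.most_common()]
```

Under this sketch, a crawler would favor URLs matching the top-ranked patterns when deciding which newly discovered resources of a website to download first.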