Search engine providers download information about web pages and other resources so that the information can later be searched by search engine users. Web crawlers are employed to download the information, but it is not feasible for a web crawler to download and store all content from all websites on the Internet. Web crawl policies therefore attempt to identify Internet information resources (such as web pages, media files, and others) that are most interesting to users, so that the “interesting” information can be given priority in downloading and storing. In practice, this often means identifying the Uniform Resource Locators (URLs), Uniform Resource Identifiers (URIs), or other types of resource identifiers associated with interesting resources within one or more websites. This is done so that information about the interesting resources can be downloaded and stored for later search engine retrieval. Various web crawl ordering policies differ in part in how they identify interesting resources.
Many web crawl ordering policies attempt to determine “interesting” web pages based on web-link structure. For example, a “breadth-first” policy is based on an assumption that pages located within a few links of a website's portal are more likely to be interesting to users. Other strategies, such as “in-degree” and “PageRank®”, take into account more sophisticated information about navigational structure. Some policies prioritize web pages according to the relevance of their content to predefined semantic topics. Other policies consider the rank of a web page in returned search engine query results and the number of actual user clicks that the web page receives in those results (e.g., URLs with many clicks and higher ranks are promoted in crawl order). Still other approaches attempt to optimize the crawl policy on a per-website basis.
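As a purely illustrative sketch, two of the link-structure policies mentioned above can be expressed over a small in-memory link graph. The graph `LINKS`, the function names, and the tie-breaking rule are assumptions made for this example and are not taken from any particular crawler. The breadth-first function orders pages by link distance from the portal, and the in-degree function orders pages by how many distinct pages link to them.

```python
from collections import deque


def breadth_first_order(links, portal):
    """Return URLs ordered by link distance from the portal page."""
    seen = {portal}
    order = []
    queue = deque([portal])
    while queue:
        url = queue.popleft()
        order.append(url)
        # Enqueue each outbound link the first time it is discovered.
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


def in_degree_order(links):
    """Return URLs sorted by number of distinct inbound links, highest first."""
    counts = {}
    for source, targets in links.items():
        counts.setdefault(source, 0)
        # Count each source page at most once per target.
        for target in set(targets):
            counts[target] = counts.get(target, 0) + 1
    # Ties are broken alphabetically (an assumption, chosen for determinism).
    return sorted(counts, key=lambda url: (-counts[url], url))


# Hypothetical toy website: each key links to the URLs in its list.
LINKS = {
    "/": ["/news", "/about"],
    "/news": ["/news/1", "/about"],
    "/about": [],
    "/news/1": ["/about"],
}

if __name__ == "__main__":
    print(breadth_first_order(LINKS, "/"))
    print(in_degree_order(LINKS))
```

On this toy graph the two policies disagree: breadth-first crawls the portal first, whereas in-degree promotes "/about" because three pages link to it. This illustrates how ordering policies differ in what they treat as interesting.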
The Internet is increasingly dynamic. For example, much modern website content is user-generated, websites are organized with deep link levels, and web pages are created and retired much more rapidly than in the past. Conventional web crawl policies do not handle this dynamism well.