The World Wide Web (WWW) can be regarded as a huge data repository with very significant business value. Companies active in the internet field therefore need tools to create this value out of the resources available on the web. These companies may provide services dedicated to individual users (search engines, for instance) or to other companies in a business-to-business (BtoB) model, such as gathering marketing data in a particular business field.
In order to analyze information and extract value from it, a first and mandatory step is to retrieve the information available on the web and to build from it a “web corpus”, i.e. a set of resources on which dedicated computer programs will be run. Such web corpuses may be general, as in the case of a generic search engine, or narrowed to a given business area or theme.
Retrieving information, e.g. resources (web pages, multimedia files, etc.), from the web is a time-consuming task. Retrieving a single resource may take from hundreds of milliseconds to several seconds. This delay is also unpredictable, as it depends on the health of the website and of the underlying communication networks.
Also, there is no global view of the resources available on the web. In order to build such a view, for instance to reply to a query entered by a user of a search engine, an iterative process must be performed: first resources are visited, then the resources referred to by these resources, and so on, until a sufficient view of the web is considered to have been obtained.
During this process the delays accumulate, and the final delay before the user's request can be answered is not reasonable.
Web crawlers have been introduced to spare search engines, and any other computer programs that need to access a large number of resources, this delay.
Web crawlers are programs used to find, explore and download resources available on websites of the Web so as to constitute a corpus, i.e. a set of resources that can be used by other programs. They are also called ants, bots, web spiders, etc. In the following, they will be referred to as “web crawlers” or more simply as “crawlers”.
More precisely and in general, a crawler starts with a list of URLs (Uniform Resource Locators) to visit, called “seeds”. As the crawler visits the resources identified by these URLs, it identifies all the URLs contained in each resource (in the form of hyperlinks) and adds them to the list of URLs to visit. These URLs are then recursively visited, while the corresponding resources are downloaded to progressively build a web crawl.
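The visiting loop described above can be illustrated by the following minimal sketch. It is an illustration under stated assumptions, not an implementation of any particular crawler: the names `crawl`, `LinkExtractor`, `FAKE_WEB` and the parameter `max_resources` are hypothetical, and the download step is replaced by a lookup in an in-memory dictionary so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_resources=100):
    """Visit URLs breadth-first from the seeds; return {url: content}."""
    frontier = deque(seeds)   # URLs still to visit
    seen = set(seeds)         # URLs already queued, to avoid revisiting them
    web_crawl = {}            # the stored digital contents
    while frontier and len(web_crawl) < max_resources:
        url = frontier.popleft()
        content = fetch(url)
        if content is None:   # resource unavailable: skip it
            continue
        web_crawl[url] = content
        parser = LinkExtractor()
        parser.feed(content)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative hyperlinks
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return web_crawl


# Hypothetical in-memory "web" standing in for real HTTP downloads.
FAKE_WEB = {
    "http://a.example/": '<a href="/page1">1</a> <a href="http://b.example/">b</a>',
    "http://a.example/page1": '<a href="/">home</a>',
    "http://b.example/": "no links here",
}

result = crawl(["http://a.example/"], FAKE_WEB.get)
```

In this sketch a single seed is enough to reach the three resources of the toy web, since each is referenced by a hyperlink in an already visited resource. A real crawler would replace `FAKE_WEB.get` with an HTTP download, which is where the delays discussed above arise.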
A web crawl is here defined as the digital contents stored by the web crawler.
These web crawlers are prominently used by search engines, as shown in FIG. 1.
A web crawler WC crawls the Web and builds a web crawl WCD, which is a repository of downloaded resources. An indexing program IDP uses this web crawl WCD in order to build an index ID.
This indexing program IDP may comprise a processing pipeline that analyzes the raw resources of the web crawl WCD and transforms them into “objects” compliant with a format better suited to indexing. For instance, it may remove parts of the content of certain downloaded resources (advertisement banners, images, etc.) and/or look for certain data inside the downloaded resources and put them in specific fields of the objects to be indexed.
The indexing program IDP also processes the “objects” or the raw resources to store items associated with them so as to speed up the processing of queries.
When a user U submits a query to a search engine SE, the search engine looks into the index ID to retrieve items which match the criteria of the query. These items are then presented to the user U, who can choose whether or not to download the resources corresponding to the presented items (for instance by clicking on a hyperlink associated with an item).
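As a rough illustration of the chain from web crawl to index to query, the following sketch builds a simple inverted index over stored resources and answers a query from it. The text does not specify the structure of the index ID; the inverted index, the crude tag removal and the names `to_object`, `build_index` and `search` are all assumptions chosen for brevity.

```python
import re
from collections import defaultdict


def to_object(url, raw):
    """Pipeline step: strip markup and keep the fields to be indexed."""
    text = re.sub(r"<[^>]+>", " ", raw)   # crude tag removal, for illustration
    return {"url": url, "text": text.lower()}


def build_index(web_crawl):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, raw in web_crawl.items():
        obj = to_object(url, raw)
        for word in re.findall(r"[a-z0-9]+", obj["text"]):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that match every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


# Hypothetical stored web crawl: URL -> downloaded content.
web_crawl = {
    "http://a.example/": "<h1>Web crawlers</h1><p>Crawlers download resources.</p>",
    "http://b.example/": "<p>Search engines query an index.</p>",
}
index = build_index(web_crawl)
matches = search(index, "download resources")
```

The query is answered entirely from the index, without contacting any website, which is the decoupling that the web crawl is meant to provide.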
The web crawl WCD can also be used by other computer programs Prog, such as batch analysis programs, for instance by means of graph modeling.
Therefore, web crawlers make it possible to decouple resource retrieval from processing and applications. Delays due to resource retrieval from the web impact neither the responsiveness of the computer programs Prog, SE nor the real-time synthesis of the index ID. The delays only impact the information available at a certain time (i.e. the downloaded resources). More specifically, they affect the time needed for a change in the corpus (new resource, deleted resource or modified resource) to become visible in the index ID.
This means that applications are not directly dependent on the scheduling of the resource retrieval task performed by the crawler. Delays and time constraints linked to this task may only impact the amount of information (i.e. downloaded resources) available at a certain time, as well as its age and freshness.
It also means that web crawlers make it possible to build meta-data over the data downloaded from the web. More precisely, a single index field may require information that is not found in a single resource but is provided by the analysis of multiple resources. For example, the PageRank algorithm of the company Google uses a graph representation of the hyperlinks between resources. Building this graph requires an examination of every resource of the corpus.
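To illustrate why such meta-data requires the whole corpus, the following sketch computes a score for each resource from a hyperlink graph, using the textbook power-iteration form of PageRank. It is a simplified illustration, not Google's actual implementation; the function name `pagerank`, the damping factor 0.85, the iteration count and the four-node graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {url: [urls it points to]}; returns {url: score}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this resource's rank among the resources it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Resource without outgoing links: spread its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical hyperlink graph extracted from a stored web crawl.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
scores = pagerank(links)
```

The score of a resource depends on the hyperlinks found in all the other resources, so it cannot be computed from that resource alone; with a stored web crawl, the graph can be built without downloading every resource again.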
In general, the construction of an index requires multiple accesses to the same resource. Without a web crawl, the delay to retrieve a resource from the web would be incurred several times.
Also, it is sometimes necessary to change the structure of the index, in part or entirely. In order to avoid the delays of resource retrieval, the indexing program IDP can use the downloaded resources available in the web crawl WCD instead of downloading them again from the web.
Despite this decoupling, the delays involved in the web crawling task remain a bottleneck, and work has been undertaken either to reduce the time needed to reflect in a web crawl the changes within a web corpus, or to focus the web crawling on the most relevant changes first.
However, these efforts mainly address the issue of capturing changes within a web corpus and reflecting them in the web crawl with the smallest delay.
They do not address the problem of initially building a new web corpus.
Web crawling remains a very slow process for at least the following reasons:
- There is a limited crawl frequency authorized by “netiquette”: in order to avoid overloading websites with traffic linked to web crawlers, it is generally admitted that a crawler will access the same host website (or host) less frequently than once every 2.5 seconds. In addition, websites may enforce their own policy and may even refuse to serve a crawler that exceeds the admitted frequency. In such cases, the crawler may be temporarily or permanently barred from accessing the website again.
- As mentioned earlier, websites generally take hundreds of milliseconds to seconds to answer a request.
- The crawling process is not parallelizable. URLs found on a resource are often used to determine new resources to visit. In this case, resources cannot be downloaded in parallel and delays add up.
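The effect of the crawl frequency limit can be made concrete with the following sketch, which computes the earliest time at which each request may be sent when the same host is contacted at most once every 2.5 seconds. The function `schedule` and the example URLs are hypothetical, and the sketch assumes, for simplicity, that requests are answered instantaneously and that different hosts can be contacted concurrently.

```python
from urllib.parse import urlparse

MIN_DELAY = 2.5  # seconds between two requests to the same host (from the text)


def schedule(urls, min_delay=MIN_DELAY):
    """Return [(start_time, url)] respecting the per-host minimum delay."""
    next_allowed = {}   # host -> earliest time of its next request
    plan = []
    for url in urls:
        host = urlparse(url).netloc
        start = next_allowed.get(host, 0.0)
        plan.append((start, url))
        next_allowed[host] = start + min_delay
    return plan


# Four resources on one host and one resource on another host.
urls = [f"http://a.example/p{i}" for i in range(4)] + ["http://b.example/"]
plan = schedule(urls)
total = max(start for start, _ in plan)   # time at which the last request starts
```

Here the four requests to the same host are spread over 7.5 seconds, while the request to the other host can start immediately. At the same rate, one million resources hosted on a single website would require at least 2.5 million seconds, i.e. roughly 29 days, before any server response time is taken into account.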
On top of that, even if a web corpus is needed only in a narrow area, the web crawling process has to consider a very large number of resources, including resources not related to this narrow area. The reason for this is that the crawling process is a non-selective one:
- For applications where only a subset of the web is interesting, a crawl of the whole web is still required because interesting resources may be referenced by non-interesting resources. In other words, if uninteresting resources are filtered out, many interesting resources may be overlooked.
- The decision whether a resource is interesting or not can only be taken after the resource has been crawled, because the information provided by the resource's URL and by the resource that references it is less than the information provided by the resource itself.
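The first point can be illustrated on a toy example. In the following sketch, the link graph, the topic labels and the function `reachable` are invented for illustration; the sketch compares a crawl that follows the hyperlinks of every resource with one that ignores the hyperlinks found in uninteresting resources.

```python
from collections import deque

# Hypothetical link graph and topic labels, invented for illustration.
LINKS = {
    "seed": ["news", "sports"],
    "news": ["finance-1"],
    "sports": ["finance-2"],   # off-topic resource linking to an on-topic one
    "finance-1": [],
    "finance-2": [],
}
INTERESTING = {"seed", "news", "finance-1", "finance-2"}


def reachable(seed, expand_uninteresting):
    """Breadth-first visit; optionally ignore links of off-topic resources."""
    seen, frontier = {seed}, deque([seed])
    while frontier:
        page = frontier.popleft()
        if page not in INTERESTING and not expand_uninteresting:
            continue   # resource was downloaded, but its links are ignored
        for target in LINKS[page]:
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return seen & INTERESTING


full = reachable("seed", expand_uninteresting=True)
filtered = reachable("seed", expand_uninteresting=False)
missed = full - filtered   # interesting resources lost by filtering
```

In this toy graph the resource "finance-2" is only referenced by the uninteresting resource "sports", so the filtered crawl never discovers it. Note also that "sports" still has to be downloaded before it can be classified as uninteresting, which corresponds to the second point above.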
It would be possible to multiply the hardware resources to reduce the time needed to build a web crawl. However, this solution is very costly and does not scale. Also, as there exist dependencies between the tasks of the process, it is not an entirely satisfactory solution in terms of time saved: even with infinite processing resources, it would take months to crawl a substantial portion of the web. This has been shown for example in the article “Accessibility of Information on the Web” by Steve Lawrence and C. Lee Giles, published in Nature, vol. 400, pp. 107-109, 1999.