The enormous popularity of the World Wide Web, referred to simply as the Web, has made a vast amount of information available. However, without applications that process that information and extract what is useful, its full benefit cannot be realized. Accordingly, several applications have become available that process the available information and provide useful insights with regard to it. An essential part of many such applications is crawling, which is hypertext resource discovery.
The process of crawling may be divided into the following steps:
1. Providing to the crawler application a set of Uniform Resource Locators (URLs), called seed URLs, and an integer k, known as the depth. The seed URLs are placed by the crawler application in a queue of un-crawled URLs.
2. For each URL in the queue of un-crawled URLs, the crawler application performs the steps of:
    a) fetching the Web page associated with the URL;
    b) extracting all the hyperlinks present in the fetched Web page;
    c) passing the Web pages referenced by the hyperlinks through exclusion/inclusion patterns;
    d) determining whether the Web pages that passed through the exclusion/inclusion patterns were already crawled, and if not, placing the URLs of the un-crawled Web pages into the queue of un-crawled URLs.
The set of seed URLs along with depths are derived manually. The inclusion/exclusion patterns, which act as filters on the Web pages to determine their relevance, are also derived manually and provided to the crawler application. As an example, the term “.gif” may be designated as an exclusion pattern, which will ensure that GIF images are not fetched during the crawling process.
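The filtering role of the inclusion/exclusion patterns can be illustrated with a minimal Python sketch. It assumes that patterns are plain substrings matched against the URL and that exclusion takes precedence over inclusion; the function name `passes_patterns` and the example URLs are hypothetical and are not taken from any particular crawler.

```python
def passes_patterns(url, include_patterns, exclude_patterns):
    """Return True if the URL should be kept for crawling.

    Assumption: patterns are plain substrings of the URL, and an
    exclusion match always wins over an inclusion match.
    """
    # Any exclusion match rejects the URL outright.
    if any(pattern in url for pattern in exclude_patterns):
        return False
    # With no inclusion patterns, everything not excluded is kept.
    if not include_patterns:
        return True
    return any(pattern in url for pattern in include_patterns)


# Hypothetical hyperlinks extracted from a fetched page.
links = [
    "http://example.com/news/index.html",
    "http://example.com/images/logo.gif",
    "http://example.com/about.html",
]

# ".gif" as an exclusion pattern keeps GIF images out of the queue.
kept = [u for u in links if passes_patterns(u, ["example.com"], [".gif"])]
print(kept)
```

A practical crawler might use regular expressions rather than substrings; the substring form is used here only because it mirrors the ".gif" example.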
It is noted that the crawling process terminates once the given depth for each seed URL is reached, as the already crawled Web pages are not put back in the queue of un-crawled URLs once the given depth is reached for that seed URL. This strategy of crawling may be referred to as blind crawling, as crawling involving this strategy collects all the Web pages that are accessible from the set of seed URLs at a distance equal to or less than the given depth for each seed URL.
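The blind crawling strategy described above can be sketched as a breadth-first traversal bounded by the depth. The sketch below is illustrative only: it replaces real page fetching with a hypothetical in-memory link graph (`LINK_GRAPH`), uses a single depth for all seed URLs rather than one depth per seed, and treats exclusion patterns as plain substrings.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for the Web:
# each URL maps to the hyperlinks found on its page.
LINK_GRAPH = {
    "http://a.example/": ["http://a.example/1", "http://a.example/logo.gif"],
    "http://a.example/1": ["http://a.example/2", "http://a.example/"],
    "http://a.example/2": ["http://a.example/3"],
    "http://a.example/3": [],
}


def fetch_links(url):
    """Stand-in for fetching a page and extracting its hyperlinks."""
    return LINK_GRAPH.get(url, [])


def blind_crawl(seeds, depth, exclude_patterns=()):
    """Collect every URL reachable from the seeds within `depth` links."""
    crawled = set()
    queued = set(seeds)  # URLs that have ever been placed in the queue
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, distance = queue.popleft()
        crawled.add(url)
        if distance == depth:
            continue  # depth reached: links on this page are not followed
        for link in fetch_links(url):
            if any(pattern in link for pattern in exclude_patterns):
                continue  # rejected by an exclusion pattern
            if link not in queued:  # never queue the same URL twice
                queued.add(link)
                queue.append((link, distance + 1))
    return crawled


result = blind_crawl(["http://a.example/"], depth=2, exclude_patterns=[".gif"])
print(sorted(result))
```

With a depth of 2, the page at `http://a.example/3` is three links from the seed and is therefore never queued, which is how the crawl terminates.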
However, due to the exponential growth of the Web, blind crawlers are challenged by its scale. Accordingly, methods are required that can crawl the Web pages of interest intelligently, using minimal computing and network resources.
Two variations to the blind crawling strategy that have proven useful for several applications in the attempt to minimize the computing and networking resources are as follows:
1. In several applications, only a fraction of the crawled pages are relevant for the application. Techniques named “focused crawling” have been proposed to reduce the computing and networking resource requirements in such scenarios. Rather than collecting all the Web pages accessible through the set of seed URLs, a focused crawler finds the hyperlinks that are likely to be most relevant for the application at hand. Several methods have been proposed to select the hyperlinks that are likely to be relevant from a Web page;
2. Several applications require the set of Web pages already crawled to be crawled again at regular intervals in order to ensure that the information, and the crawler's knowledge of the hyperlinks on those Web pages, are up-to-date. These applications crawl the set of Web pages of interest and then re-fetch each Web page separately at the refresh rate determined for that Web page. The refresh rate for any Web page is typically determined based on the history of the changes on that Web page. In this setting, re-fetching a Web page involves making a connection to the URL of the Web page and then fetching the page. If the links on the crawled pages are not fetched, then this process may be referred to as “re-fetching”. If all the Web pages that are accessible from the crawled Web page are also crawled, then this process may be referred to as “re-crawling”.
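As an illustration of how a refresh rate might be derived from the change history of a Web page, the following Python sketch estimates the re-fetch interval as the mean gap between observed changes. This particular estimator, the one-day default, and the names `estimate_refresh_interval` and `due_for_refetch` are assumptions made for the example; the text above does not prescribe a specific formula.

```python
from datetime import datetime, timedelta


def estimate_refresh_interval(change_times, default=timedelta(days=1)):
    """Estimate a re-fetch interval as the mean gap between observed changes.

    Assumption: with fewer than two observed changes there is no gap to
    average, so a default interval is returned instead.
    """
    if len(change_times) < 2:
        return default
    ordered = sorted(change_times)
    # Mean gap = total span divided by the number of gaps.
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)


def due_for_refetch(last_fetched, interval, now):
    """Return True when at least one interval has elapsed since the last fetch."""
    return now - last_fetched >= interval


# Hypothetical change history for one page: gaps of 2 and 4 days.
history = [datetime(2024, 1, 1), datetime(2024, 1, 3), datetime(2024, 1, 7)]
interval = estimate_refresh_interval(history)
print(interval)  # 3 days, 0:00:00
```

A page whose content changes more often yields a shorter interval and is therefore re-fetched more frequently.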
The above strategies have led to significant savings in computing and networking resources, and have helped to keep the crawls more up-to-date.
The URLs associated with hyperlinks on a Web page may be categorized into two types, namely:
1) URLs that are present always or for relatively long durations, and are therefore referred to as stable URLs; and
2) URLs that exist only for a short duration, referred to as temporary URLs.
It is noted that the content of stable URLs could still change with time. In contrast, temporary URLs exist only for a short time on a Web page. To illustrate the difference between stable and temporary URLs, consider for example the URL of a home page of a news site. That home page URL would be a stable URL. However, a URL of a Web page containing a news item will be a temporary URL, as that URL will only exist for a short time while that news item has relevance. A set of new temporary URLs is typically created each day, with those URLs containing news items of that day.
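The distinction between stable and temporary URLs can be illustrated by comparing the hyperlinks observed on the same Web page across several fetches. The sketch below labels a URL as stable when it appears in at least a given fraction of the snapshots. The 0.8 threshold, the function name `classify_urls`, and the example URLs are hypothetical choices for illustration, not a method taken from the text.

```python
def classify_urls(snapshots, min_fraction=0.8):
    """Label each URL 'stable' or 'temporary' by how often it appears.

    `snapshots` is a list of hyperlink lists, one per fetch of the same
    page. Assumption: a URL present in at least `min_fraction` of the
    snapshots is treated as stable.
    """
    counts = {}
    for links in snapshots:
        for url in set(links):  # count each URL once per snapshot
            counts[url] = counts.get(url, 0) + 1
    total = len(snapshots)
    return {
        url: "stable" if count / total >= min_fraction else "temporary"
        for url, count in counts.items()
    }


# Hypothetical hyperlinks seen on a news site's home page on three days.
snapshots = [
    ["http://news.example/sports", "http://news.example/story-101"],
    ["http://news.example/sports", "http://news.example/story-102"],
    ["http://news.example/sports", "http://news.example/story-103"],
]
labels = classify_urls(snapshots)
print(labels["http://news.example/sports"])     # stable
print(labels["http://news.example/story-101"])  # temporary
```

The section link appears in every snapshot, while each news item link appears only on the day it was published.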
Many applications require focused crawling followed by re-fetching. Focused crawling is required to collect only the pages that are relevant for the application at hand, and re-fetching may be required to ensure that the crawler application is up-to-date. It is important to note that for re-fetching to be meaningful, only the content of Web pages should change and the URLs of Web pages should not change, i.e., Web pages should have stable URLs. If the set of relevant pages contains temporary URLs, then re-fetching cannot be done, as the temporary URLs would cease to exist on the Web page. This makes re-crawling of the sites unavoidable for ensuring an up-to-date crawler application.
While several methods have been proposed to identify potentially relevant URLs in a Web page, focused crawling in general has not been found to be very useful in practice. The main reason for the lack of utility of focused crawling to date is that it is difficult to determine the relevancy of a Web page without considering its content.
US Published Patent Application No. 20050086206 describes a method of “focused crawling” that aims at gathering documents that are relevant to any of the given “focused topics”. The rules used to detect the relevance of a page by that method are predetermined, and remain unchanged during subsequent crawling.
US Published Patent Application No. 20030149694 discloses path-based ranking of unvisited Web pages for Web crawling, by identifying all the paths beginning with a seed URL and leading to visited relevant Web pages as a “good-path set”, and for each unvisited Web page, identifying the paths beginning from the seed URL and leading to such a Web page as a “partial-path set”. All the visited Web pages are then classified, and each Web page is labelled with the label of the class or classes to which it belongs. A statistical model is also trained to generalize the common patterns among all members of the “good-path set”. Finally, the “partial-path set” is evaluated with the statistical model and the unvisited Web pages are ranked according to the evaluation results.
Hence, the method disclosed in that application ranks the unvisited Web pages with relevance scores obtained from the statistical model learnt from the path URLs of relevant pages. The ranked list of Web pages helps a crawler to fetch the relevant pages first.
US Published Patent Application No. 20040030683 discloses a system which conducts subsequent extensive searches (referred to as recrawl) of previously encountered web sites to update a database (e.g., update a web site's respective site map, update the directory of encountered sites, delete a site map, delete a URL from the directory). The system utilizes statistically and/or heuristically determined criteria to conduct subsequent searches to ensure the accuracy of the system's database. The system is suited for searching and retrieving network based content by using metadata about the content.
US Published Patent Application No. 20050138056 discloses a user adding inclusion constraints and exclusion constraints which create a boundary that specifies which documents are sought and which are not. A system is presented that could be used to define a “working set” of the corpus that is a subset of the overall corpus obtained in response to a query. A mechanism for the visual representation of the working set is also provided. That method is not related to crawling.