The Web is the largest information repository ever built. It currently consists of billions of pages and is continuously growing. Because of its size, search engines are indispensable for accessing the information that is relevant to the user. Search engines are complex systems that support several operations on information, such as collecting, storing, managing, locating, and accessing it. The systems that perform the data-gathering task are the crawlers: programs that traverse the web following URLs to obtain the documents that are later indexed by the search engine.
Search engines use crawlers to traverse the Web in order to download web pages and build their indexes. Keeping these indexes up to date is essential to ensure the quality of search results. However, changes in web pages are unpredictable, and identifying the moment a web page changes, as soon as possible and with minimal computational cost, is a major challenge.
Web crawlers are Internet bots programmed to traverse the whole web, or a previously configured set of pages, and store the downloaded web pages in a repository that is later indexed by the search engine. Once a crawl finishes, the crawler must start over to keep the indexed pages up to date. A challenge of this task is that not all pages change at the same rate. Some pages are highly variable, and the interval between two consecutive crawls may be too long to keep the repository updated. To deal with this difficulty, crawling systems implement recrawling policies that decide when it is necessary to revisit and update the content of a web page indexed by the search engine. These recrawling policies are often based on the change history of each web page and on its relevance.
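A history-based recrawling policy of this kind can be sketched as a priority queue keyed by each page's estimated next-change time. The class name, the default interval, and the scoring rule below are illustrative assumptions, not taken from any particular crawler:

```python
import heapq
import time


class RecrawlScheduler:
    """Toy history-based recrawl policy: pages that changed more often
    in the past are scheduled for earlier revisits. All names and the
    scoring rule are illustrative."""

    def __init__(self):
        self._queue = []  # min-heap of (next_visit_time, url)

    def schedule(self, url, change_history, now=None):
        # change_history: timestamps of past observed changes
        now = time.time() if now is None else now
        if len(change_history) >= 2:
            # mean interval between consecutive observed changes
            intervals = [b - a for a, b in
                         zip(change_history, change_history[1:])]
            mean_interval = sum(intervals) / len(intervals)
        else:
            # too little history: fall back to a daily revisit
            mean_interval = 86400.0
        heapq.heappush(self._queue, (now + mean_interval, url))

    def next_due(self):
        # (due_time, url) of the most urgent page, or None if empty
        return self._queue[0] if self._queue else None
```

A page observed to change every 10 seconds is thus revisited long before one that changes every 1000 seconds, which is the essence of a change-history-driven policy.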
The underlying problem in the recrawling process is knowing when the content has changed. Ideally, a crawler would recrawl a page just after it has been modified. If the page is accessed before it has changed, the visit is useless; conversely, the longer the crawler takes to revisit a modified page, the staler, and probably the less relevant to the end user, its indexed copy becomes.
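The trade-off can be made concrete with a small illustrative model (the function and its two-outcome classification are assumptions for exposition, not part of any cited method):

```python
def visit_outcome(change_time: float, visit_time: float):
    """Illustrative model of the recrawl trade-off: a revisit before
    the page changes is wasted work, while a revisit after the change
    leaves the index stale for (visit_time - change_time)."""
    if visit_time < change_time:
        return ("wasted", 0.0)  # page unchanged: useless fetch
    # useful visit, but the index was stale for this long
    return ("useful", visit_time - change_time)
```

A good recrawl policy therefore tries to place `visit_time` as soon as possible after `change_time`, minimizing staleness without wasting fetches.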
Currently, most crawling systems try to guess, estimate, or predict when a web page has changed. Some disclosures present methods to keep search engine repositories updated. For example, U.S. Pat. No. 8,078,974 proposes to detect web page changes using revisitation patterns. The patent proposes to analyze change data to produce a change characterization, the change data reflecting differences between the content of a web page at different times. Revisitation data is analyzed to produce a revisitation characterization, the revisitation data including the times at which users visit the web page. A relationship between the change data and the revisitation data is then determined from these two characterizations. Similarly, U.S. Pat. No. 7,310,632 describes a method based on a decision-theoretic component that determines an appropriate time to crawl a web page and makes predictions about changes to it. The predictive analysis is based on a) the utility of the web page, b) historical data, and c) the content it contains. These disclosures teach keeping the repositories updated by estimating when a web resource may have changed; however, they are not able to detect when a web resource has actually changed.
Other approaches use a distributed architecture, such as the one disclosed in U.S. Pat. No. 5,978,842. In this case, the user registers with the server a web page to be monitored and must install a client-side change detection application. The server assigns a date and time at which the client performs change detection. At the assigned time and date, the client fetches a new copy of the web page and compares it to an archived copy to detect changes. As more users register for a web page, change detection is performed more frequently. This method requires the user to install a client-side change detection application and also requires extra accesses to the web page in order to detect whether any change has occurred, so that the user can be notified of changes in a specific web page.
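The fetch-and-compare step performed by such a client can be sketched as a fingerprint comparison. The helper names and the normalization remark are illustrative assumptions, not details taken from the cited patent:

```python
import hashlib


def fingerprint(html: str) -> str:
    # Hash the page content; a real system might first normalize the
    # HTML (e.g. strip timestamps or ads) so that cosmetic differences
    # do not count as changes.
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def has_changed(archived_html: str, fetched_html: str) -> bool:
    # Compare the freshly fetched copy against the archived one.
    return fingerprint(archived_html) != fingerprint(fetched_html)
```

Comparing fixed-size digests rather than full documents keeps the archived state small, at the cost of not knowing *what* changed, only *that* something did.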