This disclosure relates generally to indexing of web pages in a data processing system and more specifically to partitioning a crawling space for distributed crawling in the data processing system.
Distributed crawling of web applications has been a topic of extensive research for approximately twenty years. See, for example, Web Crawling, 2010, by Christopher Olston and Marc Najork. However, typical proposed solutions focused on crawling of conventional, or non-AJAX, applications. See, for example, UbiCrawler: A Scalable Fully Distributed Web Crawler, Jan. 27, 2003, Paolo Boldi, Bruno Codenotti, Massimo Santini, Sebastiano Vigna; and Distributed Web Crawling over DHTs, 2004, Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy. The term “AJAX” generally refers to a collection of web-based technologies used to implement web applications capable of communicating with a server in the background, while not interfering with a current state of the web page. AJAX implementations typically combine hypertext markup language (HTML) or extensible hypertext markup language (XHTML) with cascading style sheets (CSS) for presentation; a document object model (DOM) for the dynamic display of, and interaction with, data; extensible markup language (XML) for data definition and interchange; extensible stylesheet language transformations (XSLT) for data transformations; extensible markup language hypertext transfer protocol request (XMLHttpRequest) objects for asynchronous communication; and JavaScript as a “glue” language for combining the technologies.
In non-AJAX applications, a one-to-one correspondence exists between a state of a document object model (DOM) and a corresponding uniform resource locator (URL). Thus, traditional crawlers typically partition a search space using metrics based primarily on the URL. See, for example, Design and Implementation of a High-Performance Distributed Web Crawler, 2002, Vladislav Shkapenyuk and Torsten Suel. In the described framework, each crawler is responsible for a specific set of URLs: a particular crawler visits an original URL assigned to it and obtains information regarding new URLs located using the original URL. When a newly discovered URL falls within a set of URLs allocated to another crawler node, the first node communicates with the second node to inform the second node of the URL newly discovered by the first node.
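The URL-based partitioning described above can be illustrated with a minimal sketch. The sketch is not taken from the cited works; it assumes, purely for illustration, that ownership is decided by hashing the host name of the URL, and the function names owner_of and route_discovered are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def owner_of(url: str, num_nodes: int) -> int:
    """Return the index of the crawler node responsible for a URL.

    Assumption for illustration: ownership is decided by a hash of the
    host name, so all pages of one site belong to the same node.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def route_discovered(node_id: int, discovered_urls, num_nodes: int):
    """Split newly discovered URLs into those this node keeps and those
    that must be reported to the node that owns them."""
    local = []     # URLs this node crawls itself
    forward = {}   # owner node index -> URLs to send to that node
    for url in discovered_urls:
        owner = owner_of(url, num_nodes)
        if owner == node_id:
            local.append(url)
        else:
            forward.setdefault(owner, []).append(url)
    return local, forward
```

Because the owner of a page is computed from the URL alone, a node can decide where a newly discovered URL belongs without loading the page, which is the property that depends on the one-to-one correspondence between the URL and the state of the DOM.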
Increased use of interactive and AJAX based JavaScript libraries has caused the number of AJAX enabled rich Internet applications (RIAs) to increase rapidly. In the RIA type of applications, unlike in the non-AJAX applications, a one-to-one correspondence between a state of the DOM and the URL does not exist. Therefore, techniques used in traditional crawlers typically do not work, or do not work well, in such applications. For example, a crawler may not be able to reach all states merely by sending the URL, and partitioning of the search space may no longer be based on the URL alone.
When processing an RIA type application, a crawler may have to execute an event to materialize a new page, in contrast with typical traditional applications in which a crawler could simply view a destination URL and identify a node responsible for the URL. Further, when processing a non-AJAX application, a cost of reaching a page in a traditional website is typically constant because at any point a crawler can simply jump directly to any page using the URL of the page. The cost varies, however, when processing an RIA, because reaching a state typically requires executing a sequence of events, and the cost typically increases with the length of that sequence. The cost function associated with crawling web pages is therefore an important factor with regard to crawler performance metrics associated with coverage and timeliness of coverage.
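The difference between the two cost behaviors can be illustrated with a minimal sketch. The sketch assumes, purely for illustration, that loading a URL and executing one event each cost one unit, and that the states of the RIA and the events between them are already known as a transition table; the names cost_traditional, cost_ria, load_cost, and event_cost are hypothetical and the disclosure does not prescribe a particular cost function.

```python
from collections import deque


def cost_traditional(load_cost: int = 1) -> int:
    """Cost of reaching any page of a non-AJAX site: one URL load."""
    return load_cost


def cost_ria(transitions, initial, target, load_cost=1, event_cost=1):
    """Cost of reaching a state of an RIA: load the URL, then execute the
    shortest known sequence of events from the initial state.

    transitions maps a state to a list of (event, next_state) pairs.
    Returns None if the target state is not reachable.
    """
    events_needed = {initial: 0}
    queue = deque([initial])
    while queue:  # breadth-first search finds the shortest event sequence
        state = queue.popleft()
        if state == target:
            return load_cost + event_cost * events_needed[state]
        for _event, next_state in transitions.get(state, []):
            if next_state not in events_needed:
                events_needed[next_state] = events_needed[state] + 1
                queue.append(next_state)
    return None
```

Under these assumptions, a state that is two events away from the initial state costs three units to reach, while every page of a traditional website costs one unit regardless of where the page sits in the site.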