The present invention relates generally to the field of web site crawling, and more particularly to improving web crawling efficiency by clustering JavaScript events using common structures of interactive web sites. (Note: the term “JavaScript” may be subject to trademark rights in various jurisdictions throughout the world and is used here only in reference to the products or services properly denominated by the mark to the extent that such trademark rights may exist.)
The use of web application technology, such as asynchronous JavaScript and XML (AJAX) techniques, in client-side web applications is changing the web experience from web pages that each have a unique uniform resource locator (URL) to highly dynamic and interactive web pages that share a common URL. Technologies such as those included in AJAX techniques allow web applications to send and retrieve data without refreshing the current display. This interactive and dynamic web page behavior poses a great challenge for web crawlers attempting to automatically navigate web pages and web sites that employ such techniques.
Web crawling is the process of browsing a web application in a methodical, automated manner, or in an orderly fashion. Traditional crawling techniques are not sufficient for web applications built using rich Internet application (RIA) technologies. In a traditional web application, a page is defined by its URL, and all pages reachable from the current page have their URLs embedded in the current page. Crawling a traditional web application therefore requires extracting these embedded URLs and traversing them in an effective sequence. In RIAs, however, the current page can change its state dynamically, sometimes without even requiring user input, and hence cannot be mapped to a single URL. As a result, traditional crawlers are unable to efficiently crawl RIAs, except for the few pages that have distinct URLs.
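The traditional extract-and-traverse approach described above can be sketched as follows. This is an illustrative fragment only, not part of the claimed invention: the in-memory `pages` mapping stands in for network fetches, and a real crawler would download each URL instead.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags in a page's HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(pages, start):
    """Breadth-first traversal over the URLs embedded in each page.

    `pages` maps a URL to its static HTML (a hypothetical stand-in
    for fetching the URL over the network). Returns the URLs in the
    order they were visited.
    """
    visited, order = set(), []
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in visited or url not in pages:
            continue
        visited.add(url)
        order.append(url)
        extractor = LinkExtractor()
        extractor.feed(pages[url])
        queue.extend(extractor.links)
    return order
```

Because every reachable page has a distinct URL, visiting each URL once suffices; it is precisely this assumption that fails for RIAs.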
For example, an AJAX web application may contain hundreds of JavaScript events with which a user interacts to navigate to a new state of the site, where a site state is a presentation of particular content. To explore all possible states, a web crawler would need to execute all JavaScript events in all combinations, which is not feasible for web sites with many web pages interconnected by multiple links. In many cases, different combinations of JavaScript events lead to similar web page states.
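The observation that many event combinations reach the same state suggests why deduplicating states prunes the search. The following sketch uses a hypothetical model in which `transition(state, event)` stands in for executing a JavaScript event in a live browser; states that hash equal are explored only once, so the number of states visited can be far smaller than the number of event sequences.

```python
from collections import deque

def explore(initial_state, events, transition):
    """Breadth-first exploration of application states.

    `transition(state, event)` returns the state reached by firing
    `event` in `state` (a stand-in for executing a JavaScript event
    in a browser -- purely illustrative). Each distinct state is
    expanded only once, even though many event sequences lead to it.
    """
    seen = {initial_state}
    queue = deque([initial_state])
    while queue:
        state = queue.popleft()
        for event in events:
            nxt = transition(state, event)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def toggle(state, event):
    """Toy transition: each event toggles one flag in the state."""
    return state ^ frozenset([event])
```

With two toggling events there are unboundedly many event sequences, yet only four distinct states, so the exploration terminates after visiting four states.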
Crawling is essential to the functioning of the web. Much of the web's value lies in the information it provides, and that information can be made available only if its sources can be found and indexed. If search engines are unable to crawl websites containing new information, they are unable to index that information.