A web crawler needs a means of identifying the client states of the web application that it is crawling, so that it can deduce a reasonable model of the web application as a state machine.
Traditionally, a uniform resource locator (URL) was used as a unique identifier of a state of a web application, since in a world of traditional web applications one safely assumed there was a one-to-one correspondence between client states of a web application and associated URLs. Crawling rich Internet applications that frequently use a grouping of technologies collectively referred to as asynchronous JavaScript® and extensible markup language (XML), or AJAX, has however been a challenge, since in these web applications the URL does not uniquely identify a client state of the web application. As a result, the crawler needs an alternative method of identifying client states.
Current solutions typically take a snapshot of a document object model (DOM) of a client state at a particular time, exclude irrelevant and unimportant data from the captured data of the snapshot, and assign a unique identifier to the resulting DOM, for example, by computing a hash of the DOM of the respective instance. Using this technique, whenever the crawler encounters a DOM that was not encountered previously, the crawler considers the new DOM to be a new state of the web application.
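The snapshot-and-hash technique described above can be sketched as follows. This is a minimal illustration only, assuming the DOM is available as a serialized HTML string; the function names `normalize_dom` and `state_id`, the class `StateRegistry`, and the choice of what counts as irrelevant data (here, comments and whitespace) are hypothetical and are not taken from any particular crawler.

```python
import hashlib
import re


def normalize_dom(dom_html: str) -> str:
    """Drop data assumed irrelevant to state identity (hypothetical rules)."""
    # Remove HTML comments, which are assumed not to affect the client state.
    without_comments = re.sub(r"<!--.*?-->", "", dom_html, flags=re.DOTALL)
    # Collapse whitespace runs so formatting differences do not change the hash.
    collapsed = re.sub(r"\s+", " ", without_comments)
    # Remove whitespace that sits between adjacent tags.
    return re.sub(r">\s+<", "><", collapsed).strip()


def state_id(dom_html: str) -> str:
    """Return a hash of the normalized DOM, used as the state identifier."""
    normalized = normalize_dom(dom_html)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class StateRegistry:
    """Tracks which DOM hashes the crawler has already encountered."""

    def __init__(self) -> None:
        self._seen = set()

    def is_new_state(self, dom_html: str) -> bool:
        """Record the DOM and report whether its hash was unseen so far."""
        identifier = state_id(dom_html)
        if identifier in self._seen:
            return False
        self._seen.add(identifier)
        return True
```

Under these assumptions, two snapshots that differ only in comments or whitespace map to the same state, while any other difference in the DOM yields a new state.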
However, a user interface of a rich Internet application is typically composed of several different parts, each of which interacts with a user independently of the other parts. For example, in the case of mashup applications and webpages containing widgets, individual applications and widgets typically act independently. As a result, different combinations of widgets can easily be displayed to the user in various ways, forming new DOMs that were never before encountered.
A typical drawback of current solutions is a state explosion, because DOMs that were never encountered before, and are therefore considered new application states, are really just new combinations of parts of already-seen DOMs. Accordingly, the independence of the parts of the user interface can easily lead to a state space explosion as the number of independent parts in a web application increases, effectively reducing the likelihood that the web crawler finishes crawling a particular web application when the crawler uses a current state definition approach.
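The growth can be made concrete with a small calculation. The sketch below assumes, for illustration only, that every part of the user interface is fully independent and has a fixed number of internal states; the function names are hypothetical. Under that assumption, the number of distinct whole-page DOMs is the product of the per-part state counts, whereas the number of distinct part states is only their sum.

```python
import math


def whole_dom_state_count(part_state_counts):
    """Distinct whole-page DOMs when every part varies independently."""
    return math.prod(part_state_counts)


def per_part_state_count(part_state_counts):
    """Distinct states when each part is counted on its own."""
    return sum(part_state_counts)


# Three widgets, each either collapsed or expanded (two states per widget).
three_widgets = [2, 2, 2]
# Ten such widgets on one page.
ten_widgets = [2] * 10

print(whole_dom_state_count(three_widgets), per_part_state_count(three_widgets))
print(whole_dom_state_count(ten_widgets), per_part_state_count(ten_widgets))
```

With two states per widget, three widgets give 8 whole-page DOMs against 6 part states, and ten widgets give 1024 whole-page DOMs against 20 part states.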
A simple example illustrating the problem, taken from a user interface building tutorial that uses a plurality of widgets, is shown in FIG. 1 through FIG. 4.
With reference to FIG. 1, a screenshot of a portion of a user interface in which DOM 1=State 1 is presented. Consider an application of the example containing only three widgets, as shown in FIG. 1, in which the initial state is identified as DOM 1=State 1. Each time one of widget 1 102, widget 2 104 or widget 3 106 is opened, the content of the opened widget is retrieved from a server and inserted into the DOM of the page represented in user interface 100.
With reference to FIG. 2, a screenshot of a portion of user interface 100 in which DOM 2=State 2 is presented. When a crawler expands widget 1 102 on the page shown in FIG. 1, a new state occurs, as shown in FIG. 2. FIG. 2 accordingly illustrates DOM 2=State 2 for a portion of the application.
With reference to FIG. 3, a screenshot of the portion of user interface 100 of FIG. 2 in which DOM 3=State 3 is presented. When the crawler expands widget 2 104 in addition to widget 1 102 of FIG. 2, a further new state results, as shown in FIG. 3, in which DOM 3=State 3. Note, however, that State 3 is simply a combination of the expansions of widget 1 102 and widget 2 104.
With reference to FIG. 4, a screenshot of a portion of user interface 100 in which DOM 4=State 4 is presented. When the crawler returns to State 1 and expands only widget 2 104, a new state occurs in which DOM 4=State 4, as illustrated in FIG. 4.
The example illustrates how the three widgets on a page of the web application can easily lead to a state explosion: every widget may independently have other internal states, and the respective combinations of widgets can effectively disable a web crawler because of the number of states that result. In analyzing the states of the example, a human user can readily perceive that the information contained in State 3 of FIG. 3 is redundant, consisting simply of a union of State 2 and State 4. However, the combination is not apparent to a web crawler that uses a complete DOM as the means to create states. Each state is accordingly treated as a unique DOM instance.
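The redundancy in the example can be reproduced with a toy model. The sketch below is an illustration under stated assumptions, not a depiction of the actual tutorial page: each state is modeled as the set of expanded widgets, `render_dom` is a hypothetical stand-in for the page markup, and `dom_hash` stands in for the whole-DOM identifier used by current solutions.

```python
import hashlib

WIDGETS = ("widget 1", "widget 2", "widget 3")

# Each state of FIG. 1 through FIG. 4, modeled as the set of expanded widgets.
STATE_1 = frozenset()
STATE_2 = frozenset({"widget 1"})
STATE_3 = frozenset({"widget 1", "widget 2"})
STATE_4 = frozenset({"widget 2"})


def render_dom(expanded):
    """Build a hypothetical page DOM in which expanded widgets carry content."""
    parts = []
    for name in WIDGETS:
        body = f"<p>content of {name}</p>" if name in expanded else ""
        parts.append(f'<div class="widget" title="{name}">{body}</div>')
    return "<body>" + "".join(parts) + "</body>"


def dom_hash(expanded):
    """Identify a state by hashing the complete DOM, as current solutions do."""
    return hashlib.sha256(render_dom(expanded).encode("utf-8")).hexdigest()


# A crawler hashing the complete DOM sees four unrelated identifiers.
whole_dom_ids = {dom_hash(s) for s in (STATE_1, STATE_2, STATE_3, STATE_4)}
print(len(whole_dom_ids))

# Viewed per widget, State 3 adds nothing beyond State 2 and State 4.
print(STATE_2 | STATE_4 == STATE_3)
```

The whole-DOM hashes give four distinct states, although the set view shows that State 3 contains no widget content that is not already present in State 2 or State 4.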