This disclosure relates generally to content discovery using a data processing system and more specifically to crawling rich Internet applications using the data processing system.
Web application crawling is a basis of content indexing. To index and find new information, a search engine should be able to crawl web applications efficiently. Crawling is also critical to tools that perform analysis of web applications, for example, for security, compliance, and accessibility testing.
The introduction of newer and richer technologies for web application development has produced web applications that are more usable and interactive. These applications, referred to as rich Internet applications (RIAs), have changed traditional web applications, resulting in more responsive applications with an improved user experience.
In rich Internet applications, computations can be processed on the client side using scripts that enable a user to modify the user interface (partially or completely) by triggering events defined on user interface components (for example, hypertext markup language (HTML) elements). Asynchronous communication enables the user to retrieve parts of web pages, and enables web designers to build very complex and interactive web applications.
These improvements, while typically increasing the usability of web applications, have introduced many web application crawling challenges. A significant challenge is that traditional crawling techniques are no longer compatible with web applications built using the new technologies, because the uniform resource locator (URL) may not change when the state of the application changes, so crawling engines must use the document object model (DOM) of the web page to infer information regarding the state of the application. Crawling rich Internet applications is a problem that needs to be addressed to maintain the ability to search and test the web applications.
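The difference between URL-based and DOM-based state tracking can be illustrated with a minimal sketch. It is an illustration only, not the technique of this disclosure: the names `dom_fingerprint` and `_DomSerializer`, the example URL, and the sample markup are all hypothetical, and hashing a canonical serialization of the DOM is assumed here as one simple way to tell application states apart.

```python
import hashlib
from html.parser import HTMLParser


class _DomSerializer(HTMLParser):
    """Hypothetical helper: flattens markup into a canonical token list."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes by name so attribute order does not affect the result.
        ordered = sorted(attrs, key=lambda pair: pair[0])
        attr_text = ",".join(f"{name}={value}" for name, value in ordered)
        self.tokens.append(f"<{tag} {attr_text}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Ignore whitespace-only text so formatting does not affect the result.
        text = data.strip()
        if text:
            self.tokens.append(text)


def dom_fingerprint(dom_html):
    """Return a SHA-256 digest of the canonical form of the given markup."""
    serializer = _DomSerializer()
    serializer.feed(dom_html)
    serializer.close()
    canonical = "\n".join(serializer.tokens)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Assumed scenario: a script updates the page after a click; the URL is unchanged.
url = "http://example.com/app"
dom_before = '<div id="cart"><p>0 items</p></div>'
dom_after = '<div id="cart"><p>1 item</p></div>'

# A URL-keyed crawler sees one state; a DOM-keyed crawler sees two.
seen_by_url = {url, url}
seen_by_dom = {dom_fingerprint(dom_before), dom_fingerprint(dom_after)}
```

In this sketch a crawler that records visited states by URL treats both snapshots as the same page, while one that records DOM fingerprints distinguishes them. A practical crawler would likely need a more tolerant notion of state equivalence, since an exact hash treats any textual difference as a new state.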