A website crawler is a tool that automatically explores a website. Automated exploration supports many applications, ranging from simple indexing of information to more complex tasks such as compliance testing.
One of the challenges faced by automated tools is determining whether two JavaScript events on a page perform equivalent actions. Equivalent in this sense means that executing the two JavaScript events independently produces two states of the document object model (DOM) of the page that are equivalent. Determining whether two JavaScript events are equivalent is useful because websites providing services such as news, blogs, on-line stores, and email have many JavaScript actions that perform equivalent tasks.
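The notion of equivalence described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that two DOM states count as equivalent when their tag structure matches once text and attribute values are ignored; the text does not fix a particular comparison rule. The names `StructureSignature`, `dom_signature`, and `events_equivalent` are hypothetical, and each DOM state is assumed to be available as a serialized HTML string captured after running one event.

```python
from html.parser import HTMLParser


class StructureSignature(HTMLParser):
    """Collects the tag skeleton of a document, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Record only the tag name; attribute values are deliberately dropped.
        self.tokens.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.tokens.append("</" + tag + ">")


def dom_signature(html):
    """Return the sequence of opening and closing tags found in the HTML."""
    parser = StructureSignature()
    parser.feed(html)
    parser.close()
    return tuple(parser.tokens)


def events_equivalent(dom_after_a, dom_after_b):
    """Assumed rule: equivalent events lead to DOMs with the same skeleton."""
    return dom_signature(dom_after_a) == dom_signature(dom_after_b)
```

Under this assumed rule, two events that each open a different news story into the same page template compare as equivalent, while an event that opens a form does not.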
In practice, there are normally several sets of equivalent events on a given page, and each event from an equivalent set may lead to displaying a single news item, a single blog entry, a single item in a store, or a single email. Each such set may be referred to as a set of equivalent JavaScript events. Executing all possible equivalent JavaScript events of a website is a time-consuming task that is not required in all cases. For example, when performing a security scan, a crawler is more interested in the structure of a webpage than in the text content of the webpage. In this example, executing just one event in an equivalent set is typically enough, with the results being generalized to every other equivalent JavaScript action in the set.
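The idea of executing one event per equivalent set and generalizing its result can be sketched as follows. The sketch assumes the equivalent sets are already known and are given as lists of event identifiers; `crawl_representatives` and the `execute` callback are hypothetical names, and the event strings in the usage lines are made-up examples.

```python
def crawl_representatives(equivalence_sets, execute):
    """Run one event per set and reuse its outcome for the rest of the set."""
    results = {}
    executed = []
    for event_set in equivalence_sets:
        if not event_set:
            continue  # nothing to execute for an empty set
        representative = event_set[0]
        outcome = execute(representative)
        executed.append(representative)
        # Generalize the single outcome to every event in the same set.
        for event in event_set:
            results[event] = outcome
    return results, executed


# Hypothetical usage: three story links form one set, the cart button another.
sets = [["openStory(1)", "openStory(2)", "openStory(3)"], ["openCart()"]]
results, executed = crawl_representatives(
    sets, lambda event: "scanned:" + event.split("(")[0]
)
```

Here only two of the four events are actually executed, which is the saving the paragraph describes; whether a generalized result is acceptable depends on the application, as with the security scan example.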
In addition, most websites typically change the set of equivalent JavaScript events displayed to the user on subsequent visits. For example, a news site displays the latest news, a blog displays the latest entries, and an on-line store displays the items currently on sale. The crawling of such websites is further complicated because the container page comprising all equivalent JavaScript actions will never be the same from one visit to the next, and therefore a crawler will not know that the current page was previously visited.
When a web crawler does not understand which JavaScript events are equivalent, the crawler is typically not able to identify whether the current page was previously visited, because the content inside the red box has likely changed. The web crawler is also typically unable to finish scanning a website, because every action taken to modify a search criterion brings new content onto the page.
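The failure to recognize a previously visited page can be made concrete with a small sketch. It assumes a crawler that identifies pages by hashing the full serialized DOM, text included; this is one plausible approach chosen for illustration, not one named in the text. `page_fingerprint`, `NaiveVisitedSet`, and the two HTML snippets are hypothetical.

```python
import hashlib


def page_fingerprint(serialized_dom):
    """Hash the full serialized DOM, text content included."""
    return hashlib.sha256(serialized_dom.encode("utf-8")).hexdigest()


class NaiveVisitedSet:
    """Tracks visited pages by exact content fingerprint."""

    def __init__(self):
        self._seen = set()

    def visit(self, serialized_dom):
        """Return True if this exact page content was seen before."""
        fingerprint = page_fingerprint(serialized_dom)
        if fingerprint in self._seen:
            return True
        self._seen.add(fingerprint)
        return False


# The same container page on two visits, showing different headlines.
first_visit = "<ul><li onclick='show(1)'>Headline A</li></ul>"
second_visit = "<ul><li onclick='show(2)'>Headline B</li></ul>"

visited = NaiveVisitedSet()
visited.visit(first_visit)
revisit_detected = visited.visit(second_visit)
```

Because the fingerprint covers the changing content, the second visit is treated as a new page even though the container is the same, so a crawler relying on this check keeps finding pages it considers unexplored.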
Current workarounds for the identified problem typically include limiting the number of JavaScript actions executed on a page, or performing a human-guided exploration of the website. Other solutions typically require the web crawler to execute the JavaScript actions and compare the resulting DOMs.