The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
The web, including the Internet, is a complex event-driven platform with numerous applications, technologies, protocols, and use cases available to address a variety of real-world problems. On the web, a set of HTML code that a browser can render (“webpage”) may be a functional unit of content. A webpage usually contains a link to another webpage and may comprise various elements, including hypertext, images, and videos. A webpage may be part of a web application that comprises more than one webpage in one or more sequences. One example of a web application is an online banking web application, which may include static or dynamically generated sequences of webpages for logging in to an account, viewing an account, paying bills electronically, depositing money, managing an account, and performing other tasks.
A web crawler is software that is programmed to retrieve webpages, including webpages in web applications, and to follow each link in a retrieved webpage to reach other webpages. A crawler, when given a set of Uniform Resource Locators (URLs), referred to as “seed URLs”, retrieves the webpage at each seed URL and, using hyperlinks included in that webpage, retrieves the linked webpages. For each webpage that is retrieved, the crawler parses or scans the HTML source code of the webpage and records data. The crawler may record data about the webpage, may record data about the visit, such as a timestamp, and/or may save a copy of the webpage itself. Without intervention, the crawler will continue to crawl to further linked pages until instructed to stop.
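For illustration only, the crawl loop described above can be sketched in Python using the standard library. This is a minimal sketch under stated assumptions, not an implementation taken from any particular crawler: the names `LinkExtractor`, `VisitRecord`, `fetch_over_http`, and `crawl` are hypothetical, the breadth-first visiting order is one possible choice, and the `max_pages` limit is an assumed stand-in for the instruction to stop.

```python
from collections import deque
from dataclasses import dataclass
from html.parser import HTMLParser
from time import time
from typing import Callable, Dict, Iterable, List
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a webpage (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


@dataclass
class VisitRecord:
    """Data recorded per visit: the URL, a timestamp, a page copy, and its links."""
    url: str
    timestamp: float
    html: str
    links: List[str]


def fetch_over_http(url: str) -> str:
    """Default fetcher: retrieves the HTML source of a webpage over HTTP(S)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(
    seed_urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_over_http,
    max_pages: int = 100,  # assumed stop condition; without it the crawl keeps going
) -> Dict[str, VisitRecord]:
    """Retrieves each seed URL, then follows hyperlinks to further webpages."""
    queue = deque(seed_urls)
    seen = set(queue)
    records: Dict[str, VisitRecord] = {}

    while queue and len(records) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip webpages that cannot be retrieved

        # Parse the HTML source and resolve each link against the page URL,
        # dropping any #fragment so the same webpage is not queued twice.
        parser = LinkExtractor()
        parser.feed(html)
        links = [urldefrag(urljoin(url, href)).url for href in parser.links]

        records[url] = VisitRecord(url, time(), html, links)

        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)

    return records
```

In this sketch, passing a custom `fetch` callable (for example, a lookup into an in-memory dictionary of pages) allows the loop to be exercised without network access, and the `seen` set keeps the crawler from revisiting a webpage when links form a cycle.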