The present invention relates generally to computer-based data retrieval and analysis, and more particularly, to web crawling.
In order to automate the discovery of computer-based documents, software tools commonly known as “crawlers” have been developed to retrieve computer-based documents, such as Hypertext Markup Language (HTML) based web pages, and to navigate from computer-based document to computer-based document along hyperlinks, such as Uniform Resource Locators (URLs), that are embedded in the documents and that indicate the locations of other documents. When a crawler retrieves a computer-based document, it typically parses the document text to identify strings that appear to be hyperlinks based on predefined character sequences, such as strings that begin with the characters “http://” or “ftp://”. The crawler then retrieves computer-based documents from the locations indicated by the identified hyperlinks, parses them, and so on. In this manner, crawlers gather computer-based document content for later use, such as by search engines.
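By way of illustration only, the retrieve-parse-follow cycle described above may be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular crawler: the names `LINK_PATTERN`, `extract_links`, `crawl`, `fetch`, and `PAGES` are hypothetical, and a small in-memory set of documents stands in for network retrieval.

```python
import re
from collections import deque

# Predefined character sequences that mark a string as a likely hyperlink
# (assumption: only "http://" and "ftp://" prefixes are recognized).
LINK_PATTERN = re.compile(r'(?:http|ftp)://[^\s"\'<>]+')


def extract_links(text):
    """Return the strings in text that begin with "http://" or "ftp://"."""
    return LINK_PATTERN.findall(text)


def crawl(start, fetch, limit=100):
    """Follow hyperlinks breadth-first from start and collect document text.

    fetch is a hypothetical callable that maps a location to document text,
    or to None when nothing can be retrieved from that location.
    """
    seen = {start}
    queue = deque([start])
    content = {}
    while queue and len(content) < limit:
        location = queue.popleft()
        text = fetch(location)
        if text is None:
            continue
        content[location] = text
        # Parse the retrieved document and schedule each newly found location.
        for link in extract_links(text):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return content


# Illustrative in-memory documents so the sketch runs without network access.
PAGES = {
    "http://example.com/a": '<a href="http://example.com/b">b</a> ftp://example.com/file.txt',
    "http://example.com/b": '<a href="http://example.com/a">a</a>',
}

gathered = crawl("http://example.com/a", PAGES.get)
```

In this sketch the `seen` set prevents a document from being retrieved twice when documents link to one another, and `limit` bounds the amount of content gathered.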
One of the challenges faced by crawlers is that some hyperlinks are not embedded as strings within computer-based documents, but rather are dynamically generated by computer program instructions found within the documents. For example, hyperlinks are often dynamically generated by Asynchronous JavaScript™ and XML (AJAX) instructions within a computer-based document that call entities, such as web servers, that are external to the document. Because dynamically generated hyperlinks are generated only when such instructions are executed, a crawler may employ an execution engine that executes such instructions within a computer-based document during crawling in order to discover any dynamically generated hyperlinks that result from the execution. However, many of the computer program instructions within a computer-based document may be related to operations that do not yield dynamically generated hyperlinks, such as rendering visual effects (e.g., highlighting a line of text on mouse hover), modifying content based on local reasoning (e.g., changing the order of items listed in a table), or performing client-side input validation (e.g., checking that an input box that is restricted to numeric values does not contain non-numeric characters). Thus, indiscriminate execution of the computer program instructions within a computer-based document during crawling is often wasteful and needlessly degrades the performance of the crawler.
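The limitation described above may be illustrated with a short sketch. It assumes a hypothetical document containing two scripts, one that assembles a hyperlink at run time and one that only produces a visual effect; the script contents and the names `host`, `nextPage`, `request`, and `row` are invented for illustration. A scan for predefined character sequences finds no hyperlink in either script, because the hyperlink does not exist as a string until the first script is executed.

```python
import re

# Assumption: hyperlinks are recognized only by an "http://" or "ftp://" prefix.
LINK_PATTERN = re.compile(r'(?:http|ftp)://[^\s"\'<>]+')

# Hypothetical document. The first script assembles a hyperlink at run time;
# the second only highlights a row on mouse hover and yields no hyperlink.
DOCUMENT = """
<script>
  var target = "http" + ":" + "//" + host + "/items?page=" + nextPage;
  request(target);
</script>
<script>
  row.onmouseover = function () { row.className = "highlight"; };
</script>
"""

# Scanning the document text alone discovers no hyperlinks at all.
static_links = LINK_PATTERN.findall(DOCUMENT)
```

Executing both scripts would reveal the hyperlink built by the first, but the work spent executing the second would contribute nothing to hyperlink discovery.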