The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Various applications, including Web crawlers and document processors, analyze web-accessible files (e.g., HTML files) to extract and index information therefrom. Other applications may extract information from files for other purposes. For example, an application may extract information from files to categorize the files, to rate the files, or to test to see if the files satisfy certain criteria. An application that analyzes files is referred to herein as a page analyzer. A Web crawler is merely one example of a page analyzer. While Web crawlers may be used in examples given hereafter, the techniques described herein are not limited to any particular type of page analyzer.
Web-accessible files typically contain mark-up language code (e.g., HTML, XML, etc.) that dictates how the information is to be displayed when the files are interpreted by a browser. The visual output that is produced when such files are interpreted by a browser is referred to herein as a “rendered page” or “web page”. Sometimes, interpreting a file to render a page involves executing code contained in the file.
The rendered page produced when a browser interprets a file often displays information contained in the file. In addition, the rendered page may include information referenced by the file, but not actually included in the file. While the file used to render a page typically contains basic textual characters, the page produced by interpreting the file may include structure, formatting, images, video, etc., instead of or in addition to simple text.
Unfortunately for the developers of page analyzers, many files contain executable code, such as JavaScript, that enables Web authors to create files that produce interactive web pages. A major use of web-based JavaScript is to write functions that are embedded in HTML files to perform tasks not possible in HTML alone. An example of this usage is opening or “popping up” a new window with programmatic control over the size, position and “look” of the new window (e.g., whether or not the menus, toolbars, etc. are visible). Another example of this usage is validating web form input values to make sure that the values will be accepted before they are submitted to the server. Another example is changing images or text as the mouse cursor moves over the images or text. This effect is often used to draw the user's attention to important links displayed as graphical elements.
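The form-validation usage mentioned above can be sketched as follows. The function names and validation rules here are hypothetical, invented for illustration; a real page would embed something similar in a `<script>` block and attach it to the form's submit event:

```javascript
// Hypothetical client-side validation: reject empty or malformed
// values before the form is submitted to the server.
function isValidEmail(value) {
  // Deliberately simple pattern: something@something.something
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value);
}

function validateForm(fields) {
  // fields: { name: "...", email: "..." } -- returns true only when
  // every value would be accepted by the server.
  return fields.name.trim().length > 0 && isValidEmail(fields.email);
}
```

Because this logic runs only in the browser, a page analyzer that does not execute JavaScript cannot observe which inputs the page actually accepts.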
Generally, the presence of executable code in files makes document analysis difficult, because most page analyzers are more interested in the content of the rendered page than in the content of the file that produces the rendered page. To effectively analyze pages that are produced by interpreting files, page analyzers must be able to gather text, follow links, capture popup windows, follow Asynchronous JavaScript and XML (AJAX) instructions, and otherwise analyze such files. Frequently, executable code within a file may cause content that is not reflected in the HTML of the file to be displayed when the file is interpreted by a browser. Page analyzers should take that content into account when analyzing the file.
The following is an example of a simple block of JavaScript code that may exist in a file:
<script> var a = "yahoo.com"; ... document.write("http://www." + a);</script>
Before the above block of JavaScript code is executed, there is no browsable link to http://www.yahoo.com. When the block is executed, the link “http://www.yahoo.com” is generated, and the content located at that link may be retrieved and displayed in a browser. Consistent with the definition given above, the “rendered page” is the result of interpreting a file and executing any executable code that the file contains.
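The effect of executing that block can be simulated outside a browser. The sketch below stubs out `document.write` (the stub is ours, not a browser API) and quotes the string literal so the snippet runs as the author presumably intended; it shows that the link text exists only after execution:

```javascript
// Stand-in for the browser's document object: collect everything
// the script writes into the page.
var written = [];
var document = { write: function (s) { written.push(s); } };

// The script block from the file, with the string literal quoted.
var a = "yahoo.com";
document.write("http://www." + a);

// Only after execution does the page contain the link text.
var renderedText = written.join("");
// renderedText is now "http://www.yahoo.com"
```

A page analyzer that reads only the raw file sees the two string fragments but never the assembled, browsable URL.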
Without evaluating the content produced by executing the code in such files, the true intent of the web page authors remains unknown. Many page analyzers require a more accurate view of content on the Web than is available from the original file alone. For example, a page analyzer may index video content; such an analyzer aims to acquire links to video content on the Web. Frequently, however, video links are generated entirely by JavaScript, and the video content therefore goes unindexed unless that JavaScript is executed.
As another example, a page analyzer may extract product information, such as type of product and price, from files. Many product web pages are generated using templates. A few automated web publishing software packages generate highly dynamic content, such as product price, on web pages in JavaScript. Without complete JavaScript support, the page analyzer would not be able to make full sense of such files because the page analyzer would not be able to render the web page and, thus, would only “see” a portion of the actual content of the rendered page.
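The dynamic-price situation described above can be sketched as follows. The static HTML contains only an empty placeholder, while the publishing template's script computes and writes the price at render time (the product data and discount logic are invented for illustration):

```javascript
// Static HTML the analyzer sees if it does not execute scripts:
var staticHtml = '<div id="price"></div>';

// What the template's embedded script does when the page renders
// (hypothetical data and logic):
var basePrice = 49.99;
var discount = 0.10;
var finalPrice = (basePrice * (1 - discount)).toFixed(2);
var renderedHtml = '<div id="price">$' + finalPrice + '</div>';

// An analyzer reading only staticHtml never sees the price;
// one that executes the script sees the actual content.
```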
As another example, a page analyzer may be a web crawler that indexes information from files. Some spam websites use JavaScript code to conceal inappropriate content from web crawlers. Therefore, the page analyzer may interpret a file as providing legitimate information whereas a web browser may execute the JavaScript code and display only inappropriate content. This form of spam is known as “cloaking.”
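A cloaking page of the kind described above can be sketched as follows: the raw file carries innocuous text, and an embedded script swaps it out when the page is actually rendered. The content strings and function name here are placeholders:

```javascript
// Text present in the raw file -- this is all a non-executing
// crawler will index.
var content = "An informative article about gardening.";

// Script embedded in the same file: replaces the decoy content
// when a real browser renders the page (hypothetical hook name).
function onPageLoad() {
  content = "(spam content shown only to browsers)";
}

// A crawler indexing the file as-is sees the decoy text; a browser
// runs the load handler and displays something else entirely.
onPageLoad();
```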
Based on the foregoing, there is a need to better process files (e.g., HTML files) with embedded executable code (e.g., written in JavaScript) to better extract application-specific information for page analyzers, such as web crawlers.