1. Field of the Invention
In one embodiment, this invention concerns crawling a web site on the World Wide Web (“WWW”), and more specifically a web site wherein at least one web page in the web site has a reference for executing by a browser to produce an address for a next page. In this context, the invention also concerns web pages that are dynamically generated responsive to queries, such as queries associated with the crawling. This includes preemptively transforming dynamically generated web pages into static web pages.
2. Description of the Related Art
The World Wide Web is an interconnected network of computers and information appliances. Clients use the WWW to send requests to servers, which send back responses. “Static data” is server response data that already exists on the server at the time of the request and that is merely served back to the client without change by the server. A news article is an example of static data. While the news may change daily, or even minute-by-minute, nevertheless, according to the typical scenario, an article is created in response to a news item, and then the article is put on the server as a static document. That is, the article itself does not subsequently change, even though other, newer articles will sooner or later be placed on the server too. This is in contrast to “dynamic data” that is created by the server in direct response to a client request. A web page displaying a bank account balance or stock positions in a trading account is an example of a page that is created by the server and that may change with each interaction between the client and the server.
Dynamically generating data tends to impose a substantial load on server resources. While this load could be mitigated by precomputing responses in anticipation of all possible requests, doing so would require substantial resources both for computing the responses and for storing them. Some middle ground may be ideal, in which the server stores the most likely requested information in static form and creates other information on demand.
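The middle ground described above can be illustrated with a minimal sketch. It assumes a hypothetical `generate` callable standing in for the server's dynamic page generation and a hand-picked list of likely requests; neither name comes from the description itself.

```python
def precompute(likely_requests, generate):
    """Build static responses ahead of time for the requests judged most likely."""
    return {request: generate(request) for request in likely_requests}


def make_handler(static_store, generate):
    """Return a request handler that prefers static responses.

    `generate` is a hypothetical callable that builds a response on demand.
    The returned `stats` dict counts how each request was satisfied.
    """
    stats = {"static": 0, "dynamic": 0}

    def handle(request):
        if request in static_store:
            stats["static"] += 1
            return static_store[request]
        stats["dynamic"] += 1
        return generate(request)

    return handle, stats


# Hypothetical generator standing in for database queries and HTML formatting.
def generate_page(request):
    return f"<html><body>{request}</body></html>"


store = precompute(["/news/today", "/news/sports"], generate_page)
handle, stats = make_handler(store, generate_page)
first = handle("/news/today")     # served from the static store
second = handle("/account/1234")  # generated on demand
```

The trade-off is visible in the two dictionaries: `store` grows with the number of precomputed requests, while `stats["dynamic"]` counts the requests that still cost server work.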
A web site may thus choose to convert a subset of its content to be delivered statically in order to reduce server resource demands. The conversion may be done manually, by employing web page designers. However, such a scheme is inflexible since any changes to the data or its presentation will require a large number of web pages to be manually recreated. An alternative is to automatically generate the static responses using the raw data on the web site. A program could be set up to format the extracted data (such as from database queries) and encapsulate it within the appropriate HTML content, thus creating static pages. The disadvantage of this approach is that the program must be provided with the parameters with which to query the database as well as the inter-document hierarchy which specifies how the documents will be hyperlinked together. Determining and providing this information requires significant resources. There is thus a need for an automated method that can be used to easily convert subsets of a web site to static content.
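The program-based approach, and its dependence on manually supplied inputs, might be sketched as follows. This is an illustration only: `run_query`, `query_params`, `hierarchy`, and the sample catalog are hypothetical names and data, not part of any particular web site or product.

```python
import html


def render_page(title, rows, child_pages):
    """Wrap query results and links to child pages in a minimal HTML document."""
    items = "".join(f"<li>{html.escape(str(row))}</li>" for row in rows)
    links = "".join(
        f'<a href="{html.escape(name)}.html">{html.escape(name)}</a>'
        for name in child_pages
    )
    return (
        f"<html><head><title>{html.escape(title)}</title></head>"
        f"<body><ul>{items}</ul>{links}</body></html>"
    )


def staticize(query_params, hierarchy, run_query):
    """Build one static page per entry in `query_params`.

    query_params: page name -> parameters for the database query (supplied by hand)
    hierarchy:    page name -> names of pages it should link to (supplied by hand)
    run_query:    hypothetical callable that returns rows for given parameters
    """
    pages = {}
    for name, params in query_params.items():
        rows = run_query(params)
        pages[f"{name}.html"] = render_page(name, rows, hierarchy.get(name, []))
    return pages


# Hypothetical data standing in for a real database.
CATALOG = {"books": ["Dune", "Emma"], "music": ["Kind of Blue"]}

static_pages = staticize(
    query_params={"index": {"category": None}, "books": {"category": "books"}},
    hierarchy={"index": ["books"]},
    run_query=lambda params: CATALOG.get(params["category"], []),
)
```

Note that both `query_params` and `hierarchy` have to be written out by someone who already knows the site, which is the cost the paragraph above identifies.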
Use of a crawler would be advantageous for automating this conversion of a web site to static content; however, numerous difficulties prevent this. Conventional search engine crawlers start with a URL and repeatedly de-reference all unexplored URLs in the received responses. One reason conventional crawlers are not suitable for the “staticizing” problem concerns the action sequences that must be performed to obtain a particular end data set for a conventional HTML query. Furthermore, references from one web page to another may not be straightforward. That is, a reference may not be simply set out on the page as a hyperlink address, but instead may be a script, form, selection menu, or button, for example. Thus a need exists for improvements in crawler programs, to overcome their limitations so that they may be used for the staticizing problem as well as other applications.
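The conventional crawling loop, and the kind of reference it misses, can be sketched as below. This is a simplified illustration under stated assumptions: `fetch` is a hypothetical callable that maps a URL to HTML text, and the in-memory `SITE` with its `example.test` addresses is invented for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl: dereference every unexplored URL found in responses.

    `fetch` is a hypothetical callable mapping a URL to its HTML text
    (or None if the page cannot be retrieved).
    """
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        pages[url] = page
        parser = LinkExtractor()
        parser.feed(page)
        for href in parser.links:
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# Hypothetical in-memory site: b.html is reachable by a plain hyperlink,
# c.html only through a script-driven button.
SITE = {
    "http://example.test/": '<a href="b.html">B</a>'
                            '<button onclick="location=\'c.html\'">C</button>',
    "http://example.test/b.html": "<p>static page</p>",
    "http://example.test/c.html": "<p>reached only via script</p>",
}

crawled = crawl("http://example.test/", SITE.get)
```

In this sketch the crawl retrieves the start page and `b.html` but never requests `c.html`, because the address of that page exists only inside a script attached to a button rather than as a hyperlink address.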