Database systems store enormous amounts of information that users can access to identify and retrieve valuable documents containing data, text, audio and video information. A typical example of a database system 100 is shown in FIG. 1. Information processing units 101a to 101n can be any of the following: personal computers (DOS, WINDOWS, Macintosh or Linux machines), workstations, clients, dumb terminals or equivalents. Hub processing units 102a to 102y can be any of the following: a server, a master, a database controller or equivalent. Network 100 can be any of the following: a token ring network, a star network, a telecommunication switching network, a local area network (LAN), a wide area network (WAN), a corporate intranet, the Internet or equivalent. Information processing units 101a to 101n are in communication with hub processing units 102a to 102y via network 100. The sharing of data across network 100 is accomplished by computer search programs 103a to 103x operating in conjunction with the hub processing units 102a to 102y. The search programs can be located on the hub processing units themselves or on other processing units that are not shown. In addition, a user employs a graphical user interface (GUI) 104a to 104n to submit search queries across network 100 to the hub processing units.
Upon reception of the search query, the hub processing units forward the request to the search programs 103a to 103x for completion of the transaction. As is well known, search programs provide Boolean operators (AND, OR, NOT) to help build more sophisticated queries in order to narrow down the search result set. These Boolean operators supply the various permutations to the search programs 103a to 103x, which use them to locate pertinent documents. Once in possession of the search query, the search programs compare the requested search parameters against documents stored in databases 105a to 105z. When words or phrases compare favorably with the search query, the search programs return a list of relevant documents to the information processing units 101a to 101n, along with library information such as the type of document, its location, and highlighted words or phrases indicating the flags that caused the search program to retrieve the particular document. Finally, the search results are loaded into the graphical user interface (GUI) 104a to 104n for the user's review.
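The Boolean query matching performed by the search programs can be illustrated with a minimal sketch. The function and document names below are hypothetical and serve only to show how AND, OR and NOT operators narrow a result set; a real search program would operate against an index rather than raw text.

```python
# Minimal sketch of Boolean query matching, as performed by search
# programs such as 103a to 103x. All names here are illustrative.

def matches(doc_text, required, excluded, alternatives):
    """Return True if the document satisfies an AND/OR/NOT query.

    required:     terms joined by AND (all must appear)
    excluded:     terms joined by NOT (none may appear)
    alternatives: terms joined by OR (at least one must appear, if any given)
    """
    words = set(doc_text.lower().split())
    if any(term.lower() not in words for term in required):
        return False
    if any(term.lower() in words for term in excluded):
        return False
    if alternatives and not any(t.lower() in words for t in alternatives):
        return False
    return True

documents = {
    "doc1": "database systems store text audio and video",
    "doc2": "web crawlers retrieve static content",
}

# Query: database AND text NOT crawler
hits = [name for name, text in documents.items()
        if matches(text, ["database", "text"], ["crawler"], [])]
```

Only doc1 satisfies the query: it contains both required terms and lacks the excluded one.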
The search programs 103a to 103x used to return the search results in FIG. 1 are commonly referred to as “web crawlers”. Today's web crawling methodologies are already able to retrieve heterogeneous, static content from the World Wide Web (WWW). However, as more and more designers use dynamically generated content in their web-based documents, existing crawling techniques are not always capable of retrieving the data correctly. Known enhanced crawling architectures are able to simulate user interaction, and thus enable automatic crawling of web sites that dynamically generate their data and associate that data with session information.
Referring now to the flow diagram 200 of FIG. 2, a typical web crawler performs two main operations in order to execute the crawling process: the access and retrieval of a document (202), followed by the analysis phase of the document, also called the summarization process (204). Whereas today's web crawler might be able to access a dynamically generated document correctly (e.g., a document generated through Active Server Pages, Perl script or an equivalent), the summarization process will fail or produce flawed results if the document itself contains executable client side software code. The reason for this is that client side software code (e.g., JavaScript, VBScript, or equivalent) is targeted to be executed and interpreted within a web browser's scripting engine. When executed, the code is either replaced with content or itself produces content. Web designers often make use of this feature to dynamically create content on the client side; examples include computation results derived from user input, specific text based on the version of the client's web browser, or some other such equivalent. More generally, dynamic documents rely on a web browser's capabilities to:
    a) retrieve additional documents (206) as needed or required (frames, in-line images, audio, video, applets, or equivalents);
    b) execute client side script and code (208) (JavaScript or equivalents);
    c) furnish a fault tolerant HTML filter to recognize various HTML standards and interpret HTML markup errors;
    d) unscramble content that a web designer has purposefully scrambled in order to thwart crawling and other programmatic analysis methods, thereby producing a final HTML markup (210); and
    e) integrate all these previously obtained results to render (212) the document for presentation to a user (214).
As one can see, the web browsing process performed by the web browsers 104a to 104n on the side of the information processing units 101a to 101n can become very complicated and convoluted.
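The failure mode described above can be sketched briefly. The example below shows the summarization phase (204) applied to a document containing client side script; the class name is hypothetical, and the document is inlined rather than retrieved over the network so that the sketch is self-contained.

```python
# Sketch of the summarization phase (204) from FIG. 2 applied to a
# dynamic document. A live crawler would first perform the access and
# retrieval phase (202) over HTTP; here the document is inlined.
from html.parser import HTMLParser

class NaiveSummarizer(HTMLParser):
    """Collects all character data -- including client side script
    source text, which is what makes the summary flawed."""
    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

document = """
<html><body>
<p>Welcome to the catalog.</p>
<script>document.write('Generated price: ' + 19.99);</script>
</body></html>
"""

parser = NaiveSummarizer()
parser.feed(document)
summary = " ".join(parser.text)
# The raw JavaScript source ends up in the summary, instead of the
# content it would have produced inside a browser's scripting engine.
```

A browser would execute the script and see the generated price; the naive summarizer indexes the script's source code instead, illustrating why summarization of dynamic documents fails without script execution.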
For these reasons, implementing a robust web browser is not a trivial task, and further problems are involved in such an implementation. As an example, a web browser has to achieve fault tolerance with regard to the underlying HTML used to create a document. First, there are many different HTML versions and standards currently in use. Second, human error is introduced into a document when individuals do not correctly compose HTML markup: they forget brackets, or use the wrong syntax, arguments or parameters, and such errors necessitate a fault-tolerant browser. Therefore, there is a need for a fault-tolerant web crawler that does not fail when summarizing dynamic documents.
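The kind of fault tolerance required can be illustrated with a small sketch. Python's standard-library HTMLParser is used here merely as a stand-in for the crawler's fault tolerant HTML filter; the markup below deliberately contains unclosed tags and a missing end tag, yet the text is still recovered.

```python
# Sketch of fault-tolerant HTML filtering. The markup has unclosed <p>
# and <b> tags and no </html>, mimicking common authoring errors, yet
# the character data is still extracted.
from html.parser import HTMLParser

class TolerantTextExtractor(HTMLParser):
    """Recovers text from malformed markup instead of failing on it."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

broken_markup = (
    "<html><body>"
    "<p>First paragraph"          # <p> never closed
    "<p>Second, never closed"     # again no </p>
    "<b>bold text"                # <b> never closed
    "</body>"                     # </html> missing entirely
)

extractor = TolerantTextExtractor()
extractor.feed(broken_markup)
extractor.close()
text = " ".join(extractor.chunks)
```

A strict parser would reject this input outright; a tolerant filter, as in item c) of the browser capabilities above, recovers the usable content despite the markup errors.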
Further, today's web designers often make intensive use of images and image maps to represent text data in documents. Some of these documents consist only of images, and the images themselves contain all the textual data and other information in the document. Standard web crawlers, however, are not able to summarize such a document. Therefore, there is a need for a web crawler that can interpret and summarize textual and other information contained within the body of a web-based image document.
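One way such a crawler might proceed is sketched below. The `ocr_extract_text` function is a placeholder standing in for a real optical character recognition engine (such as Tesseract); its name, its simulated output and the surrounding pipeline are illustrative assumptions, not part of the source document.

```python
# Sketch of summarizing an image-only document: run OCR over each
# embedded image and index the recovered text so the document becomes
# searchable like ordinary HTML text. All names are hypothetical.

def ocr_extract_text(image_bytes):
    """Placeholder OCR step. A real implementation would decode the
    image and run character recognition over its pixels."""
    # Simulated recognition result, hard-coded for this sketch:
    return "Quarterly report: revenue figures and highlights"

def summarize_image_document(images):
    """Recover text from every image in the document and join the
    results into a single summarizable string."""
    recovered = [ocr_extract_text(img) for img in images]
    return " ".join(recovered)

# A dummy byte string stands in for actual image data:
summary = summarize_image_document([b"\x89PNG-dummy-image-data"])
```

With the recovered text in hand, the summarization phase (204) can proceed exactly as it would for a text-based document.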
In summary, a web browser has to execute a complicated algorithmic process in order to overcome the problems previously described; this complex algorithmic process enables the browser to present and render a document in the manner the web document composer intended it to be displayed. A web browser's functionality is similar to that of a multi-tasking management component, which has to coordinate several tasks to yield an effective end product. The web browser must coalesce information from a variety of sources to produce the final HTML markup that will be rendered and displayed.