1. Field of Invention
The present invention relates generally to the field of automated retrieval of World Wide Web documents. More specifically, the present invention is related to automated retrieval of World Wide Web documents not available via static hyperlinks.
2. Discussion of Prior Art
A search engine is a program that searches documents for specified keywords and returns a list of the documents where the keywords were found. Although search engines are a general class of programs, one well-known type of search engine enables users to search for Web pages on the World Wide Web (“Web”).
These search engines typically work by using a program, known as a Web crawler, that fetches as much Web content (i.e., hypertext markup language (HTML) pages and other documents) from the Web as possible. Another program, called an indexer, then reads the fetched documents and creates an index based on the words contained in each document.
Web crawlers find and fetch Web content by following hyperlinks, which are specified as Uniform Resource Locators (URLs) appearing in the body of HTML pages. A limitation in today's Web crawlers is that they only follow static hyperlinks, i.e., links in which the full URL is plainly visible in the HTML document and easily extracted by the crawler.
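By way of illustration only, the extraction of static hyperlinks described above can be sketched as follows. This is a minimal sketch using the Python standard library, not the implementation of any particular crawler. The names StaticLinkExtractor, extract_static_links and SAMPLE_PAGE, and the www.example.com addresses, are hypothetical and chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StaticLinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_static_links(html, base_url):
    """Return the absolute URLs of all static hyperlinks found in html."""
    parser = StaticLinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: two static hyperlinks and one search form.
SAMPLE_PAGE = """
<html><body>
  <a href="/bestsellers/fiction.html">Fiction bestsellers</a>
  <a href="http://www.example.com/about.html">About</a>
  <form action="/search" method="get">
    <input type="text" name="author">
  </form>
</body></html>
"""

links = extract_static_links(SAMPLE_PAGE, "http://www.example.com/index.html")
```

In this sketch the two anchor targets are collected, while nothing behind the search form is discovered, since the form carries no URL that can be followed without first supplying input.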
In contrast, there is a large volume of content available on the Web that is not accessible via static hyperlinks. This content is generated dynamically based upon user interactions with the Web site. One example is the content that resides in Web databases. Generally, this content is accessible only through directed queries resulting from HTML forms. Without a directed query, content in the database is not published. When the database is queried, the results are returned as dynamic Web pages in real-time.
It would be beneficial for Web crawlers to be able to retrieve the additional content that is not accessible via static hyperlinks, especially since the content generated in response to submitting HTML forms typically originates from proprietary databases containing highly valuable competitive information. For instance, Amazon.com™ has a database of millions of books that it sells; yet static hyperlinks (in the form of browsable categories) are provided only to the bestsellers in different categories, not to the entire database. Therefore, a Web crawler that only follows static hyperlinks will see only a small fraction of the entire database.
For a Web crawler to access this content, it has to emulate the communications between a Web browser and the Web server that result from user interaction with the Web site. For instance, for Web databases accessible via HTML forms, what a user places in the input items of the form is encoded in an HTTP message or a URL, which is used to query the database. For a Web crawler to access the content in the Web database behind the form, it has to generate similar HTTP messages or URLs that contain valid and relevant entries in the input items of the form. Therefore, to generate such synthetic queries, a Web crawler has to determine what to place in the various input items appearing in a form. There are difficulties, however, in determining what to place in the various input items.
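As a non-limiting illustration, the encoding of form input items into a URL or an HTTP message body can be sketched as follows. The sketch assumes the standard URL encoding that browsers apply to simple forms; the function names build_get_query and build_post_body, the field names author and format, and the www.example.com addresses are hypothetical.

```python
from urllib.parse import urlencode, urljoin


def build_get_query(page_url, form_action, field_values):
    """Encode the form fields into a URL, as for a form submitted with GET."""
    action_url = urljoin(page_url, form_action)
    return action_url + "?" + urlencode(field_values)


def build_post_body(field_values):
    """Encode the form fields into an HTTP message body, as for a POST form."""
    return urlencode(field_values).encode("ascii")


# Hypothetical entries that a user might place in the input items of a form.
fields = {"author": "Jane Austen", "format": "paperback"}

query_url = build_get_query(
    "http://www.example.com/books/index.html", "/search", fields
)
post_body = build_post_body(fields)
```

The encoding step itself is mechanical. The difficulty addressed in the following paragraphs lies in choosing the contents of the fields dictionary.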
Generally, there are two main types of input items appearing in a form: selection items (pulldown menus, check boxes, radio buttons, etc.) and text entries. While it is possible for a Web crawler to compute all possible combinations of selection items and produce an exhaustive list of alternatives, this results in a very inefficient method for content access. Furthermore, the Web site hosting the content may cut the Web crawler off after noticing the onslaught of crawler accesses.
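The growth in the number of alternatives can be illustrated with the following sketch, which enumerates every combination of selection items for a hypothetical form. As a simplifying assumption, each selection item is treated as admitting exactly one choice; the form contents and the names FORM, count_combinations and enumerate_combinations are invented for this example.

```python
from itertools import product
from math import prod


def count_combinations(selection_items):
    """Number of distinct submissions if each item takes exactly one value."""
    return prod(len(options) for options in selection_items.values())


def enumerate_combinations(selection_items):
    """Yield every combination of selection values as a field dictionary."""
    names = list(selection_items)
    for values in product(*(selection_items[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical form with three selection items.
FORM = {
    "category": ["fiction", "history", "science"],
    "format": ["hardcover", "paperback"],
    "in_stock": ["yes", "no"],
}

total = count_combinations(FORM)
first = next(enumerate_combinations(FORM))
```

Because the count is the product of the option counts, it grows multiplicatively with each added selection item: under the same assumption, a form with five pulldown menus of twenty options each would call for 3,200,000 submissions.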
Text entries present a related but different problem. The Web crawler has little or no idea what to enter as text, since the form itself gives little or no information (e.g., data type, valid values, meaning of the variable, expected outcome, etc.) that could be used for such a determination. Text entries can be used for entering personal information such as usernames and addresses, but most commonly they are used for entering free-text queries (e.g., searching Amazon.com's book database by author name).
Therefore, to generate synthetic queries for a Web database, a Web crawler needs an understanding of the form variables for the database. Further, to extract data efficiently from a Web database, a Web crawler must issue intelligent queries rather than indiscriminate combinations that may not have any relevance. What is needed, then, is a Web crawler that not only accesses content contained in a Web database, but that accesses it by generating realistic data for the form front-end, in order to be able to access the largest possible fraction of the database behind the form. More generally, what is needed is a Web crawler that efficiently mimics a real user's interaction with a Web site to automatically access the largest possible amount of content not available via static hyperlinks.