This invention relates to the retrieval and interpretation of disparate semistructured information from diverse sources, and particularly to the retrieval of information from form representations. The invention is especially useful in the extraction of information from public and semipublic databases through worldwide information sources, as facilitated by the Internet.
The Internet provides avenues for worldwide communication of information, ideas and messages. Although the Internet has been utilized by academia for decades, recently public interest has turned to the Internet and the information made available by it. The World Wide Web (or "the Web") accounts for a significant part of the growth in the popularity of the Internet, due in part to the user-friendly graphical user interfaces ("GUIs") that are readily available for accessing the Web.
The World Wide Web makes hypertext documents available to users over the Internet. A hypertext document does not present information linearly like a book, but instead provides the reader with links or pointers to other locations so that the reader may jump from one location to another. The hypertext documents on the Web are written in the Hypertext Markup Language ("HTML").
As the popularity of the World Wide Web grows, so too does the wealth of information it provides. Accordingly, there may be many sites and pages on the World Wide Web that contain information a user is seeking. However, the Web itself contains no built-in mechanism for searching for information of interest, and without a searching mechanism, finding sites of interest is like finding a needle in a haystack. To fill this gap, a number of web sites (e.g., Yahoo, Alta Vista and Excite) enable users to perform simple keyword searches.
Although keyword searches are adequate for many applications, they fail miserably for many others. For example, there are numerous web sites that include multiple entries or lists of job openings, houses for sale, and the like. Keyword searches are inadequate to search these sites for many reasons. Keyword searches invariably turn up information that, although matching the keywords, is not of interest. This problem may be alleviated somewhat by narrowing the search parameters, but this has the attendant risk of missing information of interest. Additionally, the search terms supported may not allow identification of information of interest. As an example, a keyword search query may not be able to specify that only job listings requiring fewer than three years of experience in computer programming are to be found.
Ideally, information like job listings on multiple web sites would appear as a single relational database so that relational database queries could be utilized to find information of interest. However, there is no standard for the structure of information like job listings on the Web. This problem was addressed in a co-owned, co-pending U.S. patent application Ser. No. 08/724,943, in the name of Ashish Gupta, et al., entitled "Method and Apparatus for Structuring the Querying and Interpretation of Semistructured Information," which introduced the concept of "Wrappers" for retrieving and interpreting information from disparate semistructured information sources. Wrappers are programs that interact with web sites to obtain information stored at the web site and then structure it according to a prespecified schema. Therefore, a wrapper needs to be able to "access" web sites in much the same way as a web browser does.
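By way of illustration only, the following minimal sketch shows how listings from two differently formatted sources might be structured according to one prespecified schema and then queried relationally. It is not the implementation of the referenced application. The page formats, the schema columns (title, location, min_years), and the names wrap_site_a, wrap_site_b and query_jobs are all hypothetical assumptions made for this example.

```python
import re
import sqlite3

# Hypothetical prespecified schema shared by every wrapper in this sketch.
SCHEMA_SQL = "CREATE TABLE jobs (title TEXT, location TEXT, min_years INTEGER)"


def wrap_site_a(page):
    """Assumed format, one listing per line: 'Title | Location | N yrs'."""
    rows = []
    for line in page.strip().splitlines():
        title, location, exp = (part.strip() for part in line.split("|"))
        rows.append((title, location, int(re.search(r"\d+", exp).group())))
    return rows


def wrap_site_b(page):
    """Assumed format: 'location: X; experience: N years; position: T'."""
    rows = []
    for line in page.strip().splitlines():
        rec = dict(item.strip().split(": ", 1) for item in line.split(";"))
        years = int(rec["experience"].split()[0])
        rows.append((rec["position"], rec["location"], years))
    return rows


def query_jobs(sources, sql):
    """Load every wrapped source into one in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA_SQL)
    for wrapper, page in sources:
        conn.executemany("INSERT INTO jobs VALUES (?, ?, ?)", wrapper(page))
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


# Made-up sample pages standing in for two semistructured sources.
SITE_A = "C Programmer | Palo Alto | 2 yrs\nDatabase Architect | Boston | 7 yrs"
SITE_B = (
    "location: Austin; experience: 1 years; position: Web Developer\n"
    "location: Denver; experience: 5 years; position: Systems Analyst"
)

if __name__ == "__main__":
    sources = [(wrap_site_a, SITE_A), (wrap_site_b, SITE_B)]
    sql = "SELECT title FROM jobs WHERE min_years < 3 ORDER BY title"
    for (title,) in query_jobs(sources, sql):
        print(title)
```

In this sketch, the query for listings requiring fewer than three years of experience, which a keyword search could not express, becomes an ordinary relational condition once each source has been mapped onto the common schema.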
Forms are an increasingly popular way for web sites to present their interfaces. Forms typically consist of one or more fields that need to be filled in with values. The fields that are displayed in forms may be represented in different ways, for example, as pull-down menus, select lists, check boxes or fill-in text boxes. To obtain information from the web site, a user fills in values for each field and submits the form in order to receive a resulting response page. Thus, it is necessary that wrappers be able to interact with sites that present form interfaces.
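The form interaction described above may be sketched, purely as an example, as follows. The sketch reads the fields of an HTML form, fills in values for some of them, and encodes the submission that a browser would send. It uses only the Python standard library and performs no network access. The sample form, its field names, and the names FormFieldParser and build_submission are hypothetical and are not taken from any particular web site or from the invention itself.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collects the action, field names and default values of an HTML form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.known = set()   # every named field seen in the form
        self.fields = {}     # field name -> value submitted by default
        self.options = {}    # select name -> list of offered values
        self._select = None  # name of the select list currently open

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self.action = a.get("action")
        elif tag == "input" and "name" in a:
            kind = (a.get("type") or "text").lower()
            if kind in ("submit", "reset", "button"):
                return
            self.known.add(a["name"])
            if kind in ("checkbox", "radio"):
                # Check boxes are submitted only when checked.
                if "checked" in a:
                    self.fields[a["name"]] = a.get("value") or "on"
            else:
                self.fields[a["name"]] = a.get("value") or ""
        elif tag == "select" and "name" in a:
            self._select = a["name"]
            self.known.add(self._select)
            self.options[self._select] = []
        elif tag == "option" and self._select is not None:
            value = a.get("value")
            if value is not None:
                self.options[self._select].append(value)
                # Default is the first option unless one is marked selected.
                if "selected" in a or self._select not in self.fields:
                    self.fields[self._select] = value

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def build_submission(form_html, values):
    """Return (action, encoded query) after filling in the given values."""
    parser = FormFieldParser()
    parser.feed(form_html)
    unknown = set(values) - parser.known
    if unknown:
        raise KeyError("fields not present in form: %s" % sorted(unknown))
    filled = dict(parser.fields)
    filled.update(values)
    return parser.action, urlencode(filled)


# Made-up form with a text box, a pull-down menu and a check box.
SAMPLE_FORM = """
<form action="/search" method="get">
  <input type="text" name="title" value="">
  <select name="state">
    <option value="CA">California</option>
    <option value="NY" selected>New York</option>
  </select>
  <input type="checkbox" name="fulltime" value="yes" checked>
  <input type="submit" value="Go">
</form>
"""

if __name__ == "__main__":
    action, query = build_submission(SAMPLE_FORM, {"title": "programmer"})
    print(action + "?" + query)
```

A wrapper built along these lines would then request the resulting address (or post the encoded values) and interpret the response page. That retrieval step is omitted here.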
What is needed is a method for gathering data around forms and other barriers, enabling wrappers to extract information from form-based web sites.