A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
This invention relates to structured information retrieval and interpretation from disparate semistructured information resources. A particular application of the invention is extraction of information from public and semipublic databases through worldwide information sources, as facilitated by the Internet.
The Internet provides avenues for worldwide communication of information, ideas and messages. Although the Internet has been utilized by academia for decades, recently public interest has turned to the Internet and the information made available by it. The World Wide Web (or “the Web”) accounts for a significant part of the growth in the popularity of the Internet, due in part to the user-friendly graphical user interfaces (“GUIs”) that are readily available for accessing the Web.
The World Wide Web makes hypertext documents available to users over the Internet. A hypertext document does not present information linearly like a book, but instead provides the reader with links or pointers to other locations so that the user may jump from one location to another. The hypertext documents on the Web are written in the Hypertext Markup Language (“HTML”).
As the popularity of the World Wide Web grows, so too does the wealth of information it provides. Accordingly, there may be many sites and pages on the World Wide Web that contain information a user is seeking. However, the Web contains no built-in mechanism for searching for information of interest. Without a searching mechanism, finding sites of interest would be like finding a needle in a haystack. Fortunately, there exist a number of web sites (e.g., YAHOO, ALTA VISTA, EXCITE, etc.) that allow users to perform relatively simple keyword searches.
Although keyword searches are adequate for many applications, they fail miserably for many others. For example, there are numerous web sites that include multiple entries or lists on job openings, houses for sale, and the like. Keyword searches are inadequate to search these sites for many reasons. Keyword searches invariably turn up information that, although matching the keywords, is not of interest. This problem may be alleviated somewhat by narrowing the search parameters, but this has the attendant risk of missing information of interest. Additionally, the search terms supported may not allow identification of information of interest. As an example, one may not be able to specify in a keyword search query to find job listings that require less than three years of experience in computer programming.
Ideally, it would be desirable if information like job listings on multiple web sites could appear as a single relational database so that relational database queries could be utilized to find information of interest. However, there is no standard for the structure of information like job listings on the Web. This problem was addressed in co-owned U.S. Pat. No. 5,826,258, in the name of Ashish Gupta et al., entitled “Method and Apparatus for Structuring the Querying and Interpretation of Semistructured Information,” which introduced the concept of “Wrappers” for retrieving and interpreting information from disparate semistructured information resources. Wrappers are programs that interact with web sites to obtain information stored in the web site and then to structure it according to a prespecified schema. In copending U.S. patent application Ser. No. 10/000,235, in the name of Ashish Gupta et al., entitled “Method for Creating an Information Closure Model,” methods for forming the information closure of information gathered by a wrapper are disclosed. However, the methods for formulating extractors, field objects and inheritance hierarchies in a wrapper framework of the present invention are heretofore not known in the art.
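By way of illustration only, the following sketch shows the kind of relational query that becomes possible once listings from several web sites have been structured under a common schema, such as the earlier example of job listings requiring less than three years of programming experience. The table name `job_listings`, its columns, the site names and the sample rows are hypothetical and are not taken from the referenced patents. Python and its built-in `sqlite3` module are used merely for convenience.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of hypothetical, already-structured job listings."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_listings "
        "(site TEXT, title TEXT, skill TEXT, min_years INTEGER)"
    )
    conn.executemany(
        "INSERT INTO job_listings VALUES (?, ?, ?, ?)",
        [
            ("site-a.example", "Junior Programmer", "computer programming", 1),
            ("site-b.example", "Senior Engineer", "computer programming", 5),
            ("site-b.example", "Programmer Analyst", "computer programming", 2),
            ("site-c.example", "Staff Accountant", "accounting", 1),
        ],
    )
    return conn


def find_listings(conn, skill, max_years):
    """Return (site, title, min_years) rows for `skill` needing fewer than `max_years` years."""
    cursor = conn.execute(
        "SELECT site, title, min_years FROM job_listings "
        "WHERE skill = ? AND min_years < ? "
        "ORDER BY min_years, site",
        (skill, max_years),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    for row in find_listings(build_demo_db(), "computer programming", 3):
        print(row)
```

A numeric condition such as `min_years < 3` cannot be expressed in a simple keyword search, which is why a common schema across sites is desirable.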
What is needed is a method of formulating extractors, field objects and inheritance hierarchies for retrieving and interpreting information from semistructured resources for incorporation into a relational database.
According to the invention, a system is provided for extracting information from a semistructured information source. The system includes a listing stack for holding extracted information. A means for matching at least one extractor to the semistructured information to return a list of potential matches is also included. The system can also include a means for iterating through the list of potential matches and a means for retrieving information from a particular match in the list of potential matches. A means for adding a particular match into the listing stack can also be part of the system.
In another aspect of the present invention, a method for extracting information from a semistructured information source into a listing stack is provided. The step of matching at least one extractor to the semistructured information in order to return a list of potential matches is included in the method. A step of iterating through the list of potential matches can also be part of the method. Information from a particular match in the list of potential matches can be retrieved in another step. The method can also include a step of adding a particular match into the listing stack. Combinations of these steps can extract information from a semistructured information source.
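The steps summarized above may be sketched, purely as a non-limiting illustration, as follows. The sketch assumes that an extractor can be modeled as a named regular expression and that the listing stack can be modeled as a simple list. The `Extractor` class, the `extract` function, the sample markup and the rule of discarding matches with empty fields are all hypothetical and are not drawn from the specification.

```python
import re
from dataclasses import dataclass


@dataclass
class Extractor:
    """Hypothetical extractor: a named regular expression with named groups."""
    name: str
    pattern: re.Pattern

    def match(self, source):
        # Match the extractor against the source and return all potential matches.
        return list(self.pattern.finditer(source))


def extract(source, extractors):
    """Apply each extractor to `source` and collect accepted matches in a listing stack."""
    listing_stack = []  # holds the extracted information
    for extractor in extractors:
        potential_matches = extractor.match(source)
        # Iterate through the list of potential matches.
        for candidate in potential_matches:
            # Retrieve the information carried by this particular match.
            info = {key: (value or "").strip()
                    for key, value in candidate.groupdict().items()}
            # Assumed acceptance rule: keep only matches with no empty field.
            if all(info.values()):
                # Add the particular match into the listing stack.
                listing_stack.append({"extractor": extractor.name, **info})
    return listing_stack


# Hypothetical semistructured source: an HTML-like list of job postings.
SAMPLE = (
    "<li>Junior Programmer | 1 years</li>\n"
    "<li> | 4 years</li>\n"
    "<li>Database Administrator | 3 years</li>"
)

JOB_EXTRACTOR = Extractor(
    name="job",
    pattern=re.compile(r"<li>(?P<title>[^|<]*)\|\s*(?P<years>\d+) years</li>"),
)

if __name__ == "__main__":
    for listing in extract(SAMPLE, [JOB_EXTRACTOR]):
        print(listing)
```

In this sketch the second entry of the sample is returned as a potential match but is not added to the listing stack, because its title field is empty.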
Numerous benefits are achieved by way of the present invention over conventional Web search techniques, by enabling the use of a relational database to organize information obtained from a semistructured source, such as Web pages on the World Wide Web. In some embodiments, the present invention is easier to use than conventional user interfaces. The present invention can provide a way to automatically propagate information to related tuples. Some embodiments according to the invention are easier for new users to learn than known techniques. The present invention enables data mining to be accomplished using a relational database. These and other benefits are described throughout the present specification.
A further understanding of the nature and advantages of the inventions herein may be realized by reference to the remaining portions of the specification and the attached drawings.