A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
This invention relates to structured information retrieval and interpretation from disparate semistructured information resources. A particular application of the invention is extraction of information from public and semipublic databases through worldwide information sources, as facilitated by the Internet.
The Internet provides avenues for worldwide communication of information, ideas and messages. Although the Internet has been utilized by academia for decades, recently public interest has turned to the Internet and the information made available by it. The World Wide Web (or "the Web") accounts for a significant part of the growth in the popularity of the Internet, due in part to the user-friendly graphical user interfaces ("GUIs") that are readily available for accessing the Web.
The World Wide Web makes hypertext documents available to users over the Internet. A hypertext document does not present information linearly like a book, but instead provides the reader with links or pointers to other locations so that the user may jump from one location to another. The hypertext documents on the Web are written in the Hypertext Markup Language ("HTML").
As the popularity of the World Wide Web grows, so too does the wealth of information it provides. Accordingly, there may be many sites and pages on the World Wide Web that contain information a user is seeking. However, the Web contains no built-in mechanism for searching for information of interest. Without a searching mechanism, finding sites of interest would be like finding a needle in a haystack. Fortunately, there exist a number of web sites (e.g., YAHOO, ALTA VISTA, EXCITE, etc.) that allow users to perform relatively simple keyword searches.
Although keyword searches are adequate for many applications, they fail miserably for many others. For example, there are numerous web sites that include multiple entries or lists on job openings, houses for sale, and the like. Keyword searches are inadequate to search these sites for many reasons. Keyword searches invariably turn up information that, although matching the keywords, is not of interest. This problem may be alleviated somewhat by narrowing the search parameters, but this has the attendant risk of missing information of interest. Additionally, the search terms supported may not allow identification of information of interest. As an example, one may not be able to specify in a keyword search query to find job listings that require less than three years of experience in computer programming.
Ideally, information like job listings on multiple web sites would appear as a single relational database so that relational database queries could be utilized to find information of interest. However, there is no standard for the structure of information like job listings on the Web. This problem was addressed in co-owned U.S. Pat. No. 5,826,258, in the name of Ashish Gupta, et al., entitled "Method and Apparatus for Structuring the Querying and Interpretation of Semistructured Information," which introduced the concept of "Wrappers" for retrieving and interpreting information from disparate semistructured information resources. Wrappers are programs that interact with web sites to obtain information stored in the web site and then structure it according to a prespecified schema. In copending U.S. patent application Ser. No. 10/000,743, in the name of Ashish Gupta, et al., entitled "Method and Apparatus for Creating Extractors, Field Information Objects and Inheritance Hierarchies in a Framework for Retrieving Semistructured Information," methods for obtaining information using wrappers are disclosed. However, these methods do not teach the information closure techniques of the present invention.
What is needed is a method of forming an information closure from related tuples of information for incorporation into a relational database.
According to the invention, a method is provided for forming an information closure of a plurality of rows in a linkage stack built by a wrapper program for accessing semistructured information. The method includes removing a first row from the linkage stack and computing a cross product of the fields in the first row. The method can also include adding this cross product to a list of accepted rows. For each remaining row in the linkage stack, the method includes computing a selective cross product in a plurality of steps. In one step, a result is initialized to empty. Then, for each row in the list of accepted rows, a first new row is determined from the accepted row, extended with the non-empty fields of the remaining row. The method can also include determining a second new row from the remaining row, extended with the non-empty fields of the accepted row. Thereupon, the two new rows can be added to the result. Repeating the determining steps and the adding step for all rows in the list of accepted rows, and removing any identical rows from the result, can provide an information closure.
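The steps summarized above might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the specification. The schema in FIELDS, the function names, and the sample data are hypothetical. The sketch further assumes that each linkage stack row maps a field name to a list of values, that "extending" a row fills only its empty fields and never overwrites a value already present, that each remaining row is expanded with the same cross product as the first row, and that the de-duplicated result becomes the new list of accepted rows. The summary does not spell out any of these points.

```python
from itertools import product

# Hypothetical schema; the summary does not name any fields.
FIELDS = ("company", "title", "city")


def cross_product(row):
    """Expand a row whose fields hold lists of values into single-valued rows.

    An empty or missing field contributes a single None, so the product
    is never empty.
    """
    choices = [row.get(field) or [None] for field in FIELDS]
    return [tuple(combo) for combo in product(*choices)]


def extend(base, other):
    """Copy `base`, filling its empty fields from `other` (assumed semantics)."""
    return tuple(b if b is not None else o for b, o in zip(base, other))


def selective_cross_product(accepted, remaining):
    """Pair every accepted row with one single-valued remaining row."""
    result = []  # the result is initialized to empty
    for accepted_row in accepted:
        result.append(extend(accepted_row, remaining))  # first new row
        result.append(extend(remaining, accepted_row))  # second new row
    # Remove identical rows while keeping first-seen order.
    return list(dict.fromkeys(result))


def information_closure(linkage_stack):
    """Form the closure of all rows in the stack (first row is index 0)."""
    stack = list(linkage_stack)
    accepted = cross_product(stack.pop(0))
    for row in stack:
        # Assumption: each remaining row is flattened like the first row,
        # and the result replaces the list of accepted rows.
        for flat_row in cross_product(row):
            accepted = selective_cross_product(accepted, flat_row)
    return accepted


# Hypothetical sample data: one company with two cities, then a job title.
example_stack = [
    {"company": ["Acme"], "city": ["Boston", "Austin"]},
    {"title": ["Engineer"]},
]
closure = information_closure(example_stack)
```

With the sample data, the title from the second row is propagated to both city rows derived from the first row, so each resulting tuple carries the company, title, and city together. When the accepted row and the remaining row both hold a value for the same field, the two new rows differ, which is why both are kept before duplicates are removed.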
By enabling the use of a relational database to organize information obtained from a semistructured source, such as Web pages on the World Wide Web, the present invention achieves numerous benefits over conventional Web search techniques. In some embodiments, the present invention is easier to use than conventional user interfaces. The present invention can provide a way to automatically propagate information to related tuples. Some embodiments according to the invention are easier for new users to learn than known techniques. The present invention enables data mining to be accomplished using a relational database. These and other benefits are described throughout the present specification.
A further understanding of the nature and advantages of the inventions herein may be realized by reference to the remaining portions of the specification and the attached drawings.