It is often useful to extract the information contained in HTML pages in a form that a computer program can analyze and reformat for further use. One such use is an automated agent that extracts the relevant information and stores it for data-mining purposes. For example, a program might be devised that monitors the movies being screened at various locations. Such a program needs to extract from the relevant HTML page the titles of the movies, the theaters where they are shown, and the times at which they are screened. Another example is a program that extracts the information contained in an HTML page in order to display it on a device other than a computer screen, e.g., the screen of a hand-held device. Since hand-held devices have much smaller screens than typical desktop computers, it is necessary to extract only the relevant information, filter out all the rest, and reformat the extracted information in a form suitable for display on the hand-held device. In the movie example above, the list of theaters in each location, the movies shown at each theater, and the screening times are the relevant information, and all the rest of the material in the HTML page, e.g., promotions, discussions, etc., needs to be filtered out. Furthermore, the extracted information needs to be structured so that the relationship between theaters, movies, and show times is explicit, allowing menus to be generated that let the user navigate the screens to find, for example, the show times of a given movie at a given theater.
Therefore, as well as filtering out irrelevant data from the HTML page, it is necessary to structure the extracted information in such a way that the underlying relationship between the various items of data is made explicit. For example, it is not enough to extract the names of theaters, the titles of movies, and the show times. The resulting data structure must also make explicit the relationship between theaters, movies, and show times, i.e., which movies are shown at each theater, and which show times apply to which movie at which theater.
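The kind of explicit structure described above can be illustrated with a minimal sketch. It is not the claimed method; the class names `Theater` and `Movie`, the helper `show_times_for`, and all of the sample data are hypothetical and were chosen only to show how nesting makes the theater, movie, and show time relationship explicit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Movie:
    """A movie as shown at one particular theater (hypothetical type)."""
    title: str
    show_times: List[str] = field(default_factory=list)


@dataclass
class Theater:
    """A theater together with the movies it is showing (hypothetical type)."""
    name: str
    movies: List[Movie] = field(default_factory=list)


def show_times_for(theaters: List[Theater], theater_name: str, movie_title: str) -> List[str]:
    """Return the show times of one movie at one theater, or [] if not found."""
    for theater in theaters:
        if theater.name != theater_name:
            continue
        for movie in theater.movies:
            if movie.title == movie_title:
                return movie.show_times
    return []


# Invented sample data, for illustration only.
listings = [
    Theater("Theater A", [
        Movie("Movie X", ["19:00", "21:30"]),
        Movie("Movie Y", ["20:15"]),
    ]),
    Theater("Theater B", [
        Movie("Movie X", ["18:45"]),
    ]),
]
```

Because each show time is stored under a movie, and each movie under a theater, a menu for a small display can be generated by walking the structure one level at a time. A flat list of names, titles, and times would not support the same lookup.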
There is a need for a process whereby an HTML file belonging to a pre-determined class of HTML files can be transformed into an instance tree that contains all the relevant extracted information, and that makes explicit the internal structure of the data. The related art is represented by the following patents of interest.
U.S. Pat. No. 5,079,700, issued on Jan. 7, 1992 to Michael J. Kozol et al., describes a method for copying a marked portion of a structured document so as to prevent damaging the structure of the document at the target location where the contents of the mark are to be inserted. The Kozol et al. '700 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,113,341, issued on May 12, 1992 to Michael J. Kozol et al., describes a method for hierarchically expanding and contracting element marks in a structured document. The Kozol et al. '341 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,140,521, issued on Aug. 18, 1992 to Michael J. Kozol et al., describes a method for deleting a marked portion of a structured document so as to prevent damaging the structure of the document. The Kozol et al. '521 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,276,793, issued on Jan. 4, 1994 to Kenneth W. Borgendale et al., describes a method and apparatus for editing a structured document to preserve the intended appearance of document elements. The Borgendale et al. patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,530,852, issued on Jun. 25, 1996 to Carl F. Meske, Jr. et al., describes a method for extracting profiles and topics from a first file written in a first markup language and generating files in different markup languages containing the profiles and topics for use in accessing data described by the profiles and topics. The Meske, Jr. et al. '852 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,557,720, issued on Sep. 17, 1996 to Allen L. Brown, Jr. et al., describes a method for determining whether a document tree is weakly valid. The Brown, Jr. et al. patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,557,722, issued on Sep. 17, 1996 to Steven DeRose et al., describes a data processing system and method for representing and generating a representation of, and random access rendering of, electronic documents. The DeRose et al. '722 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,644,776, issued on Jul. 1, 1997 to Steven DeRose et al., describes a data processing system and method for random access formatting of a portion of a large hierarchical electronically published document with descriptive markup. The DeRose et al. '776 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,649,186, issued on Jul. 15, 1997 to Gregory J. Ferguson, describes a system and computer-based method for providing a dynamic information clipping service. The Ferguson patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,671,416, issued on Sep. 23, 1997 to David Elson, describes a method and apparatus for searching and modifying source code of a computer program. The Elson patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,680,619, issued on Oct. 21, 1997 to Norman K. Gudmundson et al., describes an application development system that enables its users to create reusable “object containers” merely by defining links among instantiated objects. The Gudmundson et al. '619 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,708,806, issued on Jan. 13, 1998 to Steven DeRose et al., describes a data processing system and method for generating a representation of an electronic document, for indexing the electronic document to generate the representation for navigating the electronic document using its representation and for displaying the electronic document, formatted according to a style sheet, on an output device. The DeRose et al. '806 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,784,608, issued on Jul. 21, 1998 to Carl F. Meske, Jr. et al., describes a system and computer-implemented method for retrieving hypertext information using profiles and topics. The Meske, Jr. et al. '608 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,794,006, issued on Aug. 11, 1998 to David S. Sanderman, describes an on-line content editing system which operates as an extension of a computer's operating system to provide a graphical interface which displays system operator editing menus. The Sanderman patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,794,704, issued on May 25, 1999 to Norman K. Gudmundson et al., describes an on-line content editing system which operates as an extension of a computer's operating system to provide a graphical interface which displays system operator editing menus. The Gudmundson et al. '704 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,923,738, issued on Jul. 13, 1999 to Raymond A. Cardillo IV et al., describes a screen-display telephone terminal for interfacing with the Internet. The Cardillo IV et al. '738 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,926,823, issued on Jul. 20, 1999 to Yo Okumura et al., describes a document generic logical information editing apparatus for editing document generic logical information for document editing purposes in such a manner that the arrangements for designating automatic document editing processes such as search, manipulation, and composition of document elements are simplified using the edited information; that the procedures for transferring and removing unnecessary data are eliminated; and that users' chores associated with extra tasks of such data handling are alleviated. The Okumura et al. patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,930,341, issued on Jul. 27, 1999 to Raymond A. Cardillo IV et al., describes a browser device and method for interfacing screen-display telephone terminals with the Internet. The Cardillo IV et al. '341 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,937,041, issued on Aug. 10, 1999 to Raymond A. Cardillo IV et al., describes a system and method for interfacing screen-display telephone terminals with the Internet. The Cardillo IV et al. '041 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,953,322, issued on Sep. 14, 1999 to Robert H. Kimball, describes a cellular telephone that provides the capability of performing Internet telephone calls. The Kimball patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,953,732, issued on Sep. 14, 1999 to Carl F. Meske, Jr. et al., describes a system and computer-implemented method for retrieving hypertext information using profiles and topics. The Meske, Jr. et al. '732 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,970,490, issued on Oct. 19, 1999 to Matthew Morganstern, describes a method for integrating heterogeneous data embodied in computer readable media having source data and target data. The method includes providing an interoperability assistant module with specifications for transforming the source data, transforming the source data into a common intermediate representation of the data using the specifications, and transforming the intermediate representation of the data into a specialized target representation using the specifications. The Morganstern patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,978,579, issued on Nov. 2, 1999 to Jeffrey J. Buxton et al., describes a component customization and distribution system in an object-oriented environment that provides a template builder utility which enables a base component to be selectively modified and the modifications to the base component stored as a template. The Buxton et al. '579 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 5,983,248, issued on Nov. 9, 1999 to Steven DeRose et al., describes a data processing system and method for generating a representation of an electronic document, for indexing the electronic document, for navigating the electronic document using its representation and for displaying the electronic document on an output device. The DeRose et al. '248 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 6,041,331, issued on Mar. 21, 2000 to Michael L. Weiner et al., describes a method for extracting information from a plurality of documents for display. The Weiner et al. patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 6,065,024, issued on May 16, 2000 to David S. Renshaw, describes a method and apparatus for realizing embedded HTML documents. The Renshaw patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 6,081,815, issued on Jun. 27, 2000 to Kim L. Spitznagel et al., describes a method for processing a hyperlink formatted message to make it compatible with an alphanumeric messaging device that lacks hyperlink decoding capability. The Spitznagel et al. patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 6,083,276, issued on Jul. 4, 2000 to Harold R. Davidson et al., describes a method and apparatus for creating and configuring a component-based application through a simple, XML-compliant, text based document. The Davidson et al. patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
U.S. Pat. No. 6,093,215, issued on Jul. 25, 2000 to Jeffrey J. Buxton et al., describes a component customization and distribution system in an object-oriented environment that provides a template builder utility which enables a base component to be selectively modified and the modifications to the base component stored as a template. The Buxton et al. '215 patent does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
European Patent document 0 539 120 A1, published on Apr. 28, 1993, describes an apparatus for discovering information about the source code of a computer program. The European '120 patent document does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
European Patent document 0 718 783 A1, published on Jun. 26, 1996, describes a system and computer-implemented method for retrieving hypertext information using profiles and topics. The European '783 patent document does not suggest a method and apparatus for extracting structured data from HTML pages according to the claimed invention.
None of the above inventions and patents, taken either singly or in combination, is seen to describe the instant invention as claimed.