1. Field of the Invention
The present invention generally relates to a method of extracting information from structured documents such as HTML documents or the like, and particularly relates to an information extraction method that identifies and extracts a desired text portion selected in advance from daily updated structured documents. Further, the present invention relates to a user interface by which a desired portion can readily be selected in a structured document.
2. Description of the Related Art
There is a need for a means to select a particular portion from a structured document, such as an HTML (HyperText Markup Language) document or the like, that is updated daily. For example, a user may wish to select portions of particular interest from Web pages that the user is familiar with, and put these portions together to create a collection of information that allows the user to readily view only the necessary information. When the source of the collected information is updated daily, the selected portion needs to be identified anew in each updated version of the document for use in the collection.
Japanese Patent No. 2867986 directed to a WWW information extraction system teaches storing information indicative of a start point and an end point of a portion selected in advance. Based on this information, the start point and the end point are identified in the updated document, followed by extracting the portion existing between these two points as the selected portion. For example, texts corresponding to the start point and the end point, respectively, of the selected portion are stored in memory. When extracting the selected portion from the document, the stored texts are used to identify the start point and the end point in the HTML document, followed by extracting the identified portion.
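The start-point and end-point approach described above can be illustrated with a minimal sketch. The function name `extract_between`, the sample markup, and the choice to include the marker texts in the extracted portion are assumptions made for illustration only; they are not taken from the cited patent.

```python
def extract_between(document, start_text, end_text):
    """Return the span from start_text through end_text, or None if either is absent.

    The stored marker texts are treated as part of the selected portion
    (an assumption of this sketch).
    """
    start = document.find(start_text)
    if start == -1:
        return None
    # Search for the end marker only after the start marker.
    end = document.find(end_text, start + len(start_text))
    if end == -1:
        return None
    return document[start:end + len(end_text)]


# Marker texts stored when the user selected the portion (hypothetical).
START = "<h2>Local News</h2>"
END = "<hr>"

updated = "<p>Top</p><h2>Local News</h2><p>Road reopened.</p><hr><p>Ads</p>"
portion = extract_between(updated, START, END)
```

In this sketch the lookup succeeds only while both stored texts still appear verbatim in the updated document; otherwise the function returns None.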
A system proposed by webMethods corporation (http://www.w3.org/TR/NOTE-widl) and a system proposed by Luca Iocchi (Luca Iocchi: The Web-OEM approach to Web information extraction, Journal of Network and Computer Applications, Vol. 22, pp. 259-269 (1999)) approach this issue by converting an HTML document into a tree structure, storing information about a partial tree corresponding to a portion selected in advance, and identifying a portion of the updated document that corresponds to the stored partial tree. Here, the information about a partial tree consists of a character string serving as an identifier of the selected portion. A tag name is used as a tag identifier, and tags having the same name at the same hierarchical level in the tree structure are distinguished by respective numerical value indexes. The tag names paired with the numerical value indexes are connected in series to form a character string that represents the path from the root of the whole tree to the root of the partial tree corresponding to the selected portion. In an example of FIG. 1, “doc” is regarded as the root of the whole tree, and the identifier that points to the selected portion “local news” is represented as “doc.table[0].table[0]”.
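The identifier scheme described above may be sketched as follows. This is an illustrative reading only, not the implementation of either cited system: the function names `build_identifiers` and `resolve` are hypothetical, the sample markup is a simplified well-formed stand-in rather than the document of FIG. 1, and the indexes are assumed to count only siblings bearing the same tag name.

```python
import re
import xml.etree.ElementTree as ET


def build_identifiers(root):
    """Map every element to a dotted identifier such as 'doc.table[0].table[0]'."""
    ids = {root: root.tag}
    stack = [root]
    while stack:
        parent = stack.pop()
        counts = {}  # per-tag-name index among the children of this parent
        for child in parent:
            n = counts.get(child.tag, 0)
            counts[child.tag] = n + 1
            ids[child] = f"{ids[parent]}.{child.tag}[{n}]"
            stack.append(child)
    return ids


def resolve(root, identifier):
    """Return the element the identifier points to, or None if it no longer matches."""
    parts = identifier.split(".")
    if parts[0] != root.tag:
        return None
    node = root
    for part in parts[1:]:
        m = re.fullmatch(r"(\w+)\[(\d+)\]", part)
        if m is None:
            return None
        same_name = node.findall(m.group(1))  # direct children with this tag name
        index = int(m.group(2))
        if index >= len(same_name):
            return None
        node = same_name[index]
    return node


# Hypothetical, simplified markup.
root = ET.fromstring(
    "<doc><table><table>local news</table><table>weather</table></table>"
    "<p>footer</p></doc>"
)
selected = root[0][0]                       # the element the user selected
stored = build_identifiers(root)[selected]  # identifier saved at selection time
found = resolve(root, stored)               # later lookup using the stored string
```

Here `stored` is the only information kept about the selection, so a later lookup depends entirely on the tree having the same shape along that path.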
In the related-art method disclosed in Japanese Patent No. 2867986 regarding the WWW information extraction system, a selected portion is extracted based on the information indicative of the start point and end point of the selected portion. It naturally follows that such information needs to be an item that always remains intact in the document after updating. It is difficult, however, to identify enduring information that is unchanged through updating. Many exceptions exist on homepages on the Internet as designs of such homepages tend to be at the designers' discretion, so that the method as described above may not be applicable to a wide range of application areas.
If texts corresponding to the start and end points are used as a clue in the WWW information extraction system, these texts themselves may be subjected to updating as shown in FIG. 2. In such a case, this method fails.
Further, if a selected portion is extracted as shown in FIG. 3A by this method, the extracted portion does not constitute a proper partial tree as a tree structure, an example of which is shown in FIG. 3B. Because of this, difficulties would be encountered if an attempt is made to make use of this extracted portion in another structured document.
The method utilizing the identifier of a partial tree of a selected portion, as taught by webMethods corporation or Luca Iocchi, relies on the premise that the document structure does not change through updating. If the document structure changes even slightly through updating, the identifier of the partial tree selected in advance will no longer match the identifier after updating.
For example, a text block having the same tag as an existing tag may be inserted into the same hierarchical level of the tree structure to which the selected portion of the document belongs. This results in a numerical value index of the tag being changed in the identifier of the partial tree. In the example of FIG. 1, the document is updated by inserting the text regarding “ADVERTISEMENT 2” bracketed in table tags above the selected portion. As a result, the numerical value index of the tag identifier based on the tag name “table” in respect of the selected “local news” is changed from “table[0]” to “table[1]”. Such small format changes are likely to be made on a site top page where banners, breaking news, etc., are inserted and deleted constantly. Since sites whose information is constantly updated are the very kind of sites from which users wish to select portions, the degradation of reliability of portion identification caused by such minor updating needs to be addressed.
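The index shift described above can be reproduced with a short sketch. The function name `identifier_of` and both markup strings are hypothetical simplifications chosen for illustration; they are not the document of FIG. 1, and well-formed markup is assumed so that a strict parser can be used.

```python
import xml.etree.ElementTree as ET


def identifier_of(root, text):
    """Return the dotted identifier of the first element whose text equals `text`."""
    def walk(node, path):
        if (node.text or "").strip() == text:
            return path
        counts = {}  # per-tag-name index among this node's children
        for child in node:
            n = counts.get(child.tag, 0)
            counts[child.tag] = n + 1
            found = walk(child, f"{path}.{child.tag}[{n}]")
            if found is not None:
                return found
        return None
    return walk(root, root.tag)


before = ET.fromstring("<doc><table><table>local news</table></table></doc>")
after = ET.fromstring(
    "<doc><table><table>ADVERTISEMENT 2</table>"
    "<table>local news</table></table></doc>"
)

id_before = identifier_of(before, "local news")  # identifier stored at selection time
id_after = identifier_of(after, "local news")    # identifier of the same content after updating
```

In this sketch the inserted sibling takes over the index formerly held by the selected portion, so the stored identifier would point to the advertisement rather than to “local news”.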
When a tag that was not in existence at the time of the portion selection is inadvertently left open above the selected portion, this tag appears as a parent node relative to the selected portion. In the example of updating shown in FIG. 1, the table tag enclosing “ADVERTISEMENT 1” above the selected portion is inadvertently left open. As a consequence, an identifier that should correctly appear as “doc.table[0].table[0]” becomes “doc.table[0].table[0].table[1]”, which indicates the existence of a table tag as a parent node of the selected portion “local news”. The identifier of the partial tree therefore fails to match between before and after updating. WWW browsers widely used today permit open-ended tags, and page designers often update pages without noticing that open-ended tags are present in the pages.
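The effect of an unclosed tag can be sketched with a tolerant parser. The class `PathRecorder`, the helper `paths_for`, and both markup strings are hypothetical and deliberately simpler than FIG. 1, so the identifiers produced differ from those quoted above. The recovery rule used here (an end tag closes the nearest open tag of the same name together with anything still open inside it) is an assumption of the sketch, not the behavior of any particular browser.

```python
from html.parser import HTMLParser


class PathRecorder(HTMLParser):
    """Record a dotted identifier for each text chunk, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # entries: (tag, label, child_counts)
        self.paths = {}   # text -> identifier of the enclosing element

    def handle_starttag(self, tag, attrs):
        if self.stack:
            counts = self.stack[-1][2]
            n = counts.get(tag, 0)
            counts[tag] = n + 1
            label = f"{tag}[{n}]"
        else:
            label = tag   # the root carries no index
        self.stack.append((tag, label, {}))

    def handle_endtag(self, tag):
        # Close the nearest matching open tag and everything left open inside it.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths[text] = ".".join(label for _, label, _ in self.stack)


def paths_for(markup):
    recorder = PathRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.paths


closed = "<doc><table>ADVERTISEMENT 1</table><table>local news</table></doc>"
unclosed = "<doc><table>ADVERTISEMENT 1<table>local news</table></doc>"

id_closed = paths_for(closed)["local news"]
id_unclosed = paths_for(unclosed)["local news"]
```

With the closing tag present, the selected table is a sibling of the advertisement table; once the closing tag is omitted, the advertisement table becomes its parent, and the identifier gains one level of depth.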
Insertion of a text block having the same tag and the inadvertent omission of a closing tag together cause trouble in the example of updating of the document shown in FIG. 1. Namely, the identifier of the partial tree that points to the selected portion is changed from “doc.table[0].table[0]” to “doc.table[0].table[0].table[1]”.
The methods proposed by webMethods corporation and Iocchi have a further problem in that knowledge of tags and document structures, as well as skill, is necessary when selecting a portion in a structured document such as an HTML document.