In recent years, research on the highly efficient reuse of portions of Web pages has attracted attention from various viewpoints. Such portions exist in large numbers and contain important content, and they are reused by cutting them out and converting them into individual parts. Note that, in this specification, the term “cutout” is used in the sense generally understood by those skilled in the art: the “cut out” portions are not deleted from the Web content from which they are taken. Strictly speaking, “cutout” in this specification means copying a range of target content portions in original Web content or the like in order to paste them into another Web page or the like.
In the field of Web services, content cutout has attracted attention as a technology for bridging existing HTML content and Web services. For example, an existing server system can be adapted to Web services as it is by cutting out an HTML form for searching articles on a news site and defining XML input/output for that form.
Moreover, in the field of information portals, which aggregate various types of information and provide portal pages matching users' requests, partial components of existing Web pages are important content. Regions containing top news and headlines are cut out from various news sites and freely combined, which makes it possible to expand the content to a great extent. In fact, mySiteOutliner, WebSphere Portal Server and the like provide, as a part of the product, a mechanism for incorporating a part of existing Web pages into portal pages.
In addition, a standard that allows third parties to utilize information updated on Web sites and the like, by providing the information in an XML format called RSS (Rich Site Summary), has become widespread. At present, RSS is generated by preparing a dedicated server-side program (CGI or the like). However, if the page cutout technology is used, converting a headline list in a page into RSS makes it possible to provide dynamic and highly up-to-date RSS.
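By way of illustration only, the conversion of a cut-out headline list into RSS might be sketched as follows. The fragment, the site URL, the channel description text and the function name `headlines_to_rss` are hypothetical and are not taken from any of the cited products; the sketch also assumes that the cut-out portion is well-formed markup and emits only a minimal RSS 2.0 channel.

```python
import xml.etree.ElementTree as ET

def headlines_to_rss(channel_title, channel_link, headlines):
    """Build a minimal RSS 2.0 document from (title, link) pairs."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Headlines cut out from " + channel_link
    for title, link in headlines:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")

# Hypothetical cut-out headline list (assumed to be well-formed markup).
fragment = """<ul>
  <li><a href="http://news.example.com/a1">First headline</a></li>
  <li><a href="http://news.example.com/a2">Second headline</a></li>
</ul>"""

cut = ET.fromstring(fragment)
# Each anchor in the cut-out portion becomes one RSS item.
headlines = [(a.text, a.get("href")) for a in cut.iter("a")]
rss_xml = headlines_to_rss("Example News", "http://news.example.com/", headlines)
print(rss_xml)
```

Because the RSS is rebuilt from the page each time it is requested, no dedicated server-side program needs to be maintained alongside the page itself.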
Furthermore, in the field of transcoding, technologies have been researched in which important information in Web pages is presented preferentially, thus converting the Web pages into pages that are easy to read for users of pervasive devices and for users with amblyopia (low vision) who use browsers with an enlarged display. A function for clipping pages based on XPath-based annotation descriptions is also implemented in IBM WebSphere Transcoding Publisher.
As described above, it is known that a part of Web content can be reused highly efficiently by being cut out appropriately.
(1) In the related art, there are two methods for cutting out a part of a Web page:
(a) a method using the XPath; and
(b) a method using an original tag.
(a) Method Using XPath:
The method using XPath is powerful when the Web pages are assured to be static and unchanged. For example, in non-patent document 1, content is cut out by XPath designation in order to generate pages for portable terminals. However, the designation is troublesome and its range of application is narrow, among other drawbacks; in practice, therefore, a separate set of pages for portable terminals is frequently prepared instead. That is, this method is not actually in widespread use. Moreover, non-patent document 2 proposes a scheme in which a part of a Web page is selected, and an input portion and an output portion are selected, so that the Web page can easily be incorporated into Web services. Although this technology is excellent in that Web pages can be easily cut out and coupled to services, it has the problem that the cutout depends on XPath. Furthermore, in non-patent document 3, a list of images and articles is cut out from the top page of the IBM home page and the like by use of XPath, and the cut-out list is incorporated into a part of a “personal newspaper.” The cutout portions shift whenever the layout changes. Such shifts are therefore handled by manually correcting the XPath definition file, after which the cutout portions are delivered automatically.
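The fragility of a positional XPath designation under a layout change can be illustrated with a minimal sketch. The two pages and the XPath expression below are hypothetical examples, not the definitions used in the cited documents, and the pages are assumed to be well-formed so that Python's standard `xml.etree.ElementTree` module (which supports only a limited XPath subset) can parse them.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before a layout change.
OLD_PAGE = """<html><body>
  <div>Banner</div>
  <div><ul><li>Headline A</li><li>Headline B</li></ul></div>
</body></html>"""

# The same page after one block has been inserted above the headlines.
NEW_PAGE = """<html><body>
  <div>Banner</div>
  <div>New advertisement</div>
  <div><ul><li>Headline A</li><li>Headline B</li></ul></div>
</body></html>"""

# Positional XPath designation written against the old layout.
HEADLINE_XPATH = "./body/div[2]"

def cut_out(page, xpath):
    """Return the text of the node designated by the XPath, or None."""
    node = ET.fromstring(page).find(xpath)
    if node is None:
        return None
    return " ".join(t.strip() for t in node.itertext() if t.strip())

old_cut = cut_out(OLD_PAGE, HEADLINE_XPATH)
new_cut = cut_out(NEW_PAGE, HEADLINE_XPATH)
print(old_cut)  # Headline A Headline B
print(new_cut)  # New advertisement
```

The same expression silently selects a different node after the change, which is why the XPath definition has to be corrected by hand each time the layout shifts.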
(b) Method Using Original Tag:
In this method, original (proprietary) tags are mixed into the HTML tags. Sometimes, a particular character string is designated in an HTML comment. This method is widely used in portal services such as LYCOS and YAHOO. For example, it is used to display an explanation of recommended goods from a shopping page on the top page as well. Because such markers can be processed by a simple HTML parser or the like, this method is frequently used where an HTML parser is employed. However, this method has the problem that the original content must be changed.
Related art similar to the present invention is listed below, although it does not concern technologies for cutting out a part of Web page content.
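As a minimal sketch of the comment-based variant of this method, the following extracts the region enclosed by a pair of marker comments. The marker strings, the page and the function name `cut_between_markers` are hypothetical; the markers actually used by any particular portal service are not described here.

```python
import re

# Hypothetical marker comments; each service would define its own strings.
START = "<!-- CUTOUT-START recommended -->"
END = "<!-- CUTOUT-END recommended -->"

def cut_between_markers(html, start=START, end=END):
    """Return the markup between the two marker comments, or None."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(1).strip() if match else None

shopping_page = """<html><body>
<h1>Shop</h1>
<!-- CUTOUT-START recommended -->
<p>Recommended: blue kettle</p>
<!-- CUTOUT-END recommended -->
<p>Other goods</p>
</body></html>"""

recommended = cut_between_markers(shopping_page)
print(recommended)  # <p>Recommended: blue kettle</p>
```

The extraction itself is trivial, but it works only because the markers were written into the original page, which is the drawback noted above.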
(2) Dynamic Annotation Matching Method Using XPath Set as Key (Japanese Patent Application No. 2001-333260 not Yet Laid-Open at the Time of Preparing this Specification):
In this method, an XPath included in an annotation is used as a key, and a suitable annotation is selected from a plurality of candidates. With this method, correct annotation matching has been achieved in many cases by preparing enough annotations to cover the entire layout. In many cases, however, the XPath also indicates an incorrect node at the authoring step. Functions such as an empty content alert, a leaked text alert and semi-automatic correction of the XPath have been developed to correct such incorrect nodes. In practice, however, the adjustment work remains troublesome.
(3) Other Annotation Matching Methods:
In many cases, such as with RDF, annotations and pages are matched by use of a collation table and regular expressions on URLs. The present invention differs greatly from these methods in that it performs dynamic matching against the content.
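The URL-based matching of this related art can be sketched as a collation table of URL patterns and annotation names. The patterns, the annotation file names and the function name `annotation_for` are hypothetical and serve only to show that the selection depends on the URL alone, not on the content of the page.

```python
import re

# Hypothetical collation table: URL pattern -> annotation file name.
ANNOTATION_TABLE = [
    (re.compile(r"^http://news\.example\.com/articles/\d+\.html$"), "article.ann"),
    (re.compile(r"^http://news\.example\.com/(index\.html)?$"), "top.ann"),
]

def annotation_for(url):
    """Return the first annotation whose URL pattern matches, or None."""
    for pattern, name in ANNOTATION_TABLE:
        if pattern.match(url):
            return name
    return None

print(annotation_for("http://news.example.com/articles/123.html"))  # article.ann
print(annotation_for("http://news.example.com/"))                   # top.ann
print(annotation_for("http://news.example.com/sports/"))            # None
```

A page whose layout has changed but whose URL has not would still receive the same annotation under this approach.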
(4) Difference Calculation and Use Thereof:
DiffWeb (example: non-patent document 4), HTML Diff (example: non-patent document 5), MindIt (example: non-patent document 6) and the like are known as services and technologies that use a difference (diff) calculation to present and reuse only updated information and to send notification mail. In these technologies, a difference is calculated between a “last past page” and the present page, and the content obtained as the difference is utilized. By contrast, the present invention differs greatly from these technologies in that its object is to “generate a matching pattern.” In addition, the present invention also differs greatly in its constituent technologies, such as difference calculations and statistical processing over plural versions of past pages, the concept of adjacent pages and difference calculations against them, and the like.
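The related-art approach of taking the difference between the last past page and the present page can be sketched with a line-based comparison. This is an illustrative assumption only: the cited services do not disclose their algorithms here, the pages are hypothetical, and Python's standard `difflib` module is used in place of whatever calculation those services actually perform.

```python
import difflib

def added_lines(previous_page, present_page):
    """Return the lines of the present page that are new or changed."""
    prev = previous_page.splitlines()
    pres = present_page.splitlines()
    matcher = difflib.SequenceMatcher(None, prev, pres, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both contribute content that is
        # present only in the newer page.
        if tag in ("insert", "replace"):
            added.extend(pres[j1:j2])
    return added

# Hypothetical snapshots of the same page at two points in time.
previous_page = "<h1>News</h1>\n<li>Old story</li>\n<p>Footer</p>"
present_page = "<h1>News</h1>\n<li>Fresh story</li>\n<li>Old story</li>\n<p>Footer</p>"

updated = added_lines(previous_page, present_page)
print(updated)  # ['<li>Fresh story</li>']
```

Note that the result here is the updated content itself; nothing in this comparison produces a reusable matching pattern.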
(5) Simplification Technology by Difference Calculation (Patent Document 1):
In this technology, specific information is taken out of a page by use of a difference (diff) calculation, and the information is simplified. Although this technology has in common with the present invention that adjacent pages are listed and difference calculations are performed against them, it does not suggest a specific method for cutting out a part of Web content.
(6) Matching Technology for a Tree Structure:
As matching technologies for tree structures, a regular expression matching technology (TRex), matching of tree structures based on hedge automaton theory, its application to schema languages (RELAX and RELAX NG), and the like have been researched. These technologies search a tree structure for subtrees (nodes) that match a pattern, on the premise that the matching pattern already exists; they do not suggest automatic generation of the matching pattern.
(7) Technology Related to Automatic Generation of Matching Pattern:
There is a technology called “Examplotron,” which automatically generates a schema description that matches a group of XML samples. This technology is similar to the present invention in that a certain type of matching pattern is automatically generated from a group of XML files. However, it differs from the present invention, described later, in that its subject is a group of “well-formatted” XML files “in conformity with a certain tacit schema” and in that a strict matching pattern is generated by use of the “embedding structure” of the tags as a key.
(8) Efficiency Enhancement for Work of Adding Annotations (Patent Document 2):
In this technology, a common annotation is added to page files whose layout structures are analogous to each other, in an attempt to make the work of adding annotations more efficient. Whether the page files are analogous in layout structure is determined by collating structural description formulae; a matching pattern based on statistical information on the occurrence modes and occurrence frequencies of nodes is not utilized.
[Patent Document 1]
Japanese Patent Laid-Open No. 2002-55872
[Patent Document 2]
Japanese Patent Laid-Open No. 2002-245068
[Non-Patent Document 1]
WTP (WebSphere Transcoding Publisher,
[Non-Patent Document 2]
CHIP[I] Ito “Construction method of distributed applications by integration of GUI parts and WEB services,” Japan Society for Software Science and Technology WISS 2001 Proceedings
[Non-Patent Document 3]
IBM mysite Outliner
[Non-Patent Document 4]
DiffWeb
[Non-Patent Document 5]
HTML Diff
[Non-Patent Document 6]
MindIt