1. Field of the Invention
The present invention relates to a technique for transcoding information (e.g. digital content, such as a web page) on a network and for distributing the transcoded information, and in particular to a transcoding technique based on an annotation prepared for the information.
2. Description of the Related Art
When a predetermined terminal device requests access to certain information on a network, the information can be converted in accordance with the specifications of the terminal device or its use environment before it is presented to the terminal device. This conversion technique is called “transcoding.” For example, to provide web content on the Internet, the structure of a web page can be adjusted by transcoding so that the web content fits the small display screen of a portable information terminal, or the structure can be altered and adapted for use by a speech browser that performs voice synthesis.
Roughly speaking, there are two transcoding methods. One employs no additional information. The other uses external meta information (annotation). With the transcoding method that employs no additional information, all web contents can be transcoded, regardless of the types and contents of the web data. However, because the types and contents of the web data are not taken into account, the transcoding accuracy is low. With the transcoding method based on annotation data, on the other hand, appropriate transcoding is performed based on annotations that correspond to the web contents, so the transcoding accuracy is high. However, since much labor and high cost are required to input the meta information for annotation, annotation information cannot be added to all web contents, and the number of web contents that can be transcoded is limited. Therefore, in order to transcode more web contents at high accuracy, it is important to reduce the workload required for adding an annotation.
FIG. 1 is a diagram for explaining the system configuration for performing transcoding based on annotations. In FIG. 1, a transcoding system comprises: a transcoder 910 for converting (transcoding) web content; and an annotation database system 920 in which annotation files used for transcoding are stored. In FIG. 1, when a terminal device 940 issues an access request to a web server 930, the web server 930 returns the target web content, and the transcoder 910 receives the web content first. The transcoder 910 refers to the annotation database system 920 and transcodes the web content based on the data contained in the annotation file (hereinafter referred to simply as an annotation) that corresponds to the web content. Thereafter, the transcoder 910 transmits the transcoded web content to the terminal device 940.
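The request flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `Transcoder`, the modeling of an annotation as a function from content to content, the lookup of annotations by exact URL, and the pass-through behavior when no annotation exists are all hypothetical and are not taken from the system of FIG. 1.

```python
from typing import Callable, Dict

# Assumption: an annotation is modeled as a function that rewrites content.
Annotation = Callable[[str], str]


class Transcoder:
    """Stands in for transcoder 910; all names here are illustrative."""

    def __init__(self, fetch: Callable[[str], str],
                 annotation_db: Dict[str, Annotation]) -> None:
        self.fetch = fetch                  # stands in for web server 930
        self.annotation_db = annotation_db  # stands in for annotation database system 920

    def handle_request(self, url: str) -> str:
        """Return the content that would be sent to the terminal device."""
        content = self.fetch(url)                 # the transcoder receives the content first
        annotation = self.annotation_db.get(url)  # look up the corresponding annotation
        if annotation is None:
            return content                        # assumption: pass through unchanged
        return annotation(content)                # transcode based on the annotation
```

In this sketch, content without a matching annotation is returned as received, which mirrors the limitation noted in the text: only annotated web contents are actually transcoded.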
As a countermeasure for reducing the workload required by the thus arranged system to add an annotation for the transcoding process, it is important that an annotation authoring tool be prepared. Further, one annotation may also be employed for different web contents having the same layout. The conventional methods for correlating one annotation with multiple web contents can be sorted into three types.
    1. The correlation between URLs (Uniform Resource Locators) and annotations is stored as table data (correlation table data).
    2. A regular expression of URL is employed.
    3. An annotation to be employed is dynamically determined by using a table structure of the web content (an automatic determination).
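Methods 1 and 2 in the list above can be illustrated with a minimal sketch. The URLs, the annotation file names, and the function name `find_annotation` are hypothetical and chosen only for illustration. The order in which the two methods are tried is also an assumption.

```python
import re
from typing import Optional

# Method 1: correlation table data (exact URL -> annotation). Hypothetical entries.
URL_TABLE = {
    "http://news.example.com/index.html": "news_top.annot",
}

# Method 2: regular expressions of URLs. Hypothetical pattern.
URL_PATTERNS = [
    (re.compile(r"http://news\.example\.com/articles/\d{8}/\d+\.html"),
     "news_article.annot"),
]


def find_annotation(url: str) -> Optional[str]:
    """Return the annotation name for url, or None if no rule applies."""
    # Exact match in the correlation table.
    if url in URL_TABLE:
        return URL_TABLE[url]
    # Otherwise, the first pattern that matches the whole URL.
    for pattern, annotation in URL_PATTERNS:
        if pattern.fullmatch(url):
            return annotation
    return None
```

A table entry covers exactly one URL, whereas one pattern covers every article URL of the same shape. A URL that matches neither rule receives no annotation.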
As is described above, when conversion using the transcoding technique is performed to provide information on a network, the transcoding method based on annotations is employed in order to attain high transcoding accuracy. However, since much labor and high cost are required for the input of meta information for annotations, an annotation cannot be added to all the information, i.e., all the web contents, in a typical network system such as the Internet, and the number of web contents that can be transcoded is limited. In order to reduce the workload required to add an annotation, the above described methods for correlating one annotation with multiple web contents have been proposed. However, for method 1, whereby the correlation between URLs and annotations is stored as table data, it is not practical to update the table content frequently in order to cope with the new URLs that are generated day after day. Therefore, this method cannot be employed, especially for web pages that carry news articles or search results obtained by a search engine.
For method 2, which uses a regular expression of the URL, the author of an annotation must analyze the URL structure of a web site and write a complicated regular expression, so a great deal of work is required. Further, this method cannot cope with web contents whose layouts are dynamically changed using cookie data. If the method using a regular expression of the URL is employed together with an XPath wildcard designating a specific portion of an HTML document, web content whose layout is changed dynamically can be handled to some extent. In this case, the URL structure of the web site is first thoroughly analyzed, and the URL condition under which the same layout appears is determined. Then, if the web content cannot be handled by the regular expression alone, the XPath wildcard is employed so that the method can be applied more widely.
FIGS. 2A and 2B are schematic diagrams showing example layouts for a web page on which news articles are described. The layout in FIG. 2A differs from the layout in FIG. 2B in that a table “Top news” is inserted. The “Top news” table is arbitrarily added or deleted by a person acting as a web content manager. In this case, assume that a regular expression can be obtained for a URL that specifies the two web pages in common in FIGS. 2A and 2B, and that the XPath for the web pages is written as follows.
/html[1]/body[1]/table[7]/tbody[1]/tr[1]/td[3]/table[1]
If a wildcard is introduced in order to cope with the addition or deletion of the “Top news” table, the XPath is written as follows.
/html[1]/body[1]/table[7]/tbody[1]/tr[1]/td[3]/table[starts-with(child::tbody[1]/tr[1]/td[1]/table[1]/tbody[1]/tr[1]/td[1], '▪Top news')]
However, since these operations are complicated and the description of the XPath also becomes complicated, a heavy workload is imposed on the author of the annotation. Furthermore, the method of employing the XPath wildcard can cope with a simple layout change, such as the addition or deletion of a visually semantic block (a header, a footer, a link list, main text or an advertisement; hereinafter referred to as a group) that is an element or component of the web content and is represented by a certain layout (e.g. a background color), but it is difficult for this method to handle a major change affecting the entire layout.
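The difference between selecting a table by position and selecting it by content can be sketched as follows. This is a simplified illustration, not the XPath processing of any particular transcoder: the two sample pages, the omission of tbody elements, and the helper names are assumptions, and the content test is written in plain Python because the standard library's ElementTree does not support the XPath `starts-with` function.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, well-formed stand-ins for the two layouts.
PAGE_WITH_TOP_NEWS = (
    "<html><body>"
    "<table><tr><td>Top news</td></tr></table>"
    "<table><tr><td>Sports</td></tr></table>"
    "</body></html>"
)
PAGE_WITHOUT_TOP_NEWS = (
    "<html><body>"
    "<table><tr><td>Sports</td></tr></table>"
    "</body></html>"
)


def first_cell_text(table: ET.Element) -> str:
    """Return the text of the first cell of a table, or an empty string."""
    cell = table.find("./tr/td")
    return "".join(cell.itertext()).strip() if cell is not None else ""


def select_positional(root: ET.Element) -> Optional[ET.Element]:
    """Positional selection: always the first table under body."""
    return root.find("./body/table[1]")


def select_by_content(root: ET.Element, prefix: str) -> Optional[ET.Element]:
    """Content-based selection: the table whose first cell starts with prefix."""
    for table in root.findall("./body/table"):
        if first_cell_text(table).startswith(prefix):
            return table
    return None
```

With these sample pages, the positional selector returns the “Top news” table on one page and the “Sports” table on the other, while the content-based selector returns the “Top news” table only where it exists.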
Further, even for specific web contents at the same URL, the layout may be dynamically changed based on other web contents that have been passed through before the specific web contents are reached. Similarly, the layout may be dynamically changed by re-loading the web content using the same URL. In these cases, the regular expression of the URL is not sufficient for adding an annotation, and the XPath wildcard must be employed. However, when there is a major change in the layout, it is difficult to handle the change with the XPath. In addition, there are many web pages on which the results obtained by a search engine are displayed. The layout of such pages tends to change greatly, depending on whether a search target (a page, a product, a book, etc.) corresponding to a matched keyword is present or not. In this case as well, it is difficult to cope with the web pages by the conventional approach of using the regular expression of the URL and the XPath.
Furthermore, in method 3, which correlates one annotation with multiple web contents by employing the table structure of the web contents to dynamically determine which annotation is to be used, the table used for specifying a layout is employed as the criterion (reference) for the determination. Thus, an appropriate annotation cannot be determined when a table in a web content is not used for a layout purpose, or when a layout having the same form but different content is employed. If the determination criteria are applied more strictly in order to avoid an erroneous determination (e.g. different layouts being regarded as the same), layouts that are basically the same may be judged to be different, so erroneous determinations still cannot be avoided.
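The weakness of a determination based only on table structure can be illustrated with a minimal sketch. The signature used here (the sequence of table, tr and td start and end tags) is an assumption made for illustration and is not the determination criterion of any specific conventional system. The sample pages are likewise hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class TableSignatureParser(HTMLParser):
    """Records table, tr and td start/end tags and ignores all other content."""

    TAGS = {"table", "tr", "td"}

    def __init__(self) -> None:
        super().__init__()
        self.tokens: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self.tokens.append(tag)

    def handle_endtag(self, tag):
        if tag in self.TAGS:
            self.tokens.append("/" + tag)


def table_signature(html: str) -> Tuple[str, ...]:
    """Return a structural signature of the tables in html."""
    parser = TableSignatureParser()
    parser.feed(html)
    parser.close()
    return tuple(parser.tokens)
```

Under this assumed signature, a news table and an advertisement table with the same shape are indistinguishable, which corresponds to the case of a layout having the same form but different content. Conversely, adding a single cell changes the signature even when the layout is basically the same.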
It is, therefore, one object of the present invention to correctly employ an annotation for multiple web contents and thus to efficiently reduce the workload required for adding an annotation during the transcoding process.
It is another object of the present invention to provide a tool for simplifying the addition of an annotation to web content.