Conventionally, there is a practice of creating index information by extracting information on features of multimedia contents. For extracting information on features of multimedia contents, manual and automatic methods are available.
Two automatic methods are available. One extracts feature information through a feature information extraction algorithm that does not depend on the meaning or quality of the multimedia contents. The other uses a recognition technology specialized for the respective media to recognize the meanings of the multimedia contents and carry out structuring.
The former is described, for example, in “On a way to Retrieve Video using Gamon Information” (TECHNICAL REPORT OF IEICE IE98-83). This method uses the average color information of each frame of a moving image as the feature information of the moving image. There is also a long-established method of using a histogram of each frame as feature information.
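The two color-based features above can be sketched in a few lines. The following is a minimal illustration, assuming NumPy and frames given as H x W x 3 RGB arrays; the function name and bin count are chosen here for illustration only.

```python
import numpy as np

def frame_features(frame, bins=8):
    """Compute simple per-frame features: average color and a per-channel
    histogram. Both depend only on pixel color values, not on the meaning
    of the content, which is the limitation noted in the text."""
    # Mean R, G, B over all pixels.
    avg_color = frame.reshape(-1, 3).mean(axis=0)
    # One coarse intensity histogram per color channel (3 x bins counts).
    hist = np.stack([
        np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ])
    return avg_color, hist

# Toy example: a solid red 4x4 frame.
red = np.zeros((4, 4, 3), dtype=np.uint8)
red[..., 0] = 255
avg, hist = frame_features(red)
```

Comparing such features between consecutive frames is also the usual basis for the cut-point detection mentioned later.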
The latter is described, for example, in “Organizing Video by Understanding Media” (TECHNICAL REPORT OF IEICE IE99-18). This method structures the content of a moving image by combining scenario information, voice information, image information, telop and CC (Closed Caption) information using a coordinated combination technique.
Furthermore, IEICE Transactions D-II Vol. J80-D-II No. 6, pp. 1590 to 1599 (1997) reports a technology for searching for an arbitrary moving image scene using human words. This method associates positional relationships between objects, and their movements and changes, with human words beforehand. It then semi-automatically extracts objects in a moving image whose positional relationship or movement has changed in a manner corresponding to those words, and thereby searches for an arbitrary moving image scene using human words.
Furthermore, “LIFE MEDIA: Structuring & summarization of personal experience imaging” (TECHNICAL REPORT OF IEICE IE2000-23) reports on a study of associating human sensibilities with moving images. This study uses the alpha and beta waves of human brain waveforms and associates changes in these waveforms with the meaning or content of the moving images.
On the other hand, MPEG-7 (Moving Picture Experts Group Phase-7) of the International Organization for Standardization aims to realize high-function search and abstraction using multimedia contents tagged with viewpoints and viewpoint scores. However, the scope of MPEG-7 does not cover any method of creating viewpoint scores, and the method of creating them therefore remains a problem in implementation.
With regard to viewpoint score creation, manual off-line tagging is the mainstream method. This method first extracts the cut points (features) of an image, delimits the image at those cut points and thereby divides it scene by scene. The author then determines time information for each cut point and the viewpoint and viewpoint score to be assigned to each scene. Finally, the author uses an editor to create meta data in an XML (eXtensible Markup Language) format, which is information on the multimedia contents, from the determined time information, viewpoints and viewpoint scores. When describing the meta data contents, the author manually enters characters using a keyboard. When creating index information, the user of the meta data enters arbitrary viewpoints and viewpoint scores, extracts the information on those viewpoints and viewpoint scores from the contents and creates the index information.
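The meta data the author produces by hand can be sketched as follows. This is an illustrative fragment only: the element and attribute names (`Scene`, `ViewPoint`, `score`) and the scene data are hypothetical placeholders, not the actual MPEG-7 schema, and serve only to show the kind of time/viewpoint/score association described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-scene entries the author would have determined by hand:
# (start time, end time, viewpoint, viewpoint score).
scenes = [
    ("00:00:00", "00:00:12", "goal",  0.9),
    ("00:00:12", "00:00:30", "crowd", 0.4),
]

# Build an XML tree associating each scene with a viewpoint and score.
root = ET.Element("MetaData")
for start, end, viewpoint, score in scenes:
    scene = ET.SubElement(root, "Scene", start=start, end=end)
    ET.SubElement(scene, "ViewPoint", name=viewpoint, score=str(score))

xml_text = ET.tostring(root, encoding="unicode")
```

An index-creation tool could then select only the scenes whose viewpoint and score match the user's query, which is the search/abstraction use MPEG-7 envisages.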
However, the conventional viewpoint meta data creation method involves the following problems:
That is, the method of using the average color and a histogram of each frame as feature information can extract feature information through simple calculations, but can associate the extracted feature information with color information only. Thus, scenario information, voice information, etc. are not reflected in the feature information.
On the other hand, the coordinated technique using a plurality of pieces of information such as scenario information and image information contributes to improving the accuracy of a content analysis. However, while feature information such as scenario information and image information can easily be detected by a human visual check, it is difficult to calculate and detect such feature information mechanically and automatically. That technique therefore has problems when put to practical use.
Furthermore, according to the method of semi-automatically extracting objects in a moving image whose positional relationship or movement has changed, not all human words correspond to changes in positional relationship and movement between frames. Thus, it is difficult to automatically associate human words with multimedia contents using this method. Moreover, the relationship between human words and multimedia contents varies from one multimedia content to another, and this association is only applicable to specific moving images. In the above described document, the applications are further limited to specific sports.
Furthermore, the method of using human brain waveforms to associate feature information with human sensibilities will take considerable time to put to practical use, because many areas of the structure of the human brain itself remain unknown.
Furthermore, using such automatic feature information extraction to generate viewpoints and viewpoint scores involves the problems that the type and number of viewpoints and viewpoint scores are restricted by the content of the recognition technology, and that the accuracy of the viewpoints and viewpoint scores depends on the performance of the recognition technology.
Furthermore, there is also the problem that the viewpoints and viewpoint scores created may differ in quality from human sensibilities or may not always match them. There is also the problem that extracting viewpoints and viewpoint scores often requires complicated apparatuses and processing, which increases costs.
Furthermore, according to the technology whereby the meta data author manually creates viewpoints and viewpoint scores off line, the image reproduction apparatus does not operate in concert with the writing or creation of the contents. For this reason, the author needs to successively record the time information of the multimedia contents at cut points and the correspondence between scene viewpoints and viewpoint scores, and then rewrite this recorded time information and these viewpoints and viewpoint scores into a final format. This involves the problem that the processing takes enormous time and cost.