To provide a global, multimedia communication service, portable telephones compatible with the IMT-2000 (International Mobile Telecommunications 2000) specifications, which were prepared for the next-generation mobile communication system, are being introduced at an early stage. An IMT-2000 compatible portable telephone is provided with a maximum bandwidth of 2 Mbps, and a video distribution service is planned as one application. However, because of various existing constraints, such as the limits imposed by available devices (their sizes and resolutions) and communication fees, using a portable terminal to watch a video for an extended period of time is difficult.
Therefore, a system for summarizing the enormous amount of video content involved and for providing video digests is sorely needed. Specifically, it is important to add a variety of meaningful information to videos, so that videos that satisfy the desires of viewers can be selected and extracted from the huge amount of video stored in a video database, and so that video digests can be generated efficiently. This meaningful information consists of indexes (metadata). An individual who assembles and adds indexes to videos is called an index adding person, while an individual who actually watches and listens to the digests generated from the added indexes is called an index user. Conventional index addition techniques are based, for example, on image, speech and natural language processing.
For example, in Japanese Unexamined Patent Publication Nos. Hei 10-112835, 2000-175149 and 2000-236518, a technique is disclosed for automatically detecting scenes in a video (continuous video segments delimited by unmistakable story line changes along the time axis of the video), for using as indexes representative images of the individual scenes and frames positioned at specific intervals from the head, and for linking these indexes to prepare video summaries and to generate digests. However, since this technique summarizes a video based on the characteristics of the video itself, it is difficult to generate a digest based on any meaning attributable to the contents of the video.
According to a technique disclosed in Japanese Unexamined Patent Publication No. Hei 11-331761, attention is focused on words interjected to effect topical changes, such as “tsugi-ni (next)” or “tokorode (by the way)”, and subtitle information synchronized with the video is obtained. The obtained information is analyzed to detect these topic-changing words, and when one of them is encountered, it is assumed that the contents of the video have changed, and the pertinent video is extracted for a specific period of time. The extracted video is used to construct indexes, and the indexes thus obtained are linked together to prepare a summary of the video. However, the presence of subtitle information is a prerequisite for applying this technique, and the technique cannot cope with videos for which such additional information has not been provided.
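The cue-word approach described above can be illustrated with a minimal sketch. It is not the implementation of the cited publication: the `Subtitle` record, the `extract_segments` function, the fixed 10-second window and the simple substring match are all assumptions introduced here for illustration; only the two cue words come from the text.

```python
from dataclasses import dataclass

# The two cue words named in the text; a real list would be longer (assumption).
TOPIC_CUES = ("tsugi-ni", "tokorode")


@dataclass
class Subtitle:
    """Hypothetical record for one video-synchronized subtitle line."""
    start: float  # seconds from the head of the video
    text: str


def extract_segments(subtitles, cues=TOPIC_CUES, window=10.0):
    """Return (start, end) spans, one per subtitle that contains a cue word.

    Each span begins at the subtitle's timestamp and lasts `window` seconds,
    standing in for the "specific period of time" mentioned in the text.
    """
    segments = []
    for sub in subtitles:
        lowered = sub.text.lower()
        if any(cue in lowered for cue in cues):
            segments.append((sub.start, sub.start + window))
    return segments


subs = [
    Subtitle(0.0, "Konnichiwa"),
    Subtitle(12.5, "Tsugi-ni, sports news"),
    Subtitle(40.0, "Tokorode, the weather"),
]
summary = extract_segments(subs)  # spans to be linked into a summary
```

The sketch also makes the stated limitation concrete: with an empty `subtitles` list the function returns no segments, so a video without subtitle information yields no indexes at all.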
One technique, which assumes that the contents of speech will be transcribed, employs as indexes important words encountered in the flow of speech, automatically detected scenes and representative images extracted from individual scenes, information concerning objects depicted in a video, such as portraits or captions, and attendant information, such as a shift in the focus of a camera, that is obtained mainly by image processing (see CMU-CS-95-186, School of Computer Science, Carnegie Mellon University, 1995). A similar technique has also been proposed. According to this technique, portraits in a video, and the relationships established between those portraits and the names of persons contained in spoken sentences, are employed as indexes, and summaries are prepared based on the individual names (see IJCAI, pp. 1488-1493, 1997). However, this technique also assumes that additional information is present, and thus cannot be applied to ordinary video recordings. Another technique involves identifying characters from a telop in a video and comparing them with characters obtained from a closed caption. Characters included in both a telop and a closed caption are regarded as a keyword, and the video for the pertinent portion is used as an index. The indexes thus obtained are then linked together to form a video summary (see Transaction of Information Processing Society of Japan, Vol. 40, No. 12-017, 1999). However, this technique assumes that closed captions are present, and thus, as with the above technique, it cannot cope with video for which no additional information is available.
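The telop and closed-caption comparison described last can be sketched as follows. This is only an illustration under stated assumptions: the function name `keyword_indexes`, the `(time, word)` input format and the time `tolerance` used to decide that a telop and a caption belong to the same portion of video are hypothetical and do not come from the cited paper.

```python
def keyword_indexes(telops, captions, tolerance=2.0):
    """Return sorted (time, word) pairs for words found in both sources.

    telops, captions: lists of (time_in_seconds, word) tuples (assumed format).
    A telop word counts as a keyword when the same word occurs in a closed
    caption within `tolerance` seconds (an assumed matching rule).
    """
    indexes = []
    for t_time, t_word in telops:
        for c_time, c_word in captions:
            if t_word == c_word and abs(t_time - c_time) <= tolerance:
                indexes.append((t_time, t_word))
                break  # one match is enough for this telop
    return sorted(indexes)


telops = [(5.0, "goal"), (30.0, "replay")]
captions = [(5.5, "goal"), (31.0, "foul"), (90.0, "replay")]
keywords = keyword_indexes(telops, captions)
```

Here only "goal" appears in both sources at nearly the same time, so it alone becomes an index. If `captions` is empty, no keyword can be found, which mirrors the limitation noted in the text.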
As a technique that uses speech recognition, there is one whereby video is summarized by using, as indexes, scenes that are automatically detected by image processing, representative images of these scenes, and information (content, time) concerning the speech that is detected by speech recognition (see Proceedings of SIGIR '99, p. 326, 1999). However, this technique can be satisfactorily applied only to limited types of videos, such as news, in which there is no background noise or music, and it is difficult to use for other, ordinary types of videos.
Therefore, a technique was developed for adding indexes not only to limited types of videos but also to a wide variety of video types. For example, in Japanese Unexamined Patent Publication No. 2000-23062, telop videos obtained from a video, including telops for the prompt reporting of breaking news and the broadly exaggerated telops used in variety programs, are used as video indexes, and speech, volume values and tone information types are used as speech indexes. Further, personal information is prepared, and based on this information, feature tones are represented using the speech indexes, and the corresponding video portions are extracted. When an extracted video portion and telop video are combined and the resulting feature video is employed, a digest can be generated for an entire video. However, although this technique regards speech information as important, it employs only external information factors, so that it is difficult to generate digests based on the meaning of videos, and videos for which no telops are provided cannot be handled.
An additional technique is disclosed in Japanese Unexamined Patent Publication No. Hei 10-150629. According to this technique, an index adding person sets a “scene”, which is a unit wherein a set of contents is expressed, and a “scene group”, which consists of a number of scenes, selects a representative image for each scene and each scene group, and uses the representative images as indexes. With this technique, the index adding person can use the indexes to generate a video digest corresponding to a situation; however, the person must first understand the contents of the video and determine which scenes are important. Therefore, although a digest can be generated based on the meaning of the video contents, an extended period of time is required for the addition of indexes.
When, in the future, content providers perform digital broadcasting or video distribution using portable telephones or portable terminals, such as PDAs (Personal Digital Assistants), video distribution services will be established that provide more variety and are more efficient than the conventional services presently available. In particular, given the convenience that portable terminals offer for video distribution, there will be an increased demand for video digests covering events for which immediacy is a requisite, such as live sports broadcasts. Further, since the time spent actually watching video will be greatly restricted by the communication fees charged for portable terminals, the demand will be for digests that satisfy the interests and tastes of users and that are short enough to be distributed to portable terminals.
As is described above, indexes must be added to videos in order to generate digests efficiently. For the distribution of entertainment matter such as movies, dramas or documentaries, for which immediacy of contents is not highly important, indexes can be added to videos using the conventional techniques now employed by broadcast stations. However, when indexes are added to videos for which immediacy is of vital concern, such as live sports broadcasts, the index addition process must be performed in real time, and the reliability of the added indexes becomes a concern. Conventional techniques do not suffice to resolve the problems posed by these two requirements.