1. Field of the Invention
The present invention relates to a scene information extraction method, a scene extraction method and a scene extraction apparatus for extracting, from video content, a zone that is coherent in meaning.
2. Description of the Related Art
Techniques are now being developed for adding meta-data to digital content, the delivery volume of which is increasing with, for example, the spread of broadband, so that the digital content can be managed and processed efficiently by a computer. For instance, in the case of video content, if meta-data serving as scene information, which clarifies “who”, “what”, “how”, etc., is attached to time-sequence data contained in the video content, the video content becomes easy to retrieve or summarize.
However, if content providers must add all appropriate meta-data themselves, the burden on them is too heavy. To avoid this, the following methods for automatically extracting scene information as meta-data from content information have been proposed:
(1) A method for extracting scene information from speech information contained in video content, or from the correspondence between the text information acquired by recognizing the speech information, and text information contained in the acting script corresponding to the video content (see, for example, JP-A 2005-167452 (KOKAI));
(2) A method for extracting scene information from text information, such as a subtitle, contained in video content, or from the correspondence between the text information contained in the video content, and text information contained in the acting script corresponding to the video content (see, for example, JP-A 2005-167452 (KOKAI)); and
(3) A method for extracting scene information from image information, such as cut information extracted from video content.
However, the above-described prior art has the following problems:
When speech information is utilized, abstract scene information indicating, for example, “a rising scene” can be extracted from, for example, the volume of cheering, and rough scene information can be extracted from a characteristic keyword. However, since the accuracy of current speech recognition is not very high, fine-grained scene information cannot be extracted. Further, scene information cannot be extracted from a silent zone.
When text information is utilized, scene information can be extracted by inferring a shift in the subject of conversation from a shift in the words appearing in the text information. However, if the video content is not accompanied by text information, such as a subtitle or an acting script, this method cannot be used. Furthermore, if text information, such as a subtitle, is added solely to enable the method, this inevitably increases the load on content providers. In that case, it would be better to add scene information as meta-data to the video data together with the text information than to apply the method to the video content after adding the text information thereto.
When cut information is utilized, the cut information itself indicates a very primitive zone, which is too small to be regarded as a zone coherent in meaning. In a program, such as a quiz or news program, where a typical sequence of cut information appears, the sequence can be extracted as scene information. However, such a typical sequence does not appear in all programs.
In addition, the above-described methods (1) to (3) utilize only static information contained in the video content. Therefore, the methods cannot handle a dynamic change in scene information (for example, a change in which a scene once regarded as “cool” comes to be regarded as “interesting”).