The popularity of video cameras has increased in recent years due to progressive price reductions. Most video cameras use magnetic videotapes to store captured video scenes in analog or digital format. Magnetic videotapes are relatively inexpensive and can store a large quantity of video. A single magnetic videotape may include multiple video scenes. A video scene may be defined as a video sequence having a common subject over a contiguous period of time and space. Therefore, a video scene contains a story or at least contains an independent semantic meaning. A video scene can include one or more video shots. A video shot is a video segment captured continuously over a period of time.
The use of magnetic videotapes has some disadvantages compared with other forms of video storage. One of the main disadvantages is that retrieving one or more desired video scenes or shots can be a challenging task. Since captured video scenes are stored linearly on videotape with respect to time, a user may have to search the entire videotape to find a desired video scene or shot. The difficulty in finding a desired video scene or shot is compounded when there are multiple videotapes that may contain the desired video scene or shot.
One solution for retrieving desired video scenes or shots from videotapes more easily is to transfer the contents of the videotapes to a video indexing device, such as a personal computer running video indexing software. If the video scenes are stored on the videotapes in an analog format, then the video scenes are first converted into a digital format. In the digital format, video indices can be generated to “mark” the different video scenes and shots. These video indices can be generated automatically using a conventional video indexing algorithm. The video indexing algorithm may detect visual changes between video scenes and shots to identify and index the video scenes and shots. The video indexing algorithm may also select a significant video frame (“keyframe”) from each video scene that best represents that video scene. Similarly, the video indexing algorithm may select a keyframe from each video shot that best represents that video shot. A single keyframe may represent both a video scene and a video shot of that scene. The keyframes for the video scenes and shots are subsequently presented to the user so that desired video scenes or shots can be easily retrieved.
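The visual indexing described above can be illustrated with a minimal sketch. The text does not specify how a conventional video indexing algorithm detects visual changes or selects keyframes, so everything below is an assumption made for illustration: frames are modeled as flattened lists of grayscale pixel values, shot boundaries are placed where the intensity histograms of consecutive frames differ by more than a threshold, and the keyframe of a shot is the frame whose histogram is closest to the mean histogram of the shot. The function names (`histogram`, `detect_shots`, `select_keyframe`), the bin count, and the threshold are hypothetical.

```python
from typing import List, Tuple

Frame = List[int]  # assumed representation: flattened grayscale pixels, 0..255


def histogram(frame: Frame, bins: int = 16) -> List[float]:
    """Return the normalized intensity histogram of one frame."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // 256] += 1
    total = len(frame)
    return [count / total for count in counts]


def hist_distance(a: List[float], b: List[float]) -> float:
    """L1 distance between two normalized histograms (ranges from 0 to 2)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def detect_shots(frames: List[Frame], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Split frames into shots, returned as (start, end) pairs, end exclusive.

    A new shot starts wherever consecutive frames differ visually by more
    than the (assumed) threshold.
    """
    if not frames:
        return []
    hists = [histogram(frame) for frame in frames]
    starts = [0]
    for i in range(1, len(frames)):
        if hist_distance(hists[i - 1], hists[i]) > threshold:
            starts.append(i)
    ends = starts[1:] + [len(frames)]
    return list(zip(starts, ends))


def select_keyframe(frames: List[Frame], start: int, end: int) -> int:
    """Return the index of the frame closest to the shot's mean histogram."""
    hists = [histogram(frame) for frame in frames[start:end]]
    n = len(hists)
    mean = [sum(column) / n for column in zip(*hists)]
    best = min(range(n), key=lambda i: hist_distance(hists[i], mean))
    return start + best


if __name__ == "__main__":
    # Three dark frames followed by two bright frames: one visual change.
    video = [[10] * 64] * 3 + [[200] * 64] * 2
    for start, end in detect_shots(video):
        print((start, end), "keyframe:", select_keyframe(video, start, end))
```

Note that the sketch uses visual information only, so the resulting index carries no information about the audio content of each shot.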
A concern with the conventional video indexing algorithm is that the indexed video scenes and shots cannot be retrieved based on audio content. Since video scenes and shots are indexed according to visual information, a user cannot selectively retrieve video segments, which may be video scenes, video shots or other portions of video, that include desired audio content, such as speech from a particular speaker. In many situations, a user may want to retrieve only video segments during which a particular speaker is talking. With the conventional video indexing algorithm, if the keyframes do not provide any visual indication of a desired speaker, then the user cannot select the video scenes or shots that contain speech from that speaker. In addition, since the conventional video indexing algorithm uses only visual information, the indexed video scene or shot may or may not contain speech. Even if a video scene or shot contains speech from a desired speaker, only a small segment of the video scene or shot may contain speech of that speaker. Thus, the user may have to unnecessarily watch the entire video scene or shot.
In view of the above-described concern, there is a need for a system and method for indexing videos based on audio information contained in the videos.