The amount of entertainment, information, and news that is available in videos is rapidly increasing. Therefore, there is a need for efficient video browsing techniques. Generally, a video can include three “tracks” that could be used for browsing: visual, audio, and textual (closed captions).
Most videos have story or topic structures, which are reflected in the visual track. The fundamental unit of the visual track is a shot or scene, which captures continuous action. Therefore, many video browsers expect that the video is first partitioned into story or topic segments. Scene change detection, also called temporal segmentation, indicates when a shot starts and ends. Scene change detection can be performed using DCT coefficients in the compressed domain. Frames can then be selected from the segments to form a summary of the video, which can then be browsed rapidly and used as an index into the entire video. However, video summaries do not provide any information about the content that is summarized.
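As a rough illustration of the temporal segmentation step described above, the following Python sketch flags a shot boundary wherever the histogram of per-block DC coefficients changes sharply between consecutive frames. This is only one simple approach under stated assumptions, not the method of any cited reference: the function names, the 8-bin histogram, the 0 to 256 value range, and the threshold of 0.5 are all hypothetical, and the DC coefficients are assumed to have already been parsed from the compressed bitstream.

```python
from typing import List, Sequence


def dc_histogram(dc_values: Sequence[float], bins: int = 8,
                 max_value: float = 256.0) -> List[float]:
    """Normalized histogram of one frame's per-block DC coefficients.

    Assumes (hypothetically) that DC values lie in [0, max_value).
    """
    hist = [0.0] * bins
    for v in dc_values:
        idx = int(v / max_value * bins)
        idx = max(0, min(idx, bins - 1))  # clamp out-of-range values
        hist[idx] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]


def detect_shot_boundaries(frames: Sequence[Sequence[float]],
                           threshold: float = 0.5,
                           bins: int = 8) -> List[int]:
    """Return each frame index i at which a new shot is assumed to start.

    A boundary is declared when the L1 distance between the DC histograms
    of frame i-1 and frame i exceeds `threshold` (the distance is in [0, 2]).
    """
    boundaries: List[int] = []
    prev = None
    for i, dc_values in enumerate(frames):
        hist = dc_histogram(dc_values, bins)
        if prev is not None:
            diff = sum(abs(a - b) for a, b in zip(hist, prev))
            if diff > threshold:
                boundaries.append(i)
        prev = hist
    return boundaries


def representative_frames(num_frames: int, boundaries: Sequence[int]) -> List[int]:
    """Pick the middle frame of every shot as a simple summary frame."""
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [num_frames]
    return [(s + e - 1) // 2 for s, e in zip(starts, ends) if e > s]


# Toy input: three dark frames followed by three bright frames, 16 blocks each.
toy_frames = [[10.0] * 16] * 3 + [[200.0] * 16] * 3
toy_boundaries = detect_shot_boundaries(toy_frames)
toy_summary = representative_frames(len(toy_frames), toy_boundaries)
```

Picking the middle frame of each shot is a purely mechanical choice. It illustrates why such a summary can serve as an index into the video while conveying nothing about what the selected content means.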
Another technique uses representative frames to organize the visual content of the video. However, so far, meaningful frame selection processes require manual intervention.
Another technique uses a language-based model that matches the audio track of an incoming video with expected grammatical elements of a news broadcast, and uses a priori models of the expected content of the video clip to parse the video. However, language-based models require speech recognition, which is known to be slow and error prone.
In the prior art, topic detection has been carried out using closed caption information, embedded captions and text obtained through speech recognition, by themselves or in combination with each other, see Hanjalic et al., “Dancers: Delft advanced news retrieval system,” IS&T/SPIE Electronic Imaging 2001: Storage and retrieval for Media Databases, 2001, and Jasinschi et al., “Integrated multimedia processing for topic segmentation and classification,” ICIP-2001, pp. 366-369, 2001. In those approaches, text is extracted from the video using some or all of the aforementioned sources and then the text is processed using various heuristics to extract the topics.
News anchor detection has been carried out using color, motion, texture and audio features. For example, one technique uses the audio track for speaker separation and the visual track to locate faces. The speaker separation first classifies audio segments into categories of speech and non-speech. The speech segments are then used to train Gaussian mixture models for each speaker, see Wang et al., “Multimedia Content Analysis,” IEEE Signal Processing Magazine, November 2000.
Motion-based video browsing is also known in the prior art, see U.S. patent application Ser. No. 09/845,009 “Video Summarization Using Descriptors of Motion Activity” filed by Divakaran et al. on Apr. 27, 2001, incorporated herein by reference. That system is efficient because it relies on simple computation in the compressed domain. Thus, that system can be used to rapidly generate visual summaries of a video. However, to be used for news video browsing, that method requires a topic list. If the topic list is not available, then the video may be segmented in a way that is inconsistent with the semantics of the content.
Of special interest to the present invention is using sound recognition for video browsing. For example, in videos, it may be desired to identify the most frequent speakers, the principal cast, or news “anchors.” If this could be done for a video of news broadcasts, for example, it would be possible to locate the beginning of each topic or “story” covered by the news video. Thus, it would be possible to skim rapidly through the video, only playing back a small portion starting where one of the news anchors begins to speak.
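The skimming idea described above can be sketched as follows. The sketch assumes that some earlier sound recognition step has already labeled the audio track with speaker identities, which is a strong assumption. The segment format `(start_seconds, end_seconds, speaker_label)`, the function names, and the 10-second preview length are hypothetical choices made only for illustration. The most frequent speaker is taken to be the anchor, and each point where the anchor takes over from another speaker is treated as a candidate topic start.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical segment format: (start_seconds, end_seconds, speaker_label).
Segment = Tuple[float, float, str]


def principal_speakers(segments: Sequence[Segment], top_n: int = 1) -> List[str]:
    """Rank speakers by total speaking time and return the top_n labels."""
    totals: Dict[str, float] = defaultdict(float)
    for start, end, speaker in segments:
        totals[speaker] += end - start
    # Sort by descending duration, breaking ties alphabetically.
    ranked = sorted(totals, key=lambda s: (-totals[s], s))
    return ranked[:top_n]


def skim_points(segments: Sequence[Segment], top_n: int = 1,
                preview_seconds: float = 10.0) -> List[Tuple[float, float]]:
    """Return (play_from, play_to) intervals for rapid skimming.

    An interval starts wherever a principal speaker begins to speak after
    a different speaker (or at the very start of the video).
    """
    anchors = set(principal_speakers(segments, top_n))
    points: List[Tuple[float, float]] = []
    previous_speaker = None
    for start, end, speaker in sorted(segments):
        if speaker in anchors and speaker != previous_speaker:
            points.append((start, min(end, start + preview_seconds)))
        previous_speaker = speaker
    return points


# Toy labeling of a 100-second news clip: an anchor and two reporters.
toy_segments = [
    (0.0, 20.0, "anchor"),
    (20.0, 50.0, "reporter_1"),
    (50.0, 65.0, "anchor"),
    (65.0, 80.0, "reporter_2"),
    (80.0, 100.0, "anchor"),
]
toy_points = skim_points(toy_segments)
```

For the toy labeling, playback would cover only the first ten seconds after each anchor entry (at 0, 50, and 80 seconds), so 30 seconds of the 100-second clip would be played. Ranking by total speaking time rather than by segment count is a design choice; either could stand in for “most frequent speaker.”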
Because news videos are typically arranged topic-wise in segments and the news anchor introduces each topic at the beginning of each segment, prior art news video browsing work has emphasized news anchor detection and topic detection. Thus, by knowing the topic boundaries, the user can skim through the news video from topic to topic until the desired topic is located, and then the desired topic can be viewed in its entirety.
Therefore, it is still desired to use the audio track for video browsing. However, as stated above, speech recognition is time consuming and error prone. Unlike speech recognition, which deals primarily with the specific problem of recognizing spoken words, sound recognition deals with the more general problem of characterizing and identifying audio signals, for example, different genres of music, musical instruments, natural sounds such as the rustling of leaves, glass breaking, or the crackling of a fire, animal sounds such as dogs barking, as well as human speech (adult, child, male, or female). Sound recognition is not concerned with deciphering the content, but rather with characterizing the content.
One sound recognition system is described by Casey, in “MPEG-7 Sound-Recognition Tools,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 6, June 2001, and U.S. Pat. No. 6,321,200, issued to Casey on Nov. 20, 2001, “Method for extracting features from a mixture of signals.” Casey uses reduced rank spectra of the audio signal and minimum-entropy priors. As an advantage, the Casey method allows one to annotate an MPEG-7 video with audio descriptors that are easy to analyze and detect, see “Multimedia Content Description Interface,” of “MPEG-7 Context, Objectives and Technical Roadmap,” ISO/IEC N2861, July 1999. Note that Casey's method involves both classification of a sound into a category as well as generation of a corresponding feature vector.