1. Technical Field
This disclosure relates to video browsing and editing, and more specifically, to a video browsing and editing method and apparatus supported by audio browsing and labeling capabilities.
2. Description of the Related Art
The content of a story is mostly carried by audio information, i.e., one must listen to the audio to understand the content. The common visual signature of a person is the face, and detecting and matching faces from all angles is a difficult task. In addition, the face of the speaker may not appear in a video while he/she is talking. Visual information in most cases is not enough to tell the story, however. Thus, silent movies were supported by text. Audio is employed in addition to the visual information to enhance our understanding of the video. Continuous speech recognition with natural language processing can play a crucial role in video understanding and organization. Although current speech recognition engines have quite large vocabularies, continuous speaker and environment independent speech recognition is still, for the most part, out of reach (see, e.g., Y. L. Chang, W. Zeng, I. Kamel, and R. Alonso, "Integrated image and speech analysis for content-based video indexing," in Proc. of the Int'l Conf. on Multimedia Computing and Systems, (Hiroshima, Japan), pp. 306-313, IEEE Computer Society, Jun. 17-21 1996). Word spotting is a reliable method to extract more information from the speech. Hence, a digital video library system, described by A. Hauptmann and M. Smith in "Text, Speech, and Vision for Video Segmentation: The Informedia™ Project," applies speech recognition and text processing techniques to obtain the key-words associated with each acoustic "paragraph" whose boundaries are detected by finding silence periods in the audio track. Each acoustic paragraph is matched to the nearest scene break, allowing the generation of an appropriate video paragraph clip in response to a user request. This work has also shown that combining speech, text, and image analysis can provide much more information, thus improving content analysis and abstraction of video compared to using a single medium.
There exist many image, text, and audio processing supported methods to understand the content of the video; however, video abstraction is still a very difficult problem. In other words, no automatic system can do the job of a human video cataloger. Automatic techniques for video abstraction are important in the sense that they can make the human cataloger's job easier. However, the cataloger needs tools to correct the mistakes of the automatic system and to interpret the content of the audio and image sequences by looking at efficient representations of these data.
The content of a story is mostly carried by speech information, i.e., one must listen to the audio to understand the content of the scene. In addition, music is found to play an important role in expressing the director's intention by the way it is used. Music also has the effect of strengthening the impression and the activity of dramatic movies. Hence, speech and music are two important events in the audio track. In addition, speech detection is a pre-processing step to speech recognition and speaker identification.
Traditionally, silence has been an important event in telecommunications, and silence detection is well known from telephony. Many silence detection techniques rely on computing the power of the signal in small audio frames (usually 10 or 20 msec long portions of the signal) and thresholding. However, segmenting audio into speech and music classes is a new research topic. Recently, several methods have been proposed for segregating audio into classes based on feature extraction, training, and classification.
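The frame-power thresholding approach to silence detection mentioned above can be illustrated with a minimal Python sketch. The function names, the 20 msec default frame length, and the threshold value of 1e-4 are assumptions chosen for illustration; they are not taken from any particular telephony technique.

```python
import math

def frame_power(samples, sample_rate, frame_ms=20):
    """Mean-square power of consecutive, non-overlapping frames."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers

def detect_silence(samples, sample_rate, frame_ms=20, threshold=1e-4):
    """True for each frame whose power falls below the (assumed) threshold."""
    return [p < threshold for p in frame_power(samples, sample_rate, frame_ms)]

# Usage: 40 msec of a 440 Hz tone followed by 40 msec of zeros at 8 kHz.
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(320)]
signal = tone + [0.0] * 320
print(detect_silence(signal, rate))  # two tone frames, then two silent frames
```

In practice the threshold would be tuned to the recording level and background noise of the material at hand.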
Audio analysis for video content extraction and abstraction can be extended in several ways. All of the above audio segmentation methods fail when the audio contains multiple sources.
Speaker-based segmentation of the audio is widely used for non-linear audio access. Hence, the speech segments of the audio can be further divided into speaker segments. Although speaker verification and identification problems have been studied in detail in the past, segmentation of the speech into different speakers is a relatively new topic.
In review, audio-visual communication is often more effective than audio-only or text-based communication. Hence, video, and more particularly, digital video is rapidly becoming integrated into typical computing environments. Abstraction of the video data has become increasingly important. In addition to video abstraction, video indexing and retrieval have become important due to the immense amount of video stored in multimedia databases. However, automatic systems that can analyze the video and then extract reliable content information from the video automatically have not yet been provided. A human cataloger is still needed to organize video content.
Therefore, a need exists for a video browsing system and tools to facilitate a cataloger's task of analyzing and labeling video. A further need exists for enhancing the existing video browsers by incorporating audio based information spaces.
A system for browsing and editing video, in accordance with the present invention, includes a video source for providing a video document including audio information and an audio classifier coupled to the video source. The audio classifier is adapted to classify audio segments of the audio information into a plurality of classes. An audio spectrogram generator is coupled to the video source for generating spectrograms for the audio information to check that the audio segments have been identified correctly by the audio classifier. A browser is coupled to the audio classifier for searching the classified audio segments for editing and browsing the video document.
In alternate embodiments, the system may include a pitch computer coupled to the video source for computing a pitch curve for the audio information to identify speakers from the audio information. The pitch curve may indicate speaker changes relative to the generated spectrograms to identify and check speaker changes. The speaker changes are preferably identified in a speaker change list. The speaker change list may be identified in accordance with an audio label list wherein the audio label list stores audio labels for identifying the classes of audio. The system may include a speaker change browser for browsing and identifying speakers in the audio information. The system may further include a memory device for storing an audio label list, and the audio label list may store audio labels associated with the audio information for identifying the audio segments as one of speech, music and silence. The system may further include a video browser adapted for browsing and editing video frames. The video browser may provide frame indices for the video frames, the video frames being associated with audio labels and spectrograms to reference the audio information with the video frames.
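As an illustration of what a pitch computer might produce, the following Python sketch builds a pitch curve by estimating one pitch value per frame. The summary above does not specify how pitch is computed; the autocorrelation method, the function names, and the 50 to 500 Hz search range are assumptions made for this sketch only.

```python
import math

def estimate_pitch(frame, sample_rate, min_hz=50.0, max_hz=500.0):
    """Autocorrelation pitch estimate in Hz, or None when no peak is found."""
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(frame) - 1)
    best_lag, best_value = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else None

def pitch_curve(samples, sample_rate, frame_len):
    """One pitch estimate per non-overlapping frame of frame_len samples."""
    return [estimate_pitch(samples[i:i + frame_len], sample_rate)
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

# Usage: a 200 Hz test tone sampled at 8 kHz, analyzed in 50 msec frames.
rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]
print(pitch_curve(tone, rate, 400))
```

A sustained shift in such a curve is one cue a cataloger could compare against the spectrogram when checking a suspected speaker change.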
A method for editing and browsing a video, in accordance with the present invention, includes providing a video clip including audio, segmenting and labeling the audio into music, silence and speech classes in real-time, determining pitch for the speech class to identify and check changes in speakers and browsing the changes in speaker and the audio labels to associate the changes in speaker and the audio labels with frames of the video clip.
In other methods, the step of segmenting and labeling the audio into music, silence and speech classes in real-time may include the step of computing statistical time-domain features, based on a zero crossing rate and a root mean square energy distribution for audio segments. The audio segments may include a length of about 1.2 seconds. The step of segmenting and labeling the audio into music, silence and speech classes in real-time may include the step of classifying audio segments into music or speech classes based on a similarity metric. The similarity metric may be based on a distance measurement between feature vectors.
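The feature computation and distance-based classification described above can be sketched in Python as follows. The specific statistics (mean and variance of the per-frame zero crossing rate and root mean square energy), the Euclidean distance, and the reference vectors are assumptions for illustration; the summary states only that the features are statistical and that the similarity metric may be based on a distance between feature vectors.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs that change sign (frame length >= 2)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def rms_energy(frame):
    """Root mean square energy of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def mean_var(values):
    """Mean and population variance of a non-empty list."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)

def segment_features(segment, frame_len):
    """Feature vector [zcr mean, zcr var, rms mean, rms var] for one segment.

    The segment must contain at least one full frame.
    """
    frames = [segment[i:i + frame_len]
              for i in range(0, len(segment) - frame_len + 1, frame_len)]
    zcr = [zero_crossing_rate(f) for f in frames]
    rms = [rms_energy(f) for f in frames]
    return [*mean_var(zcr), *mean_var(rms)]

def classify(features, references):
    """Label of the reference vector closest to features (Euclidean distance)."""
    return min(references, key=lambda label: math.dist(features, references[label]))

# Usage with hypothetical reference vectors; real ones would come from training.
references = {"speech": [0.10, 0.020, 0.30, 0.050],
              "music": [0.05, 0.001, 0.40, 0.010]}
print(classify([0.09, 0.018, 0.31, 0.045], references))
```

At a sampling rate of 8 kHz, a segment of about 1.2 seconds would hold roughly 9600 samples, which would be split into short frames before the statistics are taken.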
In still other methods, the step of segmenting and labeling the audio into music, silence and speech classes in real-time may include the step of detecting silence in the audio using energy and zero crossing rates computed for the audio. The step of browsing the changes in speaker and the audio labels to associate the changes in speaker and the audio labels with frames of the video clip may include the steps of editing the video clip based on one of the changes in speaker and the audio labels. The step of editing may include the step of splitting a block including an audio event or speaker segment into two blocks. The step of editing may include the step of merging at least one of the two blocks with another block. The step of editing may include the step of playing the edited video clip. The above methods and method steps may be implemented by a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for editing and browsing a video.
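The split and merge editing steps can be pictured with a small Python sketch. The Block structure, its field names, and the choice of inclusive frame indices are hypothetical; they are one possible way to represent an audio event or speaker segment tied to video frames.

```python
from dataclasses import dataclass

@dataclass
class Block:
    start: int  # first video frame index (inclusive)
    end: int    # last video frame index (inclusive)
    label: str  # audio label or speaker name

def split_block(blocks, index, frame):
    """Split blocks[index] in two so the second block begins at frame."""
    b = blocks[index]
    if not (b.start < frame <= b.end):
        raise ValueError("split frame must fall inside the block")
    left = Block(b.start, frame - 1, b.label)
    right = Block(frame, b.end, b.label)
    return blocks[:index] + [left, right] + blocks[index + 1:]

def merge_blocks(blocks, index, label=None):
    """Merge blocks[index] with the block that follows it."""
    a, b = blocks[index], blocks[index + 1]
    merged = Block(a.start, b.end, label or a.label)
    return blocks[:index] + [merged] + blocks[index + 2:]

# Usage: move the speech/music boundary from frame 100 back to frame 40.
blocks = [Block(0, 99, "speech"), Block(100, 199, "music")]
blocks = split_block(blocks, 0, 40)
blocks = merge_blocks(blocks, 1, label="music")
print(blocks)
```

Both functions return new lists rather than modifying the input, which keeps an undo history simple to maintain in an interactive editor.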