Segmenting videos is an important task in many video summarization, retrieval and browsing applications. As used herein, a video includes video content containing visual information (pixels), and audio content containing audio information (acoustic signals). The video content and the audio content are synchronized. The content can be unscripted or scripted.
Unscripted content, such as content acquired from surveillance and sporting events, can be segmented by identifying highlights. A highlight is any portion of the video that contains an unusual or interesting event. Because the highlights can capture the essence of the video, segments of the video containing just highlights can provide a summary of the video. For example, in a video of a sporting event, a summary can include scoring opportunities.
Scripted content, such as news and drama, is usually structured as a sequence of scenes. One can get an essence of the content by viewing representative scenes or portions thereof. Hence, table of contents (ToC) based video browsing provides a summarization of scripted content. For instance, a news video composed of a sequence of news stories can be summarized or browsed using a key-frame representation for each portion in a story. For extraction of the ToC, segmentation is often used.
Video segmentation based on the visual content is known. Typically, low-level features, such as color intensities and motion, are used. However, such segmentation can be complex and time consuming because the underlying data set (pixels) is large and complex. Accurate visual segmentation is usually genre specific and cannot be applied to arbitrary types of content. Correct feature selection can be critical for a successful visual segmentation.
Videos can also be segmented using the audio content. Low-level acoustic features are extracted from the audio content. The low-level features typically represent periodicity, randomness and spectral characteristics of the audio content. Correlations with known data can then determine optimal thresholds for scene segmentation.
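As an illustration of the kinds of low-level acoustic features mentioned above, the following minimal sketch computes two simple per-frame measures. The function names, the choice of zero-crossing rate as a rough indicator of randomness, and the choice of a normalized autocorrelation peak as an indicator of periodicity are assumptions made for this example; the text does not specify which features are used.

```python
import math
import random

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (assumed noisiness cue)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def periodicity(frame, min_lag, max_lag):
    """Peak of the autocorrelation, normalized by frame energy, over a lag range."""
    energy = sum(s * s for s in frame)
    if energy == 0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        best = max(best, r / energy)
    return best

# Hypothetical test signals: a 200 Hz tone and uniform noise, 400 samples at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(400)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(400)]

tone_features = (zero_crossing_rate(tone), periodicity(tone, 20, 80))
noise_features = (zero_crossing_rate(noise), periodicity(noise, 20, 80))
```

In this sketch the tone yields a low zero-crossing rate and a high periodicity value, while the noise yields the opposite, which is the kind of contrast a threshold could exploit.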
Most audio content can be classified into a small number of audio classes, e.g., speech, music, silence, applause and cheering.
FIG. 1 shows one typical prior art audio classification method 100. Audio content 101 is the input to the method 100. The audio content 101 can be part of a video 103. The audio content can be synchronized with video content 104. Audio features 111 are extracted 110 from relatively short frames 102 of the audio content 101, e.g., frames of about ten milliseconds. The audio features 111 can have a number of different forms, e.g., modified discrete cosine transforms (MDCTs) or mel-frequency cepstral coefficients (MFCCs).
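The feature extraction step can be sketched as follows, assuming non-overlapping ten millisecond frames and an unwindowed MDCT computed directly from its definition. The function names and the 8 kHz sample rate are hypothetical; practical MDCT implementations typically use overlapping, windowed frames and a fast algorithm, neither of which is shown here.

```python
import math

def split_into_frames(samples, sample_rate, frame_ms=10.0):
    """Split samples into consecutive non-overlapping frames of frame_ms each."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    n_frames = len(samples) // frame_len  # a trailing partial frame is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

def mdct(frame):
    """Unwindowed MDCT: a frame of 2N samples yields N coefficients."""
    two_n = len(frame)
    n = two_n // 2
    coeffs = []
    for k in range(n):
        acc = 0.0
        for t in range(two_n):
            acc += frame[t] * math.cos(math.pi / n * (t + 0.5 + n / 2.0) * (k + 0.5))
        coeffs.append(acc)
    return coeffs

# Hypothetical input: 100 ms of a 440 Hz tone sampled at 8 kHz (800 samples).
sample_rate = 8000
audio = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(800)]

frames = split_into_frames(audio, sample_rate)   # ten frames of 80 samples
features = [mdct(frame) for frame in frames]     # 40 coefficients per frame
```

Each entry of `features` plays the role of the audio features for one frame in this simplified setting.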
As also shown in FIG. 2, a classifier 200 assigns a label to the audio features 111 in each frame to generate a sequence of consecutive labels 121. Each label represents one of the audio classes. The classifier 200 has a set of trained classes 210, e.g., applause, cheering, music, speech, and silence. Each class is modeled by, e.g., a Gaussian mixture model (GMM). The parameters of the GMMs are determined from low-level features extracted from training data 211. The audio features 111 can be classified by determining 220 a likelihood that the GMMs of the audio features 111 in the content correspond to the GMMs for each trained class. Thus, the labels 121 can be considered time series data that represent a low-level temporal evolution of a semantic interpretation of the audio content.
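A simplified version of this classification can be sketched as follows. The sketch assumes diagonal-covariance GMMs and scores each frame's feature vector directly under each class model, assigning the label of the class with the highest log-likelihood. The two-dimensional features, the three classes, and all model parameters are hypothetical values chosen for illustration; they are not trained from data and are not taken from the method of FIG. 2.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of x under a GMM given as a list of (weight, mean, var)."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def classify_frames(features, class_models):
    """Label each frame with the class whose GMM gives the highest likelihood."""
    return [max(class_models, key=lambda name: gmm_log_likelihood(x, class_models[name]))
            for x in features]

# Hypothetical, hand-picked class models over two-dimensional features.
class_models = {
    "silence": [(1.0, [0.0, 0.0], [0.1, 0.1])],
    "speech": [(0.5, [1.0, 0.2], [0.2, 0.2]), (0.5, [1.5, 0.5], [0.2, 0.2])],
    "music": [(1.0, [3.0, 3.0], [0.5, 0.5])],
}

frame_features = [[0.05, -0.02], [1.2, 0.3], [2.9, 3.1]]
labels = classify_frames(frame_features, class_models)
# labels == ["silence", "speech", "music"]
```

The resulting list of labels, one per frame, corresponds to the time series of labels described above.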