Segmenting scripted or unscripted video content is a key task in video retrieval and browsing applications. A video can be segmented by identifying highlights. A highlight is any portion of the video that contains a key or remarkable event. Because the highlights capture the essence of the video, highlight segments can provide a good summary of the video. For example, in a video of a sporting event, a summary would include scoring events and exciting plays.
FIG. 1 shows one typical prior art audio classification method 100, see Ziyou Xiong, Regunathan Radhakrishnan, Ajay Divakaran and Thomas S. Huang, “Effective and Efficient Sports Highlights Extraction Using the Minimum Description Length Criterion in Selecting GMM Structures,” Intl. Conf. on Multimedia and Expo, June 2004; and U.S. patent application Ser. No. 10/922,781 “Feature Identification of Events in Multimedia,” filed on Aug. 20, 2004 by Radhakrishnan et al., both incorporated herein by reference.
An audio signal 101 is the input. Features 111 are extracted 110 from frames 102 of the audio signal 101. The features 111 can be in the form of modified discrete cosine transforms (MDCTs).
As also shown in FIG. 2, the features 111 are classified as labels 121 by a generic multi-way classifier 200. The generic multi-way classifier 200 has a general set of trained audio classes 210, e.g., applause, cheering, music, normal speech, and excited speech. Each audio class is modeled by a Gaussian mixture model (GMM). The parameters of the GMMs are determined from features extracted from training data 211.
The GMMs of the features 111 of the frames 102 are classified by determining a likelihood that the GMM of the features 111 corresponds to the GMM for each class, and comparing 220 the likelihoods. The class with the maximum likelihood is selected as the label 121 of a frame of features.
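The maximum-likelihood labeling step can be illustrated with a minimal sketch. It assumes a common simplification in which the feature vector of each frame is scored directly under each class GMM, with diagonal covariances. The toy models in `CLASS_GMMS`, the two-dimensional features, and all function names are hypothetical and are not taken from the cited references.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v)
        for xi, mu, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of x under a GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gaussian_diag(x, mu, var) for w, mu, var in gmm]
    peak = max(terms)  # log-sum-exp keeps the sum numerically stable
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def classify_frame(x, class_gmms):
    """Return the class label whose GMM gives x the maximum likelihood."""
    return max(class_gmms, key=lambda label: gmm_log_likelihood(x, class_gmms[label]))

# Hypothetical toy models: two audio classes over 2-D features.
CLASS_GMMS = {
    "applause": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "music": [(1.0, [5.0, 5.0], [1.0, 1.0])],
}

frames = [[0.2, 0.1], [4.8, 5.3]]
labels = [classify_frame(f, CLASS_GMMS) for f in frames]
```

In this sketch the comparison 220 reduces to a single `max` over the per-class log-likelihoods, one call per frame.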
In the generic classifier 200, each class is trained separately. The number m of Gaussian mixture components of each model is based on minimum description length (MDL) criteria. The MDL criteria are commonly used when training generative models. The MDL criteria for input training data 211 can have a form:

$$\mathrm{MDL}(m) = -\log p(\mathrm{data} \mid \Theta, m) - \log p(\Theta \mid m), \qquad (1)$$

where m is the number of mixture components of a particular model with parameters Θ, and p is the likelihood or probability.
The first term of Equation (1) is the negative log likelihood of the training data under an m-component mixture model. This can also be considered as an average code length of the data with respect to the m-component mixture model. The second term can be interpreted as an average code length for the model parameters Θ. Using these two terms, the MDL criteria balance identifying a particular model that most likely describes the training data against the number of parameters required to describe that model.
A search is made over a range of values for m, e.g., a range between 1 and 40. For each value m, a parameter set Θ_m is determined using an expectation maximization (EM) optimization process that maximizes the data likelihood term, and the MDL score is calculated accordingly. The value m with the minimum MDL score is selected. Using the MDL to train the GMMs of the classes 210 comes with an implicit assumption that selecting a good generative GMM for each audio class separately yields better general classification performance.
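The search over the number of mixture components can be sketched as follows. This is only an illustration under stated assumptions: the features are one-dimensional, the parameter term −log p(Θ|m) is approximated by the commonly used penalty of half the number of free parameters times log N, and the synthetic two-cluster data, the quantile initialization, and the names `fit_gmm_em` and `mdl_score` are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_em(data, m, iters=50):
    """Fit an m-component 1-D GMM by EM; return (params, log_likelihood)."""
    n = len(data)
    ordered = sorted(data)
    mean_all = sum(data) / n
    var_all = sum((x - mean_all) ** 2 for x in data) / n
    weights = [1.0 / m] * m
    # Deterministic initialization (an assumption): means at evenly spaced quantiles.
    means = [ordered[int(n * (j + 0.5) / m)] for j in range(m)]
    variances = [var_all] * m
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each sample.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, mu, v) for w, mu, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(m):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # floor guards against collapsed components
    loglik = sum(
        math.log(max(sum(w * normal_pdf(x, mu, v)
                         for w, mu, v in zip(weights, means, variances)), 1e-300))
        for x in data
    )
    return (weights, means, variances), loglik

def mdl_score(loglik, m, n):
    """-log p(data|Theta,m) plus an assumed (free params / 2) * log(n) penalty."""
    free_params = 3 * m - 1  # m means, m variances, m - 1 independent weights
    return -loglik + 0.5 * free_params * math.log(n)

# Synthetic training data: two well-separated clusters (hypothetical).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(10.0, 1.0) for _ in range(100)]

# Search a small range of m and keep the value with the minimum MDL score.
scores = {m: mdl_score(fit_gmm_em(data, m)[1], m, len(data)) for m in range(1, 5)}
best_m = min(scores, key=scores.get)
```

For data of this kind, a single component fits poorly, so its score is high, while the penalty term discourages adding components beyond what the data support.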
The determination 130 of the importance levels 131 is dependent on a task 140 or application. For example, the importance levels correspond to a percentage of frames that are labeled as important for a particular summarization task. In a sports highlighting task, the important classes can be excited speech or cheering. In a concert highlighting task, the important class can be music. By setting thresholds on the importance levels, different segmentations and summarizations can be obtained for the video content.
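A minimal sketch of a task-dependent importance determination is given below. It assumes the importance level is computed as the fraction of frames in fixed, non-overlapping windows whose label belongs to the classes important for the task. The window length, the threshold, the `TASK_CLASSES` mapping, and the label sequence are hypothetical values chosen for illustration.

```python
def importance_levels(labels, important, window):
    """Fraction of frames in each non-overlapping window with an important label."""
    levels = []
    for start in range(0, len(labels), window):
        chunk = labels[start:start + window]
        levels.append(sum(lab in important for lab in chunk) / len(chunk))
    return levels

def select_segments(levels, threshold):
    """Indices of windows whose importance level meets the threshold."""
    return [i for i, level in enumerate(levels) if level >= threshold]

# Hypothetical mapping from task to its important audio classes.
TASK_CLASSES = {
    "sports": {"excited speech", "cheering"},
    "concert": {"music"},
}

# Frame labels as produced by a single generic classifier (toy sequence).
labels = (
    ["normal speech"] * 3 + ["cheering"]
    + ["excited speech"] * 2 + ["cheering"] * 2
    + ["music"] * 4
)

sports_levels = importance_levels(labels, TASK_CLASSES["sports"], window=4)
concert_levels = importance_levels(labels, TASK_CLASSES["concert"], window=4)
sports_segments = select_segments(sports_levels, threshold=0.5)
concert_segments = select_segments(concert_levels, threshold=0.5)
```

The same label sequence serves both tasks here; only the set of important classes and the threshold change, so raising or lowering the threshold yields shorter or longer summaries.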
By selecting an appropriate set of classes 210 and a comparable generic multi-way classifier 200, only the determination 130 of the importance levels 131 needs to depend on the task 140. Thus, different tasks can be associated with the same classifier. This simplifies the implementation, which needs only a single classifier.