Most prior art methods for summarizing multimedia content have focused on detecting known patterns of events in the content to provide a summary of the content. As a result, the patterns of events that are useful for summarizing are limited to particular known genres of multimedia. It is also well known how to extract the patterns using supervised statistical learning tools.
For the genre of news videos, detection of ‘story’ boundaries using closed caption text, speech transcript analysis, and speaker-based segmentation has been shown to be useful, Rainer, “Automatic text recognition for video indexing,” Proc. ACM Multimedia, 1996, and Hsu et al., “A statistical framework for fusing mid-level perceptual features in news story segmentation,” Proc. of ICME, 2003.
For the genre of situation comedies, detection of physical settings using mosaic representation of a scene, and detection of leading cast characters using audio-visual cues have been shown to be useful, Aner et al., “Video summaries through mosaic-based shot and scene clustering,” Proc. European Conference on Computer Vision, 2002, and Li, “Content-based video analysis, indexing and representation using multimodal information,” Ph.D. Thesis, University of Southern California, 2003.
For sports video summarization, some methods detect domain-specific events that are correlated with highlights using audio-visual cues, Pan et al., “Detection of slow-motion replay segments in sports video for highlights generation,” Proc. IEEE International Conf. on Acoustics, Speech and Signal Processing, 2001, and Xu et al., “Creating audio keywords for event detection in soccer video,” Proc. of ICME, 2003. Another method extracts play-break segments in an unsupervised manner, Xie et al., “Unsupervised mining of statistical temporal structures in video,” Video Mining, Rosenfeld et al. Eds, Kluwer Academic Publishers, 2003.
For movie content, detection of syntactic structures, such as scenes with only two speakers, and detection of ‘unusual’ events, such as explosions, have been shown to be useful, Sundaram et al., “Determining computable scenes in films and their structures using audio-visual memory models,” ACM Multimedia, 2000.
For surveillance content, detection of ‘unusual’ events using object segmentation and tracking from video has been shown to be effective, Wu et al., “Multi-camera spatio-temporal fusion and biased sequence data learning for security surveillance,” ACM Multimedia, 2003.
The following U.S. Patents and Patent Applications also describe methods for extracting features and detecting events in multimedia, and summarizing multimedia: U.S. patent application Ser. No. 09/518,937, “Method for Ordering Data Structures in Multimedia,” filed Mar. 6, 2000 by Divakaran, et al.; U.S. patent application Ser. No. 09/610,763, “Extraction of Semantic and Higher Level Features from Low-Level Features of Multimedia Content,” filed on Jul. 6, 2000, by Divakaran, et al.; U.S. Pat. No. 6,697,523, “Video Summarization Using Motion and Color Descriptors,” issued to Divakaran on Feb. 24, 2004; U.S. patent application Ser. No. 09/845,009, “Method for Summarizing a Video Using Motion Descriptors,” filed on Apr. 27, 2001 by Divakaran, et al.; U.S. patent application Ser. No. 10/610,467, “Method for Detecting Short Term Unusual Events in Videos,” filed by Divakaran, et al. on Jun. 30, 2003; and U.S. patent application Ser. No. 10/729,164, “Audio-visual Highlights Detection Using Hidden Markov Models,” filed by Divakaran, et al. on Dec. 5, 2003. All of the above are incorporated herein by reference.
Even though it is known how to detect specific events for some specific genres of multimedia, generalized event detection remains a problem because of intra-genre variations that result from the differing production styles used by different content providers, and other factors. For instance, events in surveillance videos cannot be anticipated in advance; otherwise, surveillance videos would not be necessary. Thus, it is impossible to construct supervised models for event detection for many genres of videos.
An additional problem is to identify specific features in the content that are associated with specific events. For example, it is necessary to identify which types of visual and audio cues are available in the content to assist the task of event detection.
Clearly, there is a need for a method that can identify features that are associated with events.
The following are some of the desired requirements for multimedia summarization and event detection.
First and foremost, the method should be content-adaptive and unsupervised. Second, the method should have a common feature extraction and statistical analysis framework to discover patterns of events. Then, the same feature extraction process can be used as a front-end for all genres of multimedia, and the same post-processing stage can act upon discovered patterns to identify events, even if the meaning of what is unusual changes depending on the genre of the multimedia. The method should also incorporate a ranking scheme for detected events so that an appropriate summary can be determined.
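The requirements above can be illustrated with a minimal sketch, which is not the method of any of the cited references. It assumes that a genre-independent front-end has already produced one numeric feature vector per segment of the content. The function name rank_unusual_segments, the parameter top_k, and the use of a per-dimension z-score distance are hypothetical choices made only for illustration. The statistics are estimated from the content itself, so no labeled training data is needed, and the resulting scores give a ranking from which a summary of a chosen length could be selected.

```python
from statistics import mean, pstdev


def rank_unusual_segments(features, top_k=3):
    """Rank segments by deviation from the content's own statistics.

    features: list of equal-length numeric feature vectors, one per segment
              (assumed to come from a common feature extraction front-end).
    Returns up to top_k (segment_index, score) pairs, most unusual first.
    """
    if not features:
        return []

    dims = len(features[0])
    # Per-dimension statistics estimated from the content itself (unsupervised).
    columns = [[row[d] for row in features] for d in range(dims)]
    mus = [mean(col) for col in columns]
    # Fall back to 1.0 when a dimension is constant, to avoid division by zero.
    sigmas = [pstdev(col) or 1.0 for col in columns]

    scores = []
    for index, row in enumerate(features):
        # Euclidean norm of the per-dimension z-scores.
        score = sum(((x - m) / s) ** 2 for x, m, s in zip(row, mus, sigmas)) ** 0.5
        scores.append((index, score))

    # Higher score means further from the usual behavior of this content.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]
```

For example, given five segments whose feature vectors are all close to [0.1, 0.2] except one near [0.9, 0.95], that one segment would be ranked first. Because the notion of "usual" is derived from each piece of content separately, the same code could act on news, sports, or surveillance features without change.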