This invention relates generally to multimedia content, and more particularly to extracting high-level features from low-level features of the multimedia content.
Video analysis can be defined as processing a video with the intention of understanding its content. The understanding can range from a “low-level” understanding, such as detecting shot boundaries in the video, to a “high-level” understanding, such as detecting a genre of the video. The low-level understanding can be achieved by analyzing low-level features, such as color, motion, texture, shape, and the like, to generate content descriptions. The content description can then be used to index the video.
The proposed MPEG-7 standard provides a framework for such content description. MPEG-7 is the most recent standardization effort taken on by the MPEG committee and is formally called “Multimedia Content Description Interface,” see “MPEG-7 Context, Objectives and Technical Roadmap,” ISO/IEC N2861, July 1999.
Essentially, this standard plans to incorporate a set of descriptors and description schemes that can be used to describe various types of multimedia content. The descriptors and description schemes are associated with the content itself and allow for fast and efficient searching of material that is of interest to a particular user. It is important to note that this standard is not meant to replace previous coding standards. Rather, it builds on other standard representations, especially MPEG-4, because the multimedia content can be decomposed into different objects and each object can be assigned a unique set of descriptors. Also, the standard is independent of the format in which the content is stored.
The primary application of MPEG-7 is expected to be search and retrieval, see “MPEG-7 Applications,” ISO/IEC N2861, July 1999. In a simple application environment, a user may specify some attributes of a particular video object. At this low level of representation, these attributes may include descriptors that describe the texture, motion, and shape of the particular video object. A method of representing and comparing shapes has been described in U.S. patent application Ser. No. 09/326,759, “Method for Ordering Image Spaces to Represent Object Shapes,” filed on Jun. 4, 1999 by Lin et al., and a method for describing the motion activity has been described in U.S. patent application Ser. No. 09/406,444, “Activity Descriptor for Video Sequences,” filed on Sep. 27, 1999 by Divakaran et al.
To obtain a high-level representation, one may consider more elaborate description schemes that combine several low-level descriptors. In fact, these description schemes may even contain other description schemes, see “MPEG-7 Multimedia Description Schemes WD (V1.0),” ISO/IEC N3113, December 1999 and U.S. patent application Ser. No. 09/385,169, “Method for representing and comparing multimedia content,” filed Aug. 30, 1999 by Lin et al.
The descriptors and description schemes that will be provided by the MPEG-7 standard can be considered as either low-level syntactic or high-level semantic, where the syntactic information refers to physical and logical signal aspects of the content, and the semantic information refers to conceptual meanings of the content.
In the following, these high-level semantic features will sometimes also be referred to as “events.”
For a video, the syntactic events may be related to the color, shape and motion of a particular video object. On the other hand, the semantic events generally refer to information that cannot be extracted from low-level descriptors, such as the time, name, or place of an event, e.g., the name of a person in the video.
However, automatic and semi-automatic extraction of high-level or semantic features such as video genre, event semantics, etc., is still an open topic for research. For instance, it is straightforward to extract the motion, color, shape, and texture from a video of a football event, and to establish low-level similarity with another football video based on the extracted low-level features. These techniques are well described. However, it is not straightforward to automatically identify the video as that of a football event from its low-level features.
A number of extraction techniques are known in the prior art, see for example, Chen et al., “ViBE: A New Paradigm for Video Database Browsing and Search,” Proc. IEEE Workshop on Content-Based Access of Image and Video Databases, 1998; Zhong et al., “Clustering Methods for Video Browsing and Annotation,” SPIE Conference on Storage and Retrieval for Image and Video Databases, Vol. 2670, February 1996; Kender et al., “Video Scene Segmentation via Continuous Video Coherence,” in IEEE CVPR, 1998; Yeung et al., “Time-constrained Clustering for Segmentation of Video into Story Units,” ICPR, Vol. C, August 1996; and Yeo et al., “,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5, No. 6, December 1995.
Most of these techniques first segment the video into shots using low-level features extracted from individual frames. Then, the shots are grouped into scenes using the extracted features. Based on this extraction and grouping, these techniques usually build a hierarchical structure of the video content.
The problem with these approaches is that they are not flexible. Thus, it is difficult to do a detailed analysis to bridge the gap between low-level features and high-level features, such as semantic events. Moreover, too much information is lost during the segmentation process.
Therefore, it is desired to provide a system and apparatus that can extract high-level features from a video without first segmenting the video into shots.
It is an object of the invention to provide automatic content analysis using frame-based, low-level features. The invention first extracts features at the frame level and then labels each frame based on each of the extracted features. For example, if three features are used, namely color, motion, and audio, each frame has at least three labels, i.e., color, motion, and audio labels.
This reduces the video to multiple sequences of labels, there being one sequence of labels for each feature. The multiple label sequences retain considerable information, while simultaneously reducing the video into a simple form. It should be apparent to those of ordinary skill in the art that the amount of data required to code the labels is orders of magnitude less than the data that encodes the video itself. This simple form enables machine learning techniques such as Hidden Markov Models (HMM), Bayesian Networks, Decision Trees, and the like, to perform high-level feature extraction.
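As an illustration only, the frame labeling described above can be sketched in Python as follows. The sketch assumes that each frame has already been reduced to one scalar value per feature and that labels are obtained by simple threshold quantization. The feature values, the thresholds in `THRESHOLDS`, and the function name `label_frames` are hypothetical and are not taken from this description.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence

# Hypothetical quantization thresholds per feature (assumed for illustration).
# A feature with k thresholds yields k + 1 possible labels: 0, 1, ..., k.
THRESHOLDS: Dict[str, List[float]] = {
    "color": [0.33, 0.66],
    "motion": [0.5],
    "audio": [0.2, 0.8],
}


def label_frames(frames: Sequence[Dict[str, float]]) -> Dict[str, List[int]]:
    """Return one label sequence per feature, with one label per frame."""
    sequences: Dict[str, List[int]] = {name: [] for name in THRESHOLDS}
    for frame in frames:
        for name, cuts in THRESHOLDS.items():
            # The label is the index of the interval containing the value.
            sequences[name].append(bisect_right(cuts, frame[name]))
    return sequences


# Three frames, each with assumed per-frame feature values in [0, 1].
frames = [
    {"color": 0.10, "motion": 0.9, "audio": 0.1},
    {"color": 0.40, "motion": 0.7, "audio": 0.5},
    {"color": 0.90, "motion": 0.2, "audio": 0.9},
]
label_sequences = label_frames(frames)
print(label_sequences)
```

Each resulting sequence holds one small integer per frame, which is consistent with the observation that the labels require far less data than the video itself.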
The procedures according to the invention offer a way to combine low-level features that performs well. The high-level feature extraction system according to the invention provides an open framework that enables easy integration with new features. Furthermore, the invention can be integrated with traditional methods of video analysis. The invented system provides functionalities at different granularities that can be applied to applications with different requirements. The invention also provides a system for flexible browsing or visualization using individual low-level features or their combinations. Finally, the feature extraction according to the invention can be performed in the compressed domain for fast, and preferably real-time, system performance. Note that the extraction need not necessarily be in the compressed domain, even though compressed-domain extraction is preferred.
More particularly, the invention provides a system and method for extracting high-level features from a video including a sequence of frames. Low-level features are extracted from each frame of the video. Each frame of the video is labeled according to the extracted low-level features to generate sequences of labels. Each sequence of labels is associated with one of the extracted low-level features. The sequences of labels are analyzed using machine learning techniques to extract high-level features of the video.
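By way of a non-limiting sketch, one possible form of the analysis step is to score a label sequence against several discrete Hidden Markov Models and select the best-scoring model. The two models below, their parameters, the class names "high_activity" and "low_activity", and the function names are assumptions made purely for illustration. This description does not specify any particular model structure or training procedure.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def forward_likelihood(obs: Sequence[int], start: List[float],
                       trans: Matrix, emit: Matrix) -> float:
    """Compute P(obs | model) for a discrete HMM with the forward algorithm."""
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n_states = len(start)
    # Initialization: probability of starting in each state and emitting obs[0].
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    # Recursion: propagate through the transitions, then emit the next label.
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state models over binary motion labels (0 = low, 1 = high).
MODELS: Dict[str, Dict[str, list]] = {
    "high_activity": {
        "start": [0.5, 0.5],
        "trans": [[0.9, 0.1], [0.1, 0.9]],
        "emit": [[0.2, 0.8], [0.1, 0.9]],
    },
    "low_activity": {
        "start": [0.5, 0.5],
        "trans": [[0.9, 0.1], [0.1, 0.9]],
        "emit": [[0.8, 0.2], [0.9, 0.1]],
    },
}


def classify(obs: Sequence[int]) -> str:
    """Return the name of the model that assigns obs the highest likelihood."""
    return max(MODELS, key=lambda name: forward_likelihood(obs, **MODELS[name]))


print(classify([1, 1, 1, 0, 1]))
```

For label sequences of realistic length, the products of probabilities become very small, so a practical version would likely use scaling or log-domain arithmetic. The plain form is kept here for readability.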