The rapidly increasing amount of digital multimedia data (e.g., digital still images and digital videos) makes the automatic classification of multimedia content an important problem. A wide range of semantic concepts can be used to represent multimedia content, such as objects (e.g., dog), scenes (e.g., beach) and events (e.g., birthday). Semantic concept classification in generic, unconstrained videos (e.g., videos captured by consumers and posted on YouTube) is a difficult problem, because such videos are captured in an unrestricted manner. They have diverse content and exhibit challenging conditions such as uneven lighting, clutter, occlusions, and complicated motions of both objects and camera.
Considerable effort has been devoted to developing methods for classifying general semantic concepts in generic videos. Examples of proposed approaches include the TRECVid high-level feature extraction method described by Smeaton et al. in the article “Evaluation campaigns and TRECVid” (Proc. 8th ACM International Workshop on Multimedia Information Retrieval, pp. 321-330, 2006) and the Columbia Consumer Video (CCV) concept classification method described by Jiang et al. in the article “Consumer video understanding: A benchmark database and an evaluation of human and machine performance” (Proc. 1st ACM International Conference on Multimedia Retrieval, 2011).
Most prior art approaches classify videos in the same way they classify images, using mainly visual information. Specifically, visual features are extracted from either two-dimensional (2-D) keyframes or three-dimensional (3-D) local volumes, and these features are treated as individual static descriptors to train concept classifiers. Among these methods, the ones using the “Bag-of-Words” (BoW) representation over 2-D or 3-D local descriptors (e.g., SIFT) are considered state-of-the-art, due to the effectiveness of BoW features in classifying objects and human actions.
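By way of illustration only, the BoW representation mentioned above can be sketched as follows. This is a minimal sketch, not the implementation of any cited method: it assumes a codebook of visual words has already been learned (e.g., by clustering local descriptors), and the names `squared_distance` and `bow_histogram` are hypothetical. Each local descriptor is assigned to its nearest codeword, and the normalized counts form the static feature vector used to train a concept classifier.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two descriptors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bow_histogram(
    descriptors: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[float]:
    """Quantize local descriptors against a codebook and return a
    normalized histogram of codeword occurrences (hypothetical helper)."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        # Hard assignment: index of the closest codeword.
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(descriptor, codebook[k]),
        )
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:
        # No descriptors were extracted; return an all-zero histogram.
        return [0.0] * len(codebook)
    return [count / total for count in counts]


# Toy example: two codewords and four 2-D descriptors (assumed values).
toy_codebook = [[0.0, 0.0], [10.0, 10.0]]
toy_descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [1.0, 1.0]]
toy_histogram = bow_histogram(toy_descriptors, toy_codebook)
```

In practice the descriptors would be high-dimensional (e.g., SIFT vectors) and the codebook would contain many more codewords; the toy values here serve only to show the quantization step.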
The importance of incorporating audio information to facilitate semantic concept classification has been recognized in several previous works. (For example, see the aforementioned article by Jiang et al. entitled “Consumer video understanding: A benchmark database and an evaluation of human and machine performance.”) Such approaches generally use a multi-modal fusion strategy (e.g., early fusion to train classifiers with concatenated audio and visual features, or late fusion to combine judgments from classifiers built over individual modalities).
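The difference between the two fusion strategies can be illustrated with a minimal sketch. The function names, the weighting scheme, and the numeric values below are assumptions made for illustration and are not taken from the cited works. Early fusion concatenates the per-modality feature vectors before a single classifier is trained, whereas late fusion combines the scores produced by separately trained audio and visual classifiers (here, by a weighted average, which is one common choice among several).

```python
from typing import List, Sequence


def early_fusion(
    audio_features: Sequence[float],
    visual_features: Sequence[float],
) -> List[float]:
    """Concatenate audio and visual features into one vector that a
    single classifier would be trained on (hypothetical helper)."""
    return list(audio_features) + list(visual_features)


def late_fusion(
    audio_score: float,
    visual_score: float,
    audio_weight: float = 0.5,
) -> float:
    """Combine the scores of two per-modality classifiers by a weighted
    average; the weight is an assumed tuning parameter."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_score + (1.0 - audio_weight) * visual_score


# Toy example with assumed feature values and classifier scores.
fused_features = early_fusion([0.1, 0.2], [0.3])
fused_score = late_fusion(audio_score=0.8, visual_score=0.4)
```

Note that both strategies treat the audio and visual modalities as independent sources of evidence; neither models how audio and visual signals co-occur over time.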
Cristani et al. in the article “Audio-visual event recognition in surveillance video sequences” (IEEE Transactions on Multimedia, Vol. 9, pp. 257-267, 2007) describe a video classification method that integrates audio and visual information for scene analysis in a typical surveillance scenario. Visual information is analyzed to detect visual background and foreground information, and audio information is analyzed to detect audio background and foreground information. The integration of audio and visual data is subsequently performed by exploiting the concept of synchrony between the detected audio and visual events.
There remains a need for a video classification method that better leverages temporal audio-visual correlation in order to provide more reliable and more efficient semantic classification.