The capture and sharing of digital videos have become increasingly popular. As the number of videos available for viewing has grown, the development of methods to organize and search collections of videos has become increasingly important. An important technology in support of these goals is the classification of unconstrained videos according to semantic concepts by automatic analysis of video content. These semantic concepts include generic categories, such as scene (e.g., beach, sunset), event (e.g., birthday, wedding), location (e.g., museum, playground) and object (e.g., animal, boat). Unconstrained videos are captured in an unrestricted manner, like those videos taken by consumers and posted on internet sites such as YouTube. This is a difficult problem due to the diverse video content, as well as challenging conditions such as uneven lighting, clutter, occlusions, and complicated motions of both objects and the camera.
To exploit the power of both visual and audio aspects for video concept detection, multi-modal fusion approaches have attracted much interest. For example, see the article “Biologically motivated audio-visual cue integration for object categorization” by J. Anemueller, et al. (Proc. International Conference on Cognitive Systems, 2008), and the article “Large-scale multimodal semantic concept detection for consumer video” by S. F. Chang, et al. (Proc. 9th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2007). With these approaches, visual features over global images, such as color and texture, are extracted from image frames, and audio features, such as mel-frequency cepstral coefficients (MFCCs), are generated from the audio signal in the same time window.
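The synchronization implied by "the same time window" amounts to mapping each image frame to the span of audio samples it overlaps. A minimal sketch follows; the frame rate, sample rate, and function name are illustrative assumptions, not values from the cited works.

```python
# Minimal sketch of aligning audio samples with video frames so that
# per-frame visual features and audio features (e.g., MFCCs) can be
# computed over the same time window. Rates are assumed values.

FRAME_RATE = 30          # video frames per second (assumption)
SAMPLE_RATE = 16000      # audio samples per second (assumption)

def audio_window_for_frame(frame_idx):
    """Return (start, end) audio-sample indices covering one video frame."""
    samples_per_frame = SAMPLE_RATE // FRAME_RATE  # 533 samples per frame
    start = frame_idx * samples_per_frame
    return start, start + samples_per_frame
```

Audio features computed over the returned sample range are then paired with the visual features of that frame before fusion.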
In early fusion methods, such audio and visual raw features are either directly fused by concatenation to train classifiers, or used to generate individual kernels which are then added up into a fused kernel for classification. In more recent fusion approaches, concept detectors are first trained over audio and visual features, respectively, and then fused to generate the final detection results. These fusion methods have shown promising results with performance improvements. However, global visual features are insufficient to capture object information, and the disjoint process of extracting audio and visual features limits the ability to generate joint audio-visual patterns that are useful for concept detection.
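The three fusion strategies described above can be contrasted in a toy sketch. All helper names and feature values below are hypothetical; the "classifiers" are stand-ins, and real systems would use trained models (e.g., kernel SVMs) in their place.

```python
# Toy sketch (hypothetical names and values) of the fusion strategies:
# feature-level concatenation, kernel addition, and late score fusion.

def early_fusion(audio_feat, visual_feat):
    """Early fusion: concatenate raw audio and visual feature vectors."""
    return audio_feat + visual_feat

def linear_kernel(x, y):
    """Simple dot-product kernel between two feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def fused_kernel(x_audio, x_visual, y_audio, y_visual):
    """Kernel-level fusion: per-modality kernels are summed."""
    return linear_kernel(x_audio, y_audio) + linear_kernel(x_visual, y_visual)

def late_fusion(audio_score, visual_score, w=0.5):
    """Late fusion: combine scores of separately trained detectors."""
    return w * audio_score + (1 - w) * visual_score

audio = [0.2, 0.9]          # e.g., statistics over MFCCs (illustrative)
visual = [0.5, 0.1, 0.4]    # e.g., global color/texture statistics

fused = early_fusion(audio, visual)   # a 5-dimensional joint vector
```

Note that in all three strategies the audio and visual features are still extracted independently, which is exactly the limitation noted above: no joint audio-visual pattern is formed before fusion.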
There are a number of recent works exploring audio-visual analysis for object detection and tracking. In the field of audio-visual speech recognition, visual features obtained by tracking the movement of lips and mouths have been combined with audio features to provide improved speech recognition. (See: K. Iwano, et al., “Audio-visual speech recognition using lip information extracted from side-face images,” EURASIP Journal on Audio, Speech, and Music Processing, 2007.)
In the field of audio-visual object detection and tracking, synchronized visual foreground objects and audio background sounds have been used for object detection. (M. Cristani, et al., “Audio-visual event recognition in surveillance video sequences,” IEEE Trans. Multimedia, Vol. 9, pp. 257-267, 2007.)
In the article “A graphical model for audiovisual object tracking,” published in IEEE Trans. Pattern Analysis and Machine Intelligence (Vol. 25, pp. 828-836, 2003), M. J. Beal, et al., show that by using multiple cameras to capture the object motion, the joint probabilistic model of both audio and visual signals can be used to improve object tracking.
In audio-visual localization, under the assumption that fast-moving pixels make loud sounds, temporal patterns of significant changes in the audio and visual signals are found, and the correlation between such audio and visual temporal patterns is maximized to locate sounding pixels. (For example, see: Z. Barzelay, et al., “Harmony in motion,” Proc. IEEE Conference Computer Vision and Pattern Recognition, pp. 1-8, 2007.) Such joint audio-visual object tracking methods have shown interesting results in analyzing videos in a controlled or simple environment where good foreground/background separation can be obtained. However, both object detection and tracking (especially for unconstrained objects) are known to be difficult in generic videos, which usually exhibit uneven lighting, clutter, occlusions, and complicated motions of multiple objects as well as the camera. In addition, the basic assumption of tight audio-visual synchronization at the object level may not be valid in practice. Multiple objects may make sounds together in a video without large movements, and sometimes the objects making sounds do not appear in the video at all.
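The correlation-maximization idea described above can be illustrated with a small sketch: each pixel's temporal change pattern is correlated with the audio change pattern, and the best-matching pixel is taken as "sounding." All names and the toy data are hypothetical, and this is only a one-pixel-at-a-time caricature of the cited approach.

```python
# Illustrative sketch (hypothetical names/data) of correlation-based
# audio-visual localization: find the pixel whose temporal change
# pattern best correlates with the audio change pattern.

def changes(signal):
    """Temporal change pattern: absolute frame-to-frame differences."""
    return [abs(b - a) for a, b in zip(signal, signal[1:])]

def correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def locate_sounding_pixel(pixel_tracks, audio_energy):
    """Return the index of the pixel best correlated with the audio."""
    audio_changes = changes(audio_energy)
    scores = [correlation(changes(track), audio_changes)
              for track in pixel_tracks]
    return max(range(len(scores)), key=scores.__getitem__)

# Two pixel intensity tracks over 5 frames: pixel 1 changes in step
# with bursts of audio energy, pixel 0 is static background.
pixels = [[10, 10, 10, 10, 10],
          [10, 50, 10, 10, 10]]
audio = [0.1, 0.9, 0.1, 0.1, 0.1]
```

The sketch also makes the failure modes plain: a static but sounding object, several simultaneous sounders, or an off-screen sound source all break the correlation assumption, as noted above.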