In order to summarize, browse and index a video, it is necessary to detect and identify structures and events in the video. The structures represent a syntactic composition of the video, and the events represent occurrences of semantic concepts in the video, which are consistent with the structures.
For example, at the lowest level, the structures can be indicated by repeated color schemes, texture patterns, or motion. At a mid level, the structures can be based on repeated camera movement, for example, a pan followed by a close-up. At a high level, the structures can relate to specific state transitions in the video. For example, in a golf video, a tee shot is usually followed by a pan that follows the ball flying through the air until it lands and rolls on the fairway.
The problem of identifying structure has two main parts: finding a description of the structure, i.e., a model, and locating segments in the video that match the description. Most prior art methods perform these two tasks in separate steps. The former is usually referred to as training, while the latter is called classification or segmentation.
One possible way to represent the structures is with hidden Markov models (HMMs), see Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, Vol. 77, pp. 257-285, February 1989. HMMs are stochastic models with a discrete state-space. HMMs work well for temporally correlated signals, such as videos. HMMs have been successfully applied in many different applications, such as speech recognition, handwriting recognition, and motion analysis in videos.
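As an illustration of how an HMM with a discrete state-space can segment a temporally correlated signal, the following is a minimal sketch of Viterbi decoding in Python. The two states ("play" and "break"), the two observation symbols (low and high motion), and all probability values are hypothetical and chosen only for the example; they are not taken from any of the cited works.

```python
import math

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for a discrete HMM."""
    n_states = len(start_p)

    def log(p):
        # Log space avoids numerical underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    # Best log-score of any path ending in each state at the first frame.
    score = [log(start_p[s]) + log(emit_p[s][obs[0]]) for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev, score, step = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log(trans_p[r][s]))
            step.append(best)
            score.append(prev[best] + log(trans_p[best][s]) + log(emit_p[s][o]))
        back.append(step)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for step in reversed(back):
        state = step[state]
        path.append(state)
    return path[::-1]

# Hypothetical model: state 0 = "play", state 1 = "break";
# observation 0 = low motion, observation 1 = high motion.
start_p = [0.5, 0.5]
trans_p = [[0.9, 0.1], [0.1, 0.9]]
emit_p = [[0.2, 0.8], [0.8, 0.2]]

path = viterbi([1, 1, 1, 0, 0, 0], start_p, trans_p, emit_p)
print(path)  # [0, 0, 0, 1, 1, 1]
```

The decoded state sequence assigns each frame to a state, so a change of state marks a segment boundary in the signal.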
For videos, different genres of TV programs have been distinguished with HMMs trained for each genre, see Wang et al., “Multimedia content analysis using both audio and visual clues,” IEEE Signal Processing Magazine, Vol. 17, pp. 12-36, November 2000. The high-level structure of soccer games, e.g., play versus break, was delineated with a pool of HMMs trained for each category, see Xie et al., “Structure analysis of soccer video with hidden Markov models,” Proc. International Conference on Acoustic, Speech and Signal Processing (ICASSP), 2002, and U.S. Pat. No. 5,828,809 issued to Chang et al. on Oct. 27, 1998, “Method and apparatus for extracting indexing information from digital video data,” where a football game is analyzed.
All of the above methods use what is known as supervised learning. There, important aspects and constraints of the structures and events, if not the structures and events themselves, are explicitly identified, and training videos are labeled according to these preconceived notions for training and classification. That methodology is adequate for specific video genres at a small scale. However, such methods cannot be extended to the more general case at a large scale.
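Classification with a pool of HMMs, one per category, can be sketched as follows: each model scores the observation sequence, and the category whose model gives the highest likelihood is selected. This is only an illustrative sketch and not the implementation of the cited works. The category names, the two observation symbols (0 for low motion, 1 for high motion), and all parameter values are assumptions made for the example, and the parameters are set by hand rather than trained.

```python
import math

def log_likelihood(obs, start_p, trans_p, emit_p):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    n = len(start_p)
    alpha = [start_p[s] * emit_p[s][obs[0]] for s in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transitions, then weight by the emission.
            alpha = [
                sum(alpha[r] * trans_p[r][s] for r in range(n)) * emit_p[s][obs[t]]
                for s in range(n)
            ]
        norm = sum(alpha)
        if norm == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        total += math.log(norm)
        alpha = [a / norm for a in alpha]  # rescale to avoid underflow
    return total

def classify(obs, models):
    """Pick the category whose HMM assigns the highest likelihood to obs."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))

# Hypothetical hand-set models: (start_p, trans_p, emit_p) per category.
models = {
    "news": ([0.8, 0.2], [[0.9, 0.1], [0.5, 0.5]], [[0.9, 0.1], [0.4, 0.6]]),
    "sports": ([0.3, 0.7], [[0.5, 0.5], [0.1, 0.9]], [[0.6, 0.4], [0.1, 0.9]]),
}

print(classify([1, 1, 0, 1, 1], models))  # sports
print(classify([0, 0, 0, 1, 0], models))  # news
```

In a supervised setting, each tuple of parameters would be estimated from training videos labeled with the corresponding category, which is the labeling requirement discussed in the next paragraph.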
Therefore, it is desired to use unsupervised learning techniques that can automatically determine salient structures and events in an unlabeled video, without prior knowledge of the genre of the video.
Unsupervised learning has been applied to gene motif discovery and data mining, see Xie et al., “Learning hierarchical hidden Markov models for video structure discovery,” Tech. Rep. 2002-006, ADVENT Group, Columbia University, December 2002, and U.S. patent application Ser. No. 20030103565, Xie et al., “Structural analysis of videos with hidden Markov models and dynamic programming,” filed Jun. 5, 2003.
Clustering techniques have been applied to key frames of shots to discover the story units in a TV drama. However, temporal dependencies of the video were not formally modeled, see Yeung et al., “Time-constrained clustering for segmentation of video into story units,” Proceedings International Conference on Pattern Recognition (ICPR), 1996.
Left-to-right HMMs have been stacked into a large HMM in order to model temporally evolving events in videos, see Clarkson et al., “Unsupervised clustering of ambulatory audio and video,” International Conference on Acoustic, Speech and Signal Processing (ICASSP), 1999, and Naphade et al., “Discovering recurrent events in video using unsupervised methods,” Proc. Intl. Conf. Image Processing, 2002.
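One way to picture the stacking of left-to-right HMMs into a large HMM is as a single transition matrix with one block per sub-model. The sketch below is an assumption made for illustration and is not taken from the cited works: each sub-model is a chain in which a state either repeats or advances, and the last state of each chain distributes its remaining probability mass uniformly over the first states of all chains. The function names and the `stay` parameter are hypothetical.

```python
def left_to_right(n_states, stay=0.8):
    """Transition rows of a left-to-right chain.

    Each state repeats with probability `stay` or advances to the next state.
    The last row sums to `stay` only; the remainder is unassigned exit mass.
    """
    rows = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        rows[i][i] = stay
        if i + 1 < n_states:
            rows[i][i + 1] = 1.0 - stay
    return rows

def stack(chains):
    """Combine several chains into one large transition matrix."""
    sizes = [len(c) for c in chains]
    offsets = [sum(sizes[:k]) for k in range(len(chains))]
    total = sum(sizes)
    big = [[0.0] * total for _ in range(total)]
    for off, chain in zip(offsets, chains):
        n = len(chain)
        # Copy the chain into its own block on the diagonal.
        for i in range(n):
            for j in range(n):
                big[off + i][off + j] = chain[i][j]
        # Assumed linking rule: spread the exit mass of the last state
        # uniformly over the entry states of all chains.
        exit_mass = 1.0 - sum(chain[n - 1])
        for entry in offsets:
            big[off + n - 1][entry] += exit_mass / len(offsets)
    return big

big = stack([left_to_right(3), left_to_right(2)])
for row in big:
    print([round(p, 2) for p in row])
```

Every row of the combined matrix sums to one, so the result is a valid transition matrix for a single HMM whose state sequence passes through one sub-model after another.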
Therefore, there is a need for a method for automatically determining a structural model of a video, and detecting semantic events in the video that are consistent with the model.