As digital video becomes more pervasive, efficient ways of analyzing the content of videos become necessary and important. Videos contain a huge amount of data and complexity that make the analysis very difficult. The first and most important analysis is to understand high-level structures of videos, which can provide the basis for further detailed analysis.
A number of analysis methods are known; see Yeung et al., “Video Browsing using Clustering and Scene Transitions on Compressed Sequences,” Multimedia Computing and Networking 1995, Vol. SPIE 2417, pp. 399-413, February 1995; Yeung et al., “Time-constrained Clustering for Segmentation of Video into Story Units,” ICPR, Vol. C, pp. 375-380, August 1996; Zhong et al., “Clustering Methods for Video Browsing and Annotation,” SPIE Conference on Storage and Retrieval for Image and Video Databases, Vol. 2670, February 1996; Chen et al., “ViBE: A New Paradigm for Video Database Browsing and Search,” Proc. IEEE Workshop on Content-Based Access of Image and Video Databases, 1998; and Gong et al., “Automatic Parsing of TV Soccer Programs,” Proceedings of the International Conference on Multimedia Computing and Systems (ICMCS), May 1995.
Gong et al. describe a system that uses domain knowledge and domain specific models to parse the structure of a soccer video. As in other prior art systems, the video is first segmented into shots. A shot is defined as all frames between a shutter opening and closing. Spatial features (playing field lines) extracted from frames within each shot are used to classify each shot into different categories, e.g., penalty area, midfield, corner area, corner kick, and shot at goal. That work relies heavily on accurate segmentation of the video into shots before features are extracted. That method also requires an uncompressed video.
Zhong et al. also describe a system for analyzing sports videos. That system detects boundaries of high-level semantic units, e.g., pitching in baseball and serving in tennis. Each semantic unit is further analyzed to extract interesting events, e.g., the number of strokes and the type of play, such as returns into the net or baseline returns in tennis. A color-based adaptive filtering method is applied to a key frame of each shot to detect specific views. Complex features, such as edges and moving objects, are used to verify and refine the detection results. That work also relies heavily on accurate segmentation of the video into shots prior to feature extraction. In short, both Gong and Zhong consider the video to be a concatenation of basic units, where each unit is a shot. The resolution of the feature analysis does not go finer than the shot level.
Thus, the prior art generally proceeds as follows. First, the video is segmented into shots. Then, key frames are extracted from each shot and grouped into scenes. A scene transition graph and a hierarchy tree are used to represent these data structures. The problem with those approaches is the mismatch between the low-level shot information and the high-level scene information. Those approaches only work when interesting content changes correspond to shot changes.
In many applications such as soccer videos, interesting events such as “plays” cannot be defined by shot changes. Each play may contain multiple shots that have similar color distributions. Transitions between plays are hard to find by a simple frame clustering based on just shot features.
In many situations where there is substantial camera motion, shot detection processes tend to segment erroneously, because this type of segmentation is based on low-level features and does not consider the domain specific high-level syntax and content model of the video. Thus, it is difficult to bridge the gap between low-level features and high-level features based on shot-level segmentation. Moreover, too much information is lost during the shot segmentation process.
Videos in different domains have very different characteristics and structures. Domain knowledge can greatly facilitate the analysis process. For example, in sports videos, there are usually a fixed number of cameras, views, camera control rules, and a transition syntax imposed by the rules of the game, e.g., play-by-play in soccer, serve-by-serve in tennis, and inning-by-inning in baseball.
Tan et al. in “Rapid estimation of camera motion from compressed video with application to video annotation,” IEEE Trans. on Circuits and Systems for Video Technology, 1999, and Zhang et al. in “Automatic Parsing and Indexing of News Video,” Multimedia Systems, Vol. 2, pp. 256-266, 1995, described video analysis for news and baseball. But very few systems consider high-level structure in more complex videos such as a soccer video.
The problem is that a soccer game has a relatively loose structure compared to other videos such as news and baseball. Except for the play-by-play structure, the content flow can be quite unpredictable, and events can happen at random. A video of a soccer game also contains a lot of motion and many view changes. Solving this problem is useful for automatic content filtering for soccer fans and professionals.
The problem is more interesting in the broader background of video structure analysis and content understanding. With respect to structure, the primary concern is the temporal sequence of high-level video states, for example, the game states play and break in a soccer game. It is desired to automatically parse a continuous video stream into an alternating sequence of these two game states.
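For illustration only, the desired output can be pictured as a list of state segments. The following minimal sketch assumes that some classifier, not specified here, has already assigned a state label to every frame; the function name `labels_to_segments` and the half-open frame ranges are hypothetical choices made for this example.

```python
from itertools import groupby


def labels_to_segments(labels):
    """Collapse per-frame state labels into (state, start, end) runs.

    `start` is inclusive and `end` is exclusive, both in frame indices.
    The per-frame labels are assumed to come from some upstream classifier.
    """
    segments = []
    start = 0
    for state, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append((state, start, start + length))
        start += length
    return segments


# Hypothetical per-frame labels for a nine-frame clip.
frame_labels = ["play"] * 3 + ["break"] * 2 + ["play"] * 4
segments = labels_to_segments(frame_labels)
```

Because consecutive identical labels are merged into a single run, adjacent segments always differ in state, so with two game states the result necessarily alternates between play and break.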
Prior art structural analysis methods mostly focus on the detection of domain specific events. Parsing structures separately from event detection has the following advantages. Typically, no more than 60% of the content corresponds to play. Thus, one could achieve significant information reduction by segmenting out the portions of the video that correspond to break. Also, content characteristics in play and break are different; thus, one could optimize event detectors with such prior state knowledge.
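The information reduction can be made concrete with a small calculation. The sketch below assumes a hypothetical segment representation of (state, start, end) tuples with an exclusive end; the segment values are invented for illustration and are not measurements.

```python
def play_fraction(segments):
    """Return the fraction of frames that belong to 'play' segments.

    Each segment is a (state, start, end) tuple with `end` exclusive.
    """
    total = sum(end - start for _, start, end in segments)
    if total == 0:
        return 0.0
    play = sum(end - start for state, start, end in segments if state == "play")
    return play / total


# Illustrative values only: 60 play frames out of 100.
example = [("play", 0, 40), ("break", 40, 80), ("play", 80, 100)]
fraction = play_fraction(example)
```

In this invented example the play fraction is 0.6, so discarding the break segments would remove 40% of the frames before any event detector is run.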
Related art structural analysis work pertains mostly to sports video analysis, including soccer and various other games, and to general video segmentation. For soccer video, prior work has addressed shot classification, see Gong above; scene reconstruction, see Yow et al., “Analysis and Presentation of Soccer Highlights from Digital Video,” Proc. ACCV, December 1995; and rule-based semantic classification, see Tovinkere et al., “Detecting Semantic Events in Soccer Games: Towards A Complete Solution,” Proc. ICME 2001, August 2001.
For other sports video, supervised learning has been used to recognize canonical views such as baseball pitching and tennis serve, see Zhong et al., “Structure Analysis of Sports Video Using Domain Models,” Proc. ICME 2001, August 2001.
Hidden Markov models (HMMs) have been used for general video classification and for distinguishing different types of programs, such as news, commercials, etc., see Huang et al., “Joint video scene segmentation and classification based on hidden Markov model,” Proc. ICME 2000, pp. 1551-1554, Vol. 3, July 2000.
Heuristic rules based on domain specific features and dominant color ratios have also been used to segment play and break, see Xu et al., “Algorithms and system for segmentation and structure analysis in soccer video,” Proc. ICME 2001, August 2001, and U.S. patent application Ser. No. 09/839,924, “Method and System for High-Level Structure Analysis and Event Detection in Domain Specific Videos,” filed by Xu et al. on Apr. 20, 2001. However, variations in these features are hard to quantify with explicit low-level decision rules.
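To show what an explicit low-level decision rule of this kind might look like, the following is a minimal sketch and not the method of the cited references. It assumes that each frame is given as a list of pixel hues in degrees, that the dominant (field) color lies in a fixed hue range, and that a high dominant color ratio indicates play; the range 60 to 180 and the threshold 0.5 are hypothetical values chosen only for this example.

```python
def dominant_color_ratio(hues, lo=60, hi=180):
    """Fraction of pixels whose hue (in degrees) falls within [lo, hi].

    The hue range is an assumed stand-in for the dominant field color.
    """
    if not hues:
        return 0.0
    return sum(1 for h in hues if lo <= h <= hi) / len(hues)


def classify_frame(hues, threshold=0.5):
    """Label a frame with a fixed-threshold rule (assumed: high ratio means play)."""
    return "play" if dominant_color_ratio(hues) >= threshold else "break"


# Two toy four-pixel frames: mostly field-colored, then mostly not.
wide_view = [100, 110, 120, 10]
close_up = [10, 20, 30, 100]
labels = [classify_frame(wide_view), classify_frame(close_up)]
```

A rule of this form depends entirely on the fixed hue range and threshold, which is where the difficulty noted above arises: changes in lighting, field condition, or camera view shift the ratio, and a single hand-set threshold does not follow those variations.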
Therefore, there is a need for a framework in which all the information in the low-level features of a video is retained, and the feature sequences are better represented. It then becomes possible to incorporate a domain specific syntax and content models to identify high-level structure, enabling video classification and segmentation.