The invention relates generally to the field of video analysis, and more particularly to analyzing domain-specific videos.
As digital video becomes more pervasive, efficient ways of analyzing the content of videos become necessary and important. Videos contain a huge amount of data and complexity that make the analysis very difficult. The first and most important analysis is to understand the structure of the video, which can provide the basis for further detailed analysis.
A number of analysis methods are known, see M. M. Yeung, B. L. Yeo, W. Wolf, and B. Liu, "Video Browsing using Clustering and Scene Transitions on Compressed Sequences," Multimedia Computing and Networking 1995, Vol. SPIE 2417, pp. 399-413, February 1995; M. J. Yeung and B. L. Yeo, "Time-constrained Clustering for Segmentation of Video into Story Units," ICPR, Vol. C, pp. 375-380, August 1996; D. Zhong, H. J. Zhang and S. F. Chang, "Clustering Methods for Video Browsing and Annotation," SPIE Conference on Storage and Retrieval for Image and Video Databases, Vol. 2670, February 1996; J. Y. Chen, C. Taskiran, E. J. Delp and C. A. Bouman, "ViBE: A New Paradigm for Video Database Browsing and Search," in Proc. IEEE Workshop on Content-Based Access of Image and Video Databases, 1998; and Gong, Sin, Chuan, Zhang and Sakauchi, "Automatic Parsing of TV Soccer Programs," Proceedings of the International Conference on Multimedia Computing and Systems (ICMCS), May 1995.
Gong et al. describe a system that uses domain knowledge and domain-specific models in parsing the structure of a soccer video. As in other prior art systems, a video is first segmented into shots. Video features extracted from frames within each shot are used to classify each shot into different categories, e.g., penalty area, midfield, corner area, corner kick, and shot at goal. Note that this work relies heavily on accurate segmentation of the video into shots before features are extracted.
Zhong et al. also describe a system for analyzing sport videos. The system detects boundaries of high-level semantic units, e.g., pitching in baseball and serving in tennis. Each semantic unit is further analyzed to extract interesting events, e.g., the number of strokes and the type of play (returns into the net or baseline returns in tennis). A color-based adaptive filtering method is applied to a key frame of each shot to detect specific views. Complex features, such as edges and moving objects, are used to verify and refine the detection results. Note that this work also relies heavily on accurate segmentation of the video into shots prior to feature extraction. In short, both Gong and Zhong consider the video to be a concatenation of basic units, where each unit is a shot. The resolution of the feature analysis does not go finer than the shot level.
Thus, generally the prior art is as follows: first the video is segmented into shots. Then, key frames are extracted from each shot, and grouped into scenes. The scene transition graph and hierarchy tree are used to represent these data structures. The problem with those approaches is the mismatch between the low-level shot information, and the high-level scene information. They can only work when interesting content changes correspond to the shot changes. In many applications such as soccer videos, interesting events such as "plays" cannot be defined by shot changes. Each play may contain multiple shots that have similar color distributions. Transitions between plays are hard to find by simple clustering of shot features.
In many situations, when the camera has a lot of motion, shot detection processes tend to produce many false alarms because this type of segmentation is based on low-level features and does not consider the domain-specific syntax and content model of the video. Thus, it is difficult to bridge the gap between low-level features and high-level features based on shot-level segmentation. Moreover, too much information is lost during the shot segmentation process.
Videos in different domains have very different characteristics and structures. Domain knowledge can greatly facilitate the analysis process. For example, in sports videos, there are usually a fixed number of cameras, views, camera control rules, and transition syntax imposed by the rules of the game, e.g., play-by-play in soccer, serve-by-serve in tennis, and inning-by-inning in baseball.
Y. P. Tan, D. D. Saur, S. R. Kulkarni and P. J. Ramadge in "Rapid estimation of camera motion from compressed video with application to video annotation," IEEE Trans. on Circuits and Systems for Video Technology, 1999, and H. J. Zhang, S. Y. Tan, S. W. Smoliar and Y. H. Gong, in "Automatic Parsing and Indexing of News Video," Multimedia Systems, Vol. 2, pp. 256-266, 1995, describe video analysis for news and baseball. But very few systems consider high-level structure in more complex videos such as a soccer video.
The problem is that a soccer game has a relatively loose structure compared to other videos such as news and baseball. Except for the play-by-play structure, the content flow can be quite unpredictable. There is a lot of motion and there are many view changes in soccer.
Therefore, there is a need for a framework in which all of the low-level feature information of a video is retained and the feature sequences are better represented. It then becomes possible to incorporate a domain-specific syntax, a content model, and high-level structure to enable event detection and statistical analysis.
The invention provides a general framework for video structure discovery and content analysis. In the method and system according to the invention, frame-based low-level features are extracted from a video. Each frame is represented either by the values of the features or by labels converted from those values, so that the video is converted to multiple label sequences or real-number sequences. Each such sequence is associated with one of the extracted low-level features. The feature sequences are analyzed together to extract high-level semantic features.
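The conversion of a video into per-feature sequences can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the feature names (`grass`, `motion`), the threshold values, and the function names are hypothetical, and the per-frame feature values are assumed to have been extracted already.

```python
from bisect import bisect_right


def to_label_sequence(values, thresholds):
    """Map each per-frame real value to an integer label.

    `thresholds` must be sorted in ascending order; a value below the
    first threshold gets label 0, and so on (assumed quantization rule).
    """
    return [bisect_right(thresholds, v) for v in values]


def video_to_sequences(frames, thresholds_by_feature):
    """Build one real-number sequence and one label sequence per feature.

    `frames` is a list of dicts, one per frame, mapping a feature name
    to its measured value for that frame.
    """
    sequences = {}
    for name, thresholds in thresholds_by_feature.items():
        values = [frame[name] for frame in frames]
        sequences[name] = {
            "values": values,
            "labels": to_label_sequence(values, thresholds),
        }
    return sequences


# Hypothetical per-frame measurements for a three-frame video.
frames = [
    {"grass": 0.1, "motion": 2.0},
    {"grass": 0.5, "motion": 0.2},
    {"grass": 0.9, "motion": 0.0},
]
thresholds = {"grass": [0.3, 0.7], "motion": [1.0]}
sequences = video_to_sequences(frames, thresholds)
```

Keeping one sequence per feature, indexed by frame, preserves the frame-level resolution that shot-based approaches discard, and lets several sequences be examined together.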
The invention can be applied to videos of sport activities, such as soccer games, to index and summarize the video. The invention uses a distinctive feature to capture the high-level structure of the soccer video, e.g., activity boundaries, and uses a unique feature, e.g., grass orientation, together with camera motion to detect interesting events such as game strategy. The unique aspects of the system include compressed-domain feature extraction for real-time performance, use of domain-specific features for detecting high-level events, and integration of multiple features for content understanding.
Particularly, the system and method according to the invention analyze a compressed video including a sequence of frames. The amount of a dominant feature in each frame of the compressed video is measured. A label is associated with each frame according to the measured amount of the dominant feature. Views in the video are identified according to the labels, and the video is segmented into actions according to the views. The video can then be analyzed according to the actions to determine significant events in the video.
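The chain from measured amounts to labels, views, and actions can be sketched as follows. This is only one possible reading under stated assumptions: the label names (`high`, `mid`, `low`), the thresholds, the rule that a view is a run of consecutive frames with the same label, and the rule that a new action begins at each `high` view are all hypothetical and are not taken from the text.

```python
from itertools import groupby


def label_frame(amount, low=0.2, high=0.6):
    """Assign a label from the measured amount (thresholds are assumed)."""
    if amount >= high:
        return "high"
    if amount >= low:
        return "mid"
    return "low"


def identify_views(labels):
    """Group consecutive frames with the same label into views.

    Returns (label, first_frame, end_frame) tuples, end exclusive.
    """
    views = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        views.append((label, start, start + length))
        start += length
    return views


def segment_actions(views, start_view="high"):
    """Split the view list into actions.

    Assumed rule: every view labeled `start_view` opens a new action.
    """
    actions = []
    for view in views:
        if view[0] == start_view or not actions:
            actions.append([view])
        else:
            actions[-1].append(view)
    return actions


# Hypothetical per-frame amounts of the dominant feature.
amounts = [0.8, 0.7, 0.4, 0.1, 0.9, 0.9, 0.3]
labels = [label_frame(a) for a in amounts]
views = identify_views(labels)
actions = segment_actions(views)
```

Because every step operates on the full frame-level label sequence, no prior shot segmentation is needed; a practical system would probably also smooth very short runs before treating them as views.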
The dominant feature, labels, views, action, and significant events are stored in a domain knowledge database. In one embodiment, the dominant feature is color, and a color histogram is constructed to identify the dominant feature.
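For the color embodiment, a minimal sketch of finding a dominant color with a histogram is given below. It assumes decoded RGB pixels in the range 0 to 255 and a coarse uniform quantization of 4 bins per channel; both are assumptions for illustration, since the text describes compressed-domain feature extraction and does not specify the histogram layout. All function names are hypothetical.

```python
from collections import Counter


def quantize(pixel, bins_per_channel=4):
    """Map an (r, g, b) pixel to a coarse histogram bin (assumed layout)."""
    step = 256 // bins_per_channel
    r, g, b = pixel
    return (r // step, g // step, b // step)


def color_histogram(pixels, bins_per_channel=4):
    """Count how many pixels fall into each color bin."""
    return Counter(quantize(p, bins_per_channel) for p in pixels)


def dominant_color(pixels, bins_per_channel=4):
    """Return the most populated bin and the fraction of pixels in it."""
    hist = color_histogram(pixels, bins_per_channel)
    color_bin, count = hist.most_common(1)[0]
    return color_bin, count / len(pixels)


# Hypothetical ten-pixel frame: mostly green, some white and dark pixels.
pixels = [(30, 200, 40)] * 6 + [(250, 250, 250)] * 2 + [(10, 10, 10)] * 2
color_bin, fraction = dominant_color(pixels)
```

The returned fraction can serve as the measured amount of the dominant feature for a frame, which is the quantity the labeling step consumes.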