1. Field of the Invention
This invention relates to systems and methods for selecting subsequences of video sequences, such as key frames, and more particularly to methods for selecting the subsequence of video frames to optimize a particular criterion of the video.
2. Background
Key frames are typically defined to be an unordered subset of video frames that represent the visual content of a video. Key frames are useful in video indexing, summarization, content retrieval, and browsing. There are at least three main approaches to video key frame selection. In the first, video key frame selection is based on video segmentation, and is often used for highly edited commercial videos (see, e.g., Tat-Seng Chua and Li-Quan Ruan, “A Video Retrieval and Sequencing System,” ACM Transactions on Multimedia Systems, pages 373-407, 1995). A drawback of this first approach is that the results depend on the accuracy of video segmentation (as discussed in greater detail in, e.g., J. Boreczky and L. Rowe, “Comparison of Video Shot Boundary Detection Techniques,” Storage and Retrieval for Still Image and Video Databases, pages 170-179, 1996). Therefore, it is not an optimal technique for semi-edited (e.g., instructional) videos, unedited (e.g., home) videos, or extended single-shot (e.g., surveillance) videos.
The second approach uses clustering techniques based on a definition of “far enough” frames (see, e.g., Andreas Girgensohn and John Boreczky, “Time-Constrained Keyframe Selection Technique,” IEEE International Conference on Multimedia Computing and Systems, pages 756-761, 1999; M. Yeung and B. Liu, “Efficient Matching and Clustering of Video Shots,” Proceedings of the International Conference on Image Processing, pages 338-341, 1995; M. Yeung and B. L. Yeo, “Time-Constrained Clustering for Segmentation of Video into Story Units,” International Conference on Pattern Recognition, pages 375-380, 1996; and Yueting Zhuang, Yong Rui, Thomas S. Huang, and Sharad Mehrotra, “Adaptive Key Frame Extraction Using Unsupervised Clustering,” IEEE International Conference on Image Processing, pages 866-870, 1998). An inherent drawback of this second approach, however, is the need to choose appropriate thresholds. Although adaptive clustering methods may adjust the threshold to produce a predetermined number of key frames, this iterative search may make these methods computationally expensive.
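By way of illustration only, the sensitivity of threshold-based clustering to the chosen threshold can be sketched as follows. The sketch is not taken from any of the cited references. It assumes that each frame is represented by a numeric feature vector, that Euclidean distance stands in for the “far enough” criterion, and that a simple leader-style clustering is used; the function name `threshold_key_frames` and its parameters are hypothetical.

```python
import math


def threshold_key_frames(features, threshold):
    """Return indices of frames chosen as key frames.

    Assumption: a frame becomes a key frame when its Euclidean distance
    to every previously chosen key frame exceeds `threshold`.
    """
    key_indices = []
    for i, feature in enumerate(features):
        # "Far enough" test against all key frames selected so far.
        if all(math.dist(feature, features[k]) > threshold for k in key_indices):
            key_indices.append(i)
    return key_indices


# Hypothetical one-dimensional features for five frames.
frames = [(0.0,), (0.1,), (1.0,), (1.05,), (2.0,)]
for t in (0.01, 0.5, 5.0):
    print(t, threshold_key_frames(frames, t))
```

In this sketch the number of key frames returned depends entirely on the threshold, so producing a predetermined number of key frames would require repeating the selection with different thresholds, which is the iterative search referred to above.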
The third approach converts a key frame selection problem to a problem of searching for the minimum cover of a set of key frames, based on the definition of a semi-Hausdorff distance function (see, e.g., H. S. Chang, S. Sull, and Sang Uk Lee, “Efficient Video Indexing Scheme for Content-based Retrieval,” IEEE Trans. on Circuits and Systems for Video Technology, pages 1269-1279, December 1999). But this search can be shown to be NP-hard, i.e., an algorithm for solving it can be translated into one for solving any other nondeterministic polynomial time problem, and the O(n²) greedy approximation algorithm for it is computationally expensive. Additionally, in both this and the previous approaches, the frames are chosen without regard to their temporal order, although such temporal relations may be very important in video summarization, streaming, and compression.
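By way of illustration only, a greedy approximation to a minimum-cover formulation can be sketched as follows. The sketch is not the method of Chang et al. It assumes that each frame is represented by a numeric feature vector and that a frame “covers” every frame within a fixed Euclidean radius, which is used here in place of the semi-Hausdorff distance; the function name `greedy_cover_key_frames` and the `radius` parameter are hypothetical.

```python
import math


def greedy_cover_key_frames(features, radius):
    """Greedily pick frames until every frame is covered.

    Assumption: frame i covers frame j when their Euclidean distance
    is at most `radius`. Building the cover sets alone requires a
    distance computation for every pair of frames.
    """
    n = len(features)
    covers = [
        {j for j in range(n) if math.dist(features[i], features[j]) <= radius}
        for i in range(n)
    ]
    uncovered = set(range(n))
    chosen = []
    while uncovered:
        # Pick the frame that covers the most still-uncovered frames.
        best = max(range(n), key=lambda i: len(covers[i] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen


# Hypothetical one-dimensional features for five frames.
frames = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,)]
print(greedy_cover_key_frames(frames, 0.15))
```

The frames are returned in the order in which the greedy step selects them, not in temporal order, which reflects the limitation noted above.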
Another aspect of prior research on key frames concerns the level of frame sampling. Sparse sampling, which is the selection of about one frame per shot or per scene, is usually used in video content summarization and indexing. Dense sampling, which chooses many more than one frame per shot or per scene, is more useful for video streaming, particularly in network environments where frame-dropping decisions are made in real time according to dynamic changes in network bandwidth or user requirements. Most prior research in this area appears to concentrate on sparse video frame sampling (see, e.g., Edoardo Ardizzone and Mohand-Said Hacid, “A Semantic Modeling Approach for Video Retrieval by Content,” IEEE International Conference on Multimedia Computing and Systems, pages 158-162, 1999; Madirakshi Das and Shih-Ping Liou, “A New Hybrid Approach to Video Organization for Content-Based Indexing,” IEEE International Conference on Multimedia Computing and Systems, 1998; F. Idris and S. Panchanathan, “Review of Image and Video Indexing Techniques,” Journal of Visual Communication and Image Representation, pages 146-166, June 1997; Jia-Ling Koh, Chin-Sung Lee, and Arbee L. P. Chen, “Semantic Video Model for Content-based Retrieval,” IEEE International Conference on Multimedia Computing and Systems, pages 472-478, 1999; and M. K. Mandal, F. Idris, and S. Panchanathan, “A Critical Evaluation of Image and Video Indexing Techniques in Compressed Domain,” Image and Vision Computing, pages 513-529, 1999). Although some other methods, for example the minimum-cover method (see, e.g., H. S. Chang, S. Sull, and Sang Uk Lee, “Efficient Video Indexing Scheme for Content-based Retrieval,” IEEE Trans. on Circuits and Systems for Video Technology, pages 1269-1279, December 1999), can be applied to dense sampling, their complexity generally makes them unsuitable for use with extended or real-time videos.
Accordingly, there is a need in the art for a method for selecting a subsequence of video frames from a sequence of video frames which overcomes the drawbacks of the prior art as discussed above.