In the prior art, video summarization and adaptive playback of videos are often perceived as one and the same. Therefore, to distinguish the invention, the following definitions are provided.
Video Summarization
Video summarization is a process that generates the gist or main points of video content in a reduced and compact form. In general, video summaries are generated by selecting a subset of frames from the original video to produce a summary video that is shorter than the original video. A summary can include selected still frames and/or short selected continuous sequences to convey the essence of the original video. The summary can be presented in the order of the selected frames, as a story board, or as a mosaic. It is also possible to summarize a video textually or verbally.
In general, video summarization is based on user input and video content. The analysis of the content can be based on low-level features such as texture, motion, color, contrast, luminance, etc., and high-level semantic features such as genre, dramatic intensity, humor, action level, beauty, lyricism, etc.
Adaptive Playback
Adaptive playback is a process that presents a video in a time-warped manner. In the most general sense, the video play speed is selectively increased or decreased by changing the frame rate, or by selectively dropping frames to increase the play speed, or adding frames to decrease the play speed. If the adaptive playback of a video is shorter than the original video and the playback conveys the essence of the content of the video, then it can be considered as a type of summary. However, there are cases where the adaptive playback of a video is longer than the original video. For example, if the video contains a complex scene or a lot of motion, then playing the video at a slower speed can provide the viewer with a better sense of the details of the video. That type of adaptive playback is an amplification or augmentation of the video, rather than a summary.
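By way of illustration only, the time-warping described above can be sketched as a mapping from output frames to source frames. The function name warp_frame_indices and the per-frame speed list are assumptions made for this sketch and do not appear in the cited art. A speed above 1 drops frames, and a speed below 1 repeats frames.

```python
def warp_frame_indices(speeds):
    """Map a per-frame play speed to the source frame indices to display.

    speeds[i] is a hypothetical speed factor for source frame i:
    values above 1 skip ahead (frames are dropped), values below 1
    linger (frames are repeated).
    """
    indices = []
    position = 0.0  # fractional position in the source video
    while position < len(speeds):
        current = int(position)
        if speeds[current] <= 0:
            raise ValueError("play speed must be positive")
        indices.append(current)
        position += speeds[current]
    return indices


if __name__ == "__main__":
    # Fast through the first four frames, slow through the last two.
    print(warp_frame_indices([2.0, 2.0, 2.0, 2.0, 0.5, 0.5]))
```

When every speed is below 1, the output is longer than the input, which corresponds to the amplification case rather than to a summary.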
The main purpose of a summary is to output the essence of the video in a shorter amount of time, and therefore the process is basically grounded on content analysis.
In contrast, the main purpose of adaptive playback is to improve how the human visual system perceives the video, where the improvement is based on the video's visual complexity. Therefore, the adaptation is based more on psycho-physical characteristics of the video than on content, and the process is more of a presentation technique than a content analysis method.
Automatic video summarization methods are well known, see S. Pfeiffer et al. in “Abstracting Digital Movies Automatically,” J. Visual Comm. Image Representation, vol. 7, no. 4, pp. 345-353, December 1996, and Hanjalic et al. in “An Integrated Scheme for Automated Video Abstraction Based on Unsupervised Cluster-Validity Analysis,” IEEE Trans. On Circuits and Systems for Video Technology, Vol. 9, No. 8, December 1999.
Most known video summarization methods focus on color-based summarization. Pfeiffer et al. also use motion, in combination with other features, to generate video summaries. However, their approach merely uses a weighted combination that overlooks possible correlation between the combined features. While color descriptors are reliable, they do not capture the motion characteristics of video content. Motion descriptors, on the other hand, tend to be more sensitive to noise than color descriptors. The level of motion activity in a video can be a measure of how much the scene acquired by the video is changing. Therefore, the motion activity can be considered a measure of the “summarizability” of the video. For instance, a high speed car chase will certainly have many more “changes” in it compared to a scene of a news-caster, and thus, the high speed car chase scene will require more resources for a visual summary than would the news-caster scene.
In some sense, summarization can be viewed as a reduction in redundancy. This can be done by clustering similar video frames, and selecting representative frames from the clusters, see Yeung et al., “Efficient matching and clustering of video shots,” ICIP '95, pp. 338-341, 1995, Zhong et al., “Clustering methods for video browsing and annotation,” SPIE Storage and Retrieval for Image and Video Databases IV, pp. 239-246, 1996, and Ferman et al., “Efficient filtering and clustering methods for temporal video segmentation and visual summarization,” J. Vis. Commun. & Image Rep., 9:336-351, 1998.
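The clustering idea can be illustrated with a minimal sketch. It is not the method of any of the cited references. It assumes each frame is already reduced to a numeric feature vector (for example a color histogram), uses a simple greedy threshold clustering chosen here for brevity, and takes the member nearest the cluster mean as the representative. The names cluster_frames and representative_frames are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_frames(features, threshold):
    """Greedy clustering: a frame joins the first cluster whose first
    member is within `threshold`, otherwise it starts a new cluster."""
    clusters = []
    for index, feature in enumerate(features):
        for cluster in clusters:
            if distance(feature, features[cluster[0]]) <= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters


def representative_frames(features, threshold):
    """Return one frame index per cluster: the member nearest the mean."""
    representatives = []
    for cluster in cluster_frames(features, threshold):
        dims = len(features[cluster[0]])
        mean = [sum(features[i][d] for i in cluster) / len(cluster)
                for d in range(dims)]
        representatives.append(
            min(cluster, key=lambda i: distance(features[i], mean)))
    return sorted(representatives)


if __name__ == "__main__":
    frames = [[0, 0], [1, 0], [2, 0], [10, 10]]
    print(representative_frames(frames, threshold=3))
```

The threshold controls how aggressively redundant frames are merged: a larger value yields fewer clusters and therefore a shorter summary.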
In another approach, changes in the video content are measured over time, and representative frames are then selected whenever the changes become significant, see DeMenthon et al., “Video Summarization by Curve Simplification,” ACM Multimedia 98, pp. 211-218, September 1998, and Divakaran et al., “Motion Activity based extraction of key frames from video shots,” Proc. IEEE Int'l Conf. on Image Processing, September 2002.
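As a rough sketch of this change-driven selection, and not a reproduction of the cited curve simplification or motion activity methods, the following accumulates the frame-to-frame feature difference and emits a representative frame each time the accumulated change reaches a threshold. The per-frame feature vectors, the sum-of-absolute-differences measure, and the name select_key_frames are assumptions made for the example.

```python
def select_key_frames(features, threshold):
    """Pick frame 0, then each frame at which the change accumulated
    since the previous key frame reaches `threshold`."""
    if not features:
        return []
    key_frames = [0]
    accumulated = 0.0
    for index in range(1, len(features)):
        # Sum of absolute differences between consecutive feature vectors.
        accumulated += sum(
            abs(x - y) for x, y in zip(features[index], features[index - 1]))
        if accumulated >= threshold:
            key_frames.append(index)
            accumulated = 0.0  # start measuring change from this key frame
    return key_frames


if __name__ == "__main__":
    features = [[0], [1], [2], [2], [2], [5]]
    print(select_key_frames(features, threshold=2))
```

Static stretches of video contribute no change and so produce no key frames, while rapidly changing stretches produce several.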
In yet another approach, a significance measure is assigned to the different parts of the video. Subsequently, less significant parts can be filtered, see Ma et al., “A User Attention Model for Video Summarization,” ACM Multimedia '02, pp. 533-542, December 2002.
An adaptive video summarization method is described by Divakaran et al., “Video summarization using descriptors of motion activity,” Journal of Electronic Imaging, Vol. 10, No. 4, October 2001, and Peker et al., “Constant pace skimming and temporal sub-sampling of video using motion activity,” Proc. IEEE Int'l Conf. on Image Processing, October 2001, U.S. patent application Ser. No. 09/715,639, filed by Peker et al., on Nov. 17, 2000, and U.S. patent application Ser. No. 09/654,364 filed Aug. 9, 2000 by Divakaran et al., incorporated herein by reference. There, a motion activity descriptor is used to generate a summary that has a constant ‘pace’. The motion activity descriptor is an average magnitude of the motion vectors in an MPEG compressed video.
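For illustration, the descriptor stated above, an average magnitude of motion vectors, can be computed as follows. The second function is only one plausible reading of a constant ‘pace’, assumed here for the sketch and not taken from the cited references: the play speed is chosen inversely proportional to the motion activity, so that the product of speed and activity stays near a target value. The names motion_activity and constant_pace_speed and the parameters target_activity and max_speed are hypothetical.

```python
import math


def motion_activity(motion_vectors):
    """Average magnitude of the (dx, dy) motion vectors of one frame."""
    if not motion_vectors:
        return 0.0
    total = sum(math.hypot(dx, dy) for dx, dy in motion_vectors)
    return total / len(motion_vectors)


def constant_pace_speed(activity, target_activity, max_speed=8.0):
    """Assumed rule: speed * activity is held at target_activity,
    capped at max_speed so that still segments are not skipped entirely."""
    if activity <= 0:
        return max_speed
    return min(max_speed, target_activity / activity)


if __name__ == "__main__":
    vectors = [(3, 4), (0, 0)]  # two macroblock motion vectors
    activity = motion_activity(vectors)
    print(activity, constant_pace_speed(activity, target_activity=5.0))
```

Under this assumed rule, low-activity segments are played faster and high-activity segments slower, which evens out the apparent pace of the result.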
The prior art video processing methods have mainly focused on providing comprehensible summaries considering the content. However, different methods are required to adaptively play videos at different speeds according to visual complexity. These methods should consider how fast the human eye can follow the flow of action as a function of spatial and temporal complexity.