1. Field of the Invention
The subject invention relates to a system and method for video summarization, and more specifically to a system for segmenting and classifying data from a video in order to create a summary video.
2. History
Video cameras are becoming more prevalent as they become less expensive and are embedded in different technologies, including cell phones and digital cameras. As evidenced by the popularity of posting videos on the Internet, it has become easy for people to create and publish videos. As the amount of available video increases, so does the amount of unstructured video, that is, video that has not been reviewed or classified. When a user needs to classify a large amount of unknown or unstructured video, the process is time-consuming. The user could simply view all of the videos manually, but it takes time to load and watch each video in real time.
Additionally, raw video footage contains more footage than is desirable for final viewing or for assessing the content of the video. The excess may take the form of “bad shots” or too much of a scene with no action. Looking through a video to quickly assess whether it contains a shot of interest or covers a topic of interest is tedious. If the shot of interest is short, it may be missed when the user scrubs through the video, as the user may scrub quickly over what appears to be redundant video. Actions due to object movement or camera movement may also be missed when scrubbing too quickly.
There are a number of video summary systems that have been described in the literature.
In “Dynamic video summarization and visualization,” Proceedings of the seventh ACM international conference on Multimedia, J. Nam and A. H. Tewfik, vol. 2, pages 53-56, 1999, Nam and Tewfik describe creating extractive video summaries using adaptive nonlinear sampling and audio analysis to identify “two semantically meaningful events; emotional dialogue and violent featured action” to include in a summary. Their method is limited to specific types of videos and would not be appropriate for other genres, such as the majority of documentaries or educational videos.
Video summarization methods are also described in A. Divakaran, K. A. Peker, S. F. Chang, R. Radhakrishnan, and L. Xie, “Video mining: Pattern discovery versus pattern recognition,” in Proc. IEEE International Conference on Image Processing (ICIP), volume 4, pages 2379-2382, October 2004, and in A. Divakaran, K. A. Peker, R. Radhakrishnan, Z. Xiong, and R. Cabasson, Video Mining, chapter “Video Summarization Using MPEG-7 Motion Activity and Audio Descriptors,” Kluwer Academic Publishers, 2003. These methods are genre dependent and not generally applicable to a variety of videos, such as travel videos, where the amount of activity does not vary widely and the speech may be primarily narration.
In Shingo Uchihashi and Jonathan Foote, “Summarizing video using a shot importance measure and a frame-packing algorithm,” in Proc. IEEE ICASSP, volume 6, pages 3041-3044, 1999, Uchihashi and Foote describe a method of measuring shot importance for creating video summaries. They do not analyze videos for camera motion, and so identification of shot repetition may not work well, as the similarity between repeated shots with camera motion is generally lower than the similarity between repeated shots taken with a static camera.
IBM's VideoSue, by Belle Tseng, Ching-Yung Lin, and John R. Smith, “Video summarization and personalization for pervasive mobile devices,” in Proc. SPIE Electronic Imaging 2002—Storage and Retrieval for Media Databases, 2002, is a summarization system that is part of a video semantic summarization system. However, its methods either require user annotation or use a single sub-sampling rate to trivially create a summary without accounting for varying content.
In N. Peyrard and P. Bouthemy, “Motion-based selection of relevant video segments for video summarization,” Multimedia Tools and Applications, volume 26, pages 259-275, 2005, Peyrard and Bouthemy present a method for motion-based video segmentation and segment classification, for use in video summarization. However, the classifications are defined to match the video genre of ice skating. Thus the method of Peyrard and Bouthemy is also limited to specific types of videos and is focused on the motion of objects.
In C. W. Ngo, Y. F. Ma, and H. J. Zhang, “Automatic video summarization by graph modeling,” Proc. Ninth IEEE International Conference on Computer Vision (ICCV'03), volume 1, page 104, 2003, Ngo et al. use a temporal graph that expresses the temporal relationship among clusters of shots. This method was developed for video where the shots are generally recorded in sequence, as in produced videos like a cartoon, commercial, or home video. The method assumes that each of the clusters can be grouped into scenes. It does not handle repetition of a scene with intervening scenes, nor does it handle scenes with camera motion separately from those where the camera is static.
Finally, in Itheri Yahiaoui, Bernard Merialdo, and Benoit Huet, “Automatic Video Summarization,” Multimedia Content-Based Indexing and Retrieval Workshop (MMCBIR), 2001, Yahiaoui et al. present a method for frame-based summarization. Their method is limited to color-based analysis of individual video frames. Additionally, the method is computationally expensive because all frames are clustered.
Therefore, what is needed is a system for summarizing a video that can review a video, classify the content, and summarize the video while still preserving the relevant content.