With the development of digital imaging and storage technologies, video clips can be conveniently captured by consumers using various devices such as camcorders, digital cameras or cell phones and stored for later viewing. However, efficient browsing, indexing and retrieval of such massive visual data remain significant challenges. Video summarization holds the promise of solving this problem by reducing the temporal redundancy and preserving only the visually or semantically important parts of the original video.
Video summarization is an active research area and several approaches for generating a video summary from an input video have been proposed. For example, the method disclosed by Jeannin et al. in U.S. Pat. No. 7,333,712, entitled “Visual summary for scanning forwards and backwards in video content” first extracts key frames from the input video and assigns a set of weights to the extracted key frames. A visual summary is then generated by filtering the key frames according to the relative weights assigned to these key frames.
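The weight-based filtering step described above can be sketched as follows; the function name, the keep_fraction parameter, and the frame indices and weights are illustrative assumptions rather than details of the patent:

```python
def filter_key_frames(key_frames, weights, keep_fraction=0.5):
    """Keep the highest-weighted fraction of key frames, returned in temporal order.

    keep_fraction is a hypothetical control parameter, not from the patent.
    """
    # Rank key frames by their assigned weights, highest first.
    ranked = sorted(zip(key_frames, weights), key=lambda fw: fw[1], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    # Restore temporal order for the retained frames.
    return sorted(frame for frame, _ in ranked[:n_keep])

# Illustrative frame indices and weights.
frames = [10, 45, 90, 150, 220, 300]
weights = [0.9, 0.2, 0.7, 0.4, 0.8, 0.1]
print(filter_key_frames(frames, weights, keep_fraction=0.5))  # [10, 90, 220]
```

The three highest-weighted frames (0.9, 0.8, 0.7) survive the filter and are re-sorted into temporal order for the visual summary.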
In U.S. Pat. No. 7,110,458, entitled “Method for summarizing a video using motion descriptors”, Divakaran et al. teach a method for forming a video summary that measures an intensity of motion activity in a compressed video and uses the intensity information to partition the video into segments. Key frames are then selected from each segment. The selected key frames are concatenated in temporal order to form a summary of the video.
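A minimal sketch of this motion-activity approach, assuming that a simple threshold crossing defines segment boundaries (the threshold value and the per-frame activity scores are hypothetical, not from the patent):

```python
def segment_by_motion(activity, threshold):
    """Partition per-frame motion-activity values into segments wherever the
    activity crosses the threshold (runs of low motion vs. high motion)."""
    segments, start = [], 0
    for i in range(1, len(activity)):
        if (activity[i] > threshold) != (activity[i - 1] > threshold):
            segments.append((start, i - 1))
            start = i
    segments.append((start, len(activity) - 1))
    return segments

def summarize(activity, threshold):
    """Select one key frame (the most active frame) per segment; concatenating
    the selections in temporal order forms the summary."""
    return [max(range(s, e + 1), key=lambda i: activity[i])
            for s, e in segment_by_motion(activity, threshold)]

activity = [0.1, 0.2, 0.9, 0.8, 0.1, 0.05, 0.7]  # illustrative intensities
print(summarize(activity, 0.5))  # [1, 2, 4, 6]
```

The segments here are (0, 1), (2, 3), (4, 5) and (6, 6); the most active frame of each is kept, already in temporal order.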
Peng et al., in the article “Keyframe-based video summarization using visual attention clue” (IEEE Multimedia, Vol. 17, pp. 64-73, 2010), teach computing visual attention index (VAI) values for the frames of a video sequence. The frames with higher VAI values are selected as key frames. A video summary is generated by controlling the key frame density.
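One plausible way to realize the selection of high-VAI frames with density control is to enforce a minimum temporal gap between key frames; the gap-based rule below is an assumption for illustration, not the article's actual density-control mechanism:

```python
def select_key_frames(vai, vai_threshold, min_gap):
    """Pick frames whose visual attention index (VAI) meets the threshold,
    while enforcing a minimum gap between key frames to control density."""
    selected = []
    for i, value in enumerate(vai):
        if value >= vai_threshold and (not selected or i - selected[-1] >= min_gap):
            selected.append(i)
    return selected

vai = [0.3, 0.9, 0.85, 0.2, 0.95, 0.1, 0.8]  # illustrative VAI values
print(select_key_frames(vai, vai_threshold=0.8, min_gap=2))  # [1, 4, 6]
```

Frame 2 has a high VAI (0.85) but falls within the minimum gap of frame 1, so it is skipped; lowering min_gap makes the summary denser.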
Another method, taught by Wang et al. in the article “Video summarization by redundancy removing and content ranking” (Proceedings of the 15th International Conference on Multimedia, pp. 577-580, 2007), detects shot boundaries using color histogram and optical-flow motion features and extracts key frames within each shot by a leader-follower clustering algorithm. A video summary is then generated by key frame clustering and repetitive segment detection.
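The shot-boundary step can be illustrated with a simple color-histogram difference test; the L1 distance and the threshold value are illustrative assumptions (the article additionally uses optical-flow motion features, omitted here):

```python
def histogram_diff(h1, h2):
    """L1 distance between two (normalized) color histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def detect_shot_boundaries(histograms, threshold):
    """Declare a shot boundary wherever consecutive frame histograms differ
    by more than the threshold; return (start, end) frame pairs per shot."""
    boundaries = [i for i in range(1, len(histograms))
                  if histogram_diff(histograms[i - 1], histograms[i]) > threshold]
    starts = [0] + boundaries
    ends = [b - 1 for b in boundaries] + [len(histograms) - 1]
    return list(zip(starts, ends))

# Tiny two-bin histograms standing in for real color histograms.
hists = [[1, 0], [1, 0], [0, 1], [0, 1], [0.5, 0.5]]
print(detect_shot_boundaries(hists, 0.8))  # [(0, 1), (2, 3), (4, 4)]
```

Each resulting (start, end) pair is a shot, within which key frames would then be extracted by clustering.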
All of the above approaches for video summarization rely on identifying key frames. These approaches are limited because their performance depends on the accuracy of the underlying key frame extraction algorithms.
In U.S. Pat. No. 7,630,562, entitled “Method and system for segmentation, classification, and summarization of video images,” Gong et al. teach mapping a feature representation of a sequence of video frames into a refined feature space using singular value decomposition. The information contained in each video shot is computed using a metric in the refined feature space, which in turn is used to generate a summary video sequence. However, singular value decomposition is susceptible to noise and non-linearity in the data.
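The idea of scoring a shot's information content via singular values can be sketched in pure Python; the power-iteration estimate and the "energy outside the dominant singular direction" score below are illustrative assumptions, not the patent's actual metric:

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def top_singular_value(A, iters=100):
    """Estimate the largest singular value of A by power iteration on A^T A."""
    v = [1.0] * len(A[0])
    for _ in range(iters):
        w = matvec(transpose(A), matvec(A, v))
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    Av = matvec(A, v)
    return sum(x * x for x in Av) ** 0.5

def shot_information(shot_features):
    """Energy not captured by the dominant singular direction: a crude proxy
    for how much the frames of a shot vary (hypothetical score)."""
    frob_sq = sum(x * x for row in shot_features for x in row)
    return frob_sq - top_singular_value(shot_features) ** 2

# Rows are per-frame feature vectors of one shot (illustrative values).
flat_shot = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]    # identical frames
varied_shot = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # changing content
print(shot_information(flat_shot) < shot_information(varied_shot))  # True
```

A shot of identical frames scores (near) zero, while a shot with varying content retains energy outside its dominant direction; a summary would allocate more key frames to the latter.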
U.S. Pat. No. 7,127,120 to Hua et al., entitled “Systems and methods for automatically editing a video,” teaches a sub-shot-based method for video summarization. In this method, sub-shots are first extracted from the video, and then a group of sub-shots is discarded based on importance measures assigned to them. A final video summary is generated by connecting the remaining sub-shots with respective transitions.
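A sketch of importance-based sub-shot selection follows; the duration-budget rule, field names, and values are hypothetical (the patent's transitions between sub-shots are omitted):

```python
def summarize_sub_shots(sub_shots, target_duration):
    """Greedily keep the most important sub-shots until the summary reaches
    the target duration, then restore temporal order for playback."""
    ranked = sorted(sub_shots, key=lambda s: s["importance"], reverse=True)
    kept, total = [], 0.0
    for s in ranked:
        if total + s["duration"] <= target_duration:
            kept.append(s)
            total += s["duration"]
    return sorted(kept, key=lambda s: s["start"])

# Illustrative sub-shots: start time (s), duration (s), importance score.
clips = [
    {"start": 0,  "duration": 5, "importance": 0.9},
    {"start": 5,  "duration": 7, "importance": 0.3},
    {"start": 12, "duration": 8, "importance": 0.7},
    {"start": 20, "duration": 4, "importance": 0.8},
]
print([s["start"] for s in summarize_sub_shots(clips, target_duration=12)])  # [0, 20]
```

The two most important sub-shots fit the 12-second budget; the 8-second clip at importance 0.7 would overflow it and is discarded.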
U.S. Pat. No. 6,751,776 to Gong, entitled “Method and apparatus for personalized multimedia summarization based upon user specified theme,” teaches an approach that uses both natural language processing and video analysis techniques to extract important keywords from the closed caption text as well as prominent visual features from the video footage. The extracted keywords and visual features are used to summarize the video content, producing a personalized multimedia summary based on a user-specified theme. However, this approach is not suitable for videos that do not contain closed caption text.
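The keyword-extraction side of such an approach might be sketched as a simple frequency ranking over the closed-caption text after stopword removal; the stopword list and caption text are illustrative, and real systems would use richer natural language processing:

```python
from collections import Counter

def extract_keywords(caption_text, stopwords, top_k=3):
    """Rank closed-caption words by frequency after stripping punctuation
    and removing stopwords (a crude stand-in for full NLP)."""
    words = [w.strip(".,!?;").lower() for w in caption_text.split()]
    counts = Counter(w for w in words if w and w not in stopwords)
    return [w for w, _ in counts.most_common(top_k)]

captions = "The keeper saves the shot; a brilliant save by the keeper!"
keywords = extract_keywords(captions, {"the", "a", "by"}, top_k=2)
print(keywords[0])  # 'keeper' ranks first with two occurrences
```

Keywords extracted this way could then be matched against a user-specified theme to pick which video segments to keep; without closed captions, this text channel is simply unavailable.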
There remains a need for a video summarization framework that is data adaptive, robust to noise and to differing content, and applicable to a wide variety of videos.