Consumers and organizations are inundated with billions of hours of video footage every day, potentially containing events, people, and objects of context-dependent and time-space-sensitive interest. However, even to the creators and owners of the video data, and to the people who are granted access for various purposes, the content in these videos remains unindexed, unstructured, unsearchable and unusable. Watching the recorded footage in real time, or even playing it at 2× or 4× speed, is tedious. As this body of unstructured video data grows, the information it contains is nearly impossible to access efficiently unless the footage has already been viewed and indexed, an undertaking that would be tedious and time-consuming for a human but is an ideal challenge for machine intelligence.
Previous research on video summarization and abstraction has mainly focused on edited videos, e.g., movies, news, and sports, which are highly structured. For example, a movie can be naturally divided into scenes, each formed by one or more shots taking place at the same site, where each shot is further composed of frames with smooth and continuous motion. However, consumer-generated videos and surveillance videos lack such structure, which often renders previous research not directly applicable.
Key-frame-based methods compose a video summary as a collection of salient images (key frames) picked from the original video. Various selection strategies have been studied, including shot boundary detection, color histogram analysis, motion stability, clustering, curve splitting and frame self-expressiveness. However, isolated and uncorrelated still images, without smooth temporal continuation, are not well suited to helping the viewer understand the original video. Moreover, one prior art method takes a saliency-based approach, training a linear regression model to predict an importance score for each frame in egocentric videos. However, the special features designed for that method limit its applicability to videos generated by wearable cameras.
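By way of illustration only, the following minimal sketch shows how two of the strategies named above, color histogram analysis and shot boundary detection, might be combined to pick key frames. It does not reproduce any particular prior art method. The function names, the 8-bin grayscale histogram, the 0.5 threshold, and the choice of the middle frame of each shot as its key frame are all assumptions made for the example, and each frame is assumed to be a non-empty flat sequence of pixel intensities in the range 0 to 255.

```python
from typing import List, Sequence


def gray_histogram(frame: Sequence[int], bins: int = 8) -> List[float]:
    """Normalized intensity histogram of one frame (flat 0-255 pixel values)."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    total = len(frame)  # assumed non-zero
    return [c / total for c in counts]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Half the L1 distance: 0.0 for identical histograms, 1.0 for disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_boundaries(frames: Sequence[Sequence[int]],
                           threshold: float = 0.5,
                           bins: int = 8) -> List[int]:
    """Indices i where frame i differs sharply from frame i - 1 (hypothetical threshold)."""
    hists = [gray_histogram(f, bins) for f in frames]
    return [i for i in range(1, len(hists))
            if histogram_distance(hists[i - 1], hists[i]) > threshold]


def pick_key_frames(frames: Sequence[Sequence[int]],
                    threshold: float = 0.5,
                    bins: int = 8) -> List[int]:
    """One key frame index per detected shot: the middle frame of the shot."""
    if not frames:
        return []
    cuts = [0] + detect_shot_boundaries(frames, threshold, bins) + [len(frames)]
    return [(start + end - 1) // 2 for start, end in zip(cuts, cuts[1:])]


# Toy input: three dark frames followed by four bright frames.
frames = [[10] * 16] * 3 + [[200] * 16] * 4
key_frames = pick_key_frames(frames)
```

The output of such a procedure is a list of frame indices, that is, a set of isolated still images, which illustrates the lack of temporal continuity noted above.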
Besides picking frames from the original video, methods creating new images not present in the original video have also been studied, where a panoramic image is generated from a few consecutive frames determined to have important content. However, the number of consecutive frames from the original video used to construct such a panoramic image is limited by occlusion between objects from different frames. Consequently, these approaches generally produce short clips with few objects.
Finally, summaries composed of a collection of video segments have been studied for edited videos. Specifically, one prior art method uses scene boundary detection, dialogue analysis, and color histogram analysis to produce a trailer for a feature film. Other prior art methods extract important segments from sports and news programs by exploiting special characteristics of these videos, including fixed scene structures, dominant locations, and backgrounds. Still other methods use closed captions and speech recognition to transform video summarization into a text summarization problem and generate summaries using natural language processing techniques. However, the large body of consumer-generated videos and surveillance videos typically has no such special structure, and audio information is often absent, especially in surveillance video.