With the development of digital imaging and storage technologies, video clips can be conveniently captured by consumers using various devices such as camcorders, digital cameras or cell phones and stored for later viewing and processing. Efficient content-aware video representation models are critical for many video analysis and processing applications including denoising, restoration, and semantic analysis.
Developing models to capture the spatiotemporal information present in video data is an active research area, and several approaches to represent video data content effectively have been proposed. For example, Cheung et al., in the article “Video epitomes” (Proc. IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 42-49, 2005), teach a patch-based probability model to represent video content. However, their model does not capture spatial correlation.
In the article “Recursive estimation of generative models of video” (Proc. IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 79-86, 2006), Petrovic et al. teach a generative model and learning procedure for unsupervised clustering of video into scenes. However, they assume that each video contains only one scene. Furthermore, their framework does not model local motion.
Peng et al., in the article “RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images” (Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 763-770, 2010), teach a sparsity-based method for simultaneously aligning a batch of linearly correlated images. However, this model is not well suited to video processing because video frames, in general, are not linearly correlated.
Key frame extraction algorithms are used to select a subset of the most informative frames from a video, with the goal of representing the most significant content of the video with a limited number of frames. Key frame extraction finds applications in several broad areas of video processing such as video summarization, creating “chapter titles” in DVDs, video indexing, and making prints from video. Key frame extraction is an active research area, and many approaches for extracting key frames from videos have been proposed.
Conventional key frame extraction approaches can be loosely divided into two groups: (i) shot-based, and (ii) segment-based. In shot-based key frame extraction, the shots of the original video are first detected, and one or more key frames are extracted for each shot (for example, see: Uchihashi et al., “Summarizing video using a shot importance measure and a frame-packing algorithm,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 6, pp. 3041-3044, 1999). In segment-based key frame extraction approaches, a video is segmented into higher-level video components, where each segment or component could be a scene, an event, a set of one or more shots, or even the entire video sequence. Representative frame(s) from each segment are then selected as the key frames (for example, see: Rasheed et al., “Detection and representation of scenes in videos,” IEEE Trans. Multimedia, Vol. 7, pp. 1097-1105, 2005).
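The shot-based flow described above (detect shots first, then pick key frames per shot) can be illustrated with a minimal Python sketch. It is not the method of Uchihashi et al. or of any other cited reference. It assumes each frame has already been reduced to a color histogram, that a shot boundary is declared wherever the L1 distance between consecutive histograms exceeds a chosen threshold, and that the key frame of a shot is the frame closest to the shot's mean histogram. The function names and the threshold parameter are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Histogram = Sequence[float]


def histogram_distance(a: Histogram, b: Histogram) -> float:
    """L1 distance between two equal-length histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


def detect_shots(histograms: Sequence[Histogram], threshold: float) -> List[List[int]]:
    """Group frame indices into shots.

    Assumption: a new shot starts wherever the distance between
    consecutive frame histograms exceeds `threshold`.
    """
    if not histograms:
        return []
    shots: List[List[int]] = [[0]]
    for i in range(1, len(histograms)):
        if histogram_distance(histograms[i - 1], histograms[i]) > threshold:
            shots.append([])  # boundary found: open a new shot
        shots[-1].append(i)
    return shots


def key_frame_for_shot(shot: Sequence[int], histograms: Sequence[Histogram]) -> int:
    """Return the index of the frame closest to the shot's mean histogram."""
    n_bins = len(histograms[shot[0]])
    mean = [sum(histograms[i][b] for i in shot) / len(shot) for b in range(n_bins)]
    return min(shot, key=lambda i: histogram_distance(histograms[i], mean))


def extract_key_frames(histograms: Sequence[Histogram], threshold: float) -> List[int]:
    """Shot-based extraction: one key frame per detected shot."""
    return [key_frame_for_shot(shot, histograms)
            for shot in detect_shots(histograms, threshold)]


if __name__ == "__main__":
    # Six toy frames with two-bin histograms; the content changes abruptly at frame 3.
    frames = [[10, 0], [9, 1], [8, 2], [0, 10], [1, 9], [2, 8]]
    print(extract_key_frames(frames, threshold=5))
```

A fixed threshold of this kind presumes reasonably clean cuts between shots, which is part of why such approaches tend to fit structured video better than unconstrained footage.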
Existing key frame selection approaches, both shot-based and segment-based, are usually suitable for structured videos such as news and sports videos. However, they are sub-optimal for consumer videos, as these videos are typically captured in an unconstrained environment and record extremely diverse content. Moreover, consumer videos often lack a pre-imposed structure, which makes it even more challenging to detect shots or segment such videos for key frame extraction (see: Costello et al., “First- and third-party ground truth for key frame extraction from consumer video clips,” in Proc. SPIE 6492, pp. 64921N, 2007 and Luo et al., “Towards extracting semantically meaningful key frames from personal video clips: from humans to computers,” IEEE Trans. Circuits Syst. Video Technol., Vol. 19, pp. 289-301, 2009).
There remains a need for robust and efficient methods to process digital video sequences captured in an unconstrained environment to perform tasks such as identifying key frames, identifying scene boundaries and forming video summaries.