With the development of digital imaging and storage technologies, video clips can be conveniently captured by consumers using various devices such as camcorders, digital cameras or cell phones and stored for later viewing and processing. Efficient content-aware video representation models are critical for many video analysis and processing applications including denoising, restoration, and semantic analysis.
Developing models that capture the spatiotemporal information present in video data is an active research area, and several approaches to representing video content effectively have been proposed. For example, Cheung et al., in the article “Video epitomes” (Proc. IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 42-49, 2005), teach using patch-based probability models to represent video content. However, their model does not capture spatial correlation.
In the article “Recursive estimation of generative models of video” (Proc. IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 79-86, 2006), Petrovic et al. teach a generative model and learning procedure for unsupervised clustering of video into scenes. However, they assume that each video contains only one scene. Furthermore, their framework does not model local motion.
Peng et al., in the article “RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images” (Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 763-770, 2010), teach a sparsity-based method for simultaneously aligning a batch of linearly correlated images. However, this model is not suitable for video processing because video frames, in general, are not linearly correlated.
Another method taught by Baron et al., in the article “Distributed compressive sensing” (preprint, 2005), models both intra- and inter-signal correlation structures for distributed coding algorithms.
In the article “Compressive acquisition of dynamic scenes” (Proc. 11th European Conference on Computer Vision, pp. 129-142, 2010), Sankaranarayanan et al. teach a compressed sensing-based model for capturing video data at a rate much lower than the Nyquist rate. However, this model works only for single-scene videos.
In the article “A compressive sensing approach for expression-invariant face recognition” (Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1518-1525, 2009), Nagesh et al. teach a face recognition algorithm based on the theory of compressed sensing. Given a set of registered training face images of one person, their algorithm estimates a common image and a series of innovation images. The innovation images are further exploited for face recognition. However, this algorithm is not suitable for video modeling because it was designed explicitly for face recognition and does not preserve pixel-level information.
There remains a need for a video representation framework that is data adaptive, robust to noise and to differences in content, and applicable to a wide variety of videos and to applications including reconstruction, denoising, and semantic understanding.