Prediction Segmentation [Primary]
Conventional video compression standards, for example MPEG-4 and H.264, provide facilities for specifying a number of reference frames to use during motion-compensated prediction of the current frame. These standards typically restrict the reference frames to one or more consecutive past frames, or in some cases to any set of previously decoded frames. Usually, there is a limit on the number of reference frames and also a limit on how far back in the stream of decoded frames the selection process may draw.
Compressed Sensing (CS)
Image and video compression techniques generally attempt to exploit redundancy in the data that allows the most important information in the data to be captured in a “small” number of parameters. “Small” is defined relative to the size of the original raw data. It is not known in advance which parameters will be important for a given data set. Because of this, conventional image/video compression techniques compute (or measure) a relatively large number of parameters before selecting those that will yield the most compact encoding. For example, the JPEG and JPEG 2000 image compression standards are based on linear transforms (typically the discrete cosine transform [DCT] or discrete wavelet transform [DWT]) that convert image pixels into transform coefficients, resulting in a number of transform coefficients equal to the number of original pixels. In transform space, the important coefficients can then be selected by various techniques. One example is scalar quantization. When taken to an extreme, this is equivalent to magnitude thresholding. While the DCT and DWT can be computed efficiently, the need to compute the full transform before data reduction causes inefficiency. The computation requires a number of measurements equal to the size of the input data for these two transforms. This characteristic of conventional image/video compression techniques makes them impractical for use when high computational efficiency is required.
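The cost described above can be illustrated with a minimal sketch. It assumes a 1-D signal in place of an image and a hand-built orthonormal DCT-II matrix; the names dct_matrix and threshold_code are hypothetical and are not taken from any standard. The point is only that all n coefficients are computed before any are discarded by magnitude thresholding.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (rows are basis functions)."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] /= np.sqrt(2.0)     # rescale the constant row so C is orthonormal
    return C

def threshold_code(signal, keep):
    """Compute every coefficient, then keep only the `keep` largest in magnitude."""
    C = dct_matrix(len(signal))
    coeffs = C @ signal                        # full transform: n coefficients for n samples
    order = np.argsort(np.abs(coeffs))[::-1]   # indices sorted by decreasing magnitude
    kept = np.zeros_like(coeffs)
    kept[order[:keep]] = coeffs[order[:keep]]  # magnitude thresholding
    approx = C.T @ kept                        # inverse of an orthonormal transform is its transpose
    return coeffs, kept, approx

# Synthetic test signal built from two DCT basis functions (an assumption for illustration).
n = 64
t = np.arange(n)
signal = (np.cos(np.pi * (2 * t + 1) * 3 / (2 * n))
          + 0.5 * np.cos(np.pi * (2 * t + 1) * 10 / (2 * n)))

coeffs, kept, approx = threshold_code(signal, keep=2)
```

Here the encoder still evaluates 64 coefficients even though only two are retained, which is the inefficiency the paragraph above refers to.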
Conventional compression allows for the blending of multiple matches from multiple frames to predict regions of the current frame. The blending is often linear, or a log scaled linear combination of the matches. One example of when this bi-prediction method is effective is when there is a fade from one image to another over time. The process of fading is a linear blending of two images, and the process can sometimes be effectively modeled using bi-prediction. Further, the MPEG-2 Interpolative mode allows for the interpolation of linear parameters to synthesize the bi-prediction model over many frames.
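As a rough illustration of why a fade suits bi-prediction, the sketch below builds a synthetic cross-fade between two small random frames and fits a single blending weight by least squares. The frame contents, the weight of 0.25, and the function name bi_predict are assumptions made for the example; actual codecs signal prediction weights in their own bitstream syntax.

```python
import numpy as np

def bi_predict(ref_a, ref_b, weight):
    """Linear blend of two reference blocks."""
    return weight * ref_a + (1.0 - weight) * ref_b

rng = np.random.default_rng(0)
frame_a = rng.uniform(0.0, 255.0, size=(4, 4))   # hypothetical reference frame
frame_b = rng.uniform(0.0, 255.0, size=(4, 4))   # hypothetical reference frame

alpha = 0.25
current = alpha * frame_a + (1.0 - alpha) * frame_b   # synthetic fade frame

# Least-squares estimate of the blending weight from the three frames.
d = (frame_a - frame_b).ravel()
w = float(d @ (current - frame_b).ravel() / (d @ d))

residual = current - bi_predict(frame_a, frame_b, w)  # what would remain to be coded
```

Because the fade is itself a linear blend, the fitted weight matches the fade parameter and the prediction residual is essentially zero.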
Conventional compression allows for the specification of one or more reference frames from which predictions for the encoding of the current frame can be drawn. While the reference frames are typically temporally adjacent to the current frame, there is also accommodation for the specification of reference frames from outside the set of the temporally adjacent frames.
In contrast with conventional transform-based image/video compression algorithms, compressed sensing (CS) algorithms directly exploit much of the redundancy in the data during the measurement (“sensing”) step. Redundancy in the temporal, spatial, and spectral domains is a major contributor to higher compression rates. The key result for all compressed sensing algorithms is that a compressible signal can be sensed with a relatively small number of random measurements, much smaller than the number required by conventional compression algorithms. The images can then be reconstructed accurately and reliably. Given known statistical characteristics, a subset of the visual information is used to infer the rest of the data.
The precise number of measurements required in a given CS algorithm depends on the type of signal as well as the “recovery algorithm” that reconstructs the signal from the measurements (coefficients). Note that the number of measurements required by a CS algorithm to reconstruct signals with some certainty is not directly related to the computational complexity of the algorithm. For example, a class of CS algorithms that uses L1-minimization to recover the signal requires a relatively small number of measurements, but the L1-minimization algorithm is very slow (not real-time). Thus, practical compressed sensing algorithms seek to balance the number of required measurements with the accuracy of the reconstruction and with computational complexity. CS provides a radically different model of codec design compared to conventional codecs.
In general, there are three major steps in a typical CS algorithm: (1) create the measurement matrix M; (2) take measurements of the data using the measurement matrix, also known as creating an encoding of the data; and (3) recover the original data from the encoding, also known as the decoding step. The recovery algorithm (decoder) can be complex, and because there are fewer limits to computational power at the receiver, the overall CS algorithm is usually named after its decoder. Three practical CS algorithms are of interest in the prior art: Orthogonal Matching Pursuit (OMP), L1 Minimization (L1M), and Chaining Pursuit (CP). In practice, L1M is prohibitively computationally inefficient for most video processing applications. The more efficient OMP and CP algorithms provide many of the same benefits as L1M and, as such, are the two CS algorithms of choice in most applications where L1M would otherwise be used.
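The three steps can be sketched as follows, under several assumptions: the signal is synthetic and exactly sparse, the measurement matrix is random Gaussian, the sizes n, m, and k are arbitrary, and the decoder is a compact greedy OMP-style routine (the name omp_decode is hypothetical) that is told the sparsity in advance. It is meant as an illustration of the encode/decode split, not as any particular published implementation.

```python
import numpy as np

def omp_decode(M, y, sparsity):
    """Greedy OMP-style decoder: recover a sparse x from y = M @ x."""
    residual = y.copy()
    support = []
    coeffs = np.zeros(0)
    col_norms = np.linalg.norm(M, axis=0)
    for _ in range(sparsity):
        if np.linalg.norm(residual) < 1e-10:
            break
        # Pick the column most correlated with what is still unexplained.
        k = int(np.argmax(np.abs(M.T @ residual) / col_norms))
        support.append(k)
        # Re-fit all selected columns jointly by least squares.
        coeffs, *_ = np.linalg.lstsq(M[:, support], y, rcond=None)
        residual = y - M[:, support] @ coeffs
    x_hat = np.zeros(M.shape[1])
    x_hat[support] = coeffs
    return x_hat

rng = np.random.default_rng(0)
n, m, k = 128, 64, 3   # signal length, measurements, nonzeros (assumed sizes)

x = np.zeros(n)        # synthetic sparse signal
x[rng.choice(n, size=k, replace=False)] = [1.5, -2.0, 1.0]

M = rng.standard_normal((m, n)) / np.sqrt(m)   # (1) create the measurement matrix
y = M @ x                                      # (2) encode: m measurements, with m < n
x_hat = omp_decode(M, y, sparsity=k)           # (3) decode: recover the signal
```

The encoder's work is a single matrix-vector product, while the iteration and least-squares fitting sit in the decoder, which is consistent with the observation that the decoder carries most of the complexity.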
Image Alignment Via Inverse Compositional Algorithm
Basri and Jacobs (“Lambertian Reflectances and Linear Subspaces,” IEEE Trans. Pattern Analysis and Machine Intelligence, February 2003), henceforth referred to as LRLS, have shown that Lambertian objects (whose surfaces reflect light in all directions) can be well-approximated by a small (9-dimensional) linear subspace of LRLS “basis images” based on spherical harmonic functions. The LRLS basis images can be visualized as versions of the object under different lighting conditions and textures. The LRLS basis images thus depend on the structure of the object (through its surface normals), the albedo of the object at its different reflection points, and the illumination model (which follows Lambert's cosine law, integrated over direction, to produce spherical harmonic functions). Under the assumptions of the model, the 9-D subspace captures more than 99% of the energy intensity in the object image. The low dimensionality of the appearance subspace indicates a greater redundancy in the data than is available to conventional compression schemes.
The inverse compositional algorithm (IC) was first proposed as an efficient implementation of the Lucas-Kanade algorithm for 2D motion estimation and image registration. Subsequent implementations have used the IC algorithm to fit 3D models such as Active Appearance Models and the 3D morphable model (3DMM) to face images.
Application of Incremental Singular Value Decomposition (ISVD) Algorithm
A common dimensionality reduction technique involves the utilization of linear transformations on norm preserving bases. Reduction of an SVD representation refers to the deletion of certain singular value/singular vector pairs in the SVD to produce a more computationally and representationally efficient representation of the data. Most commonly, the SVD factorization is effectively reduced by zeroing all singular values below a certain threshold and deleting the corresponding singular vectors. This magnitude thresholding results in a reduced SVD with r singular values (r<N) that is the best r-dimensional approximation of the data matrix D from an L2-norm perspective. The reduced SVD is given by

$$\hat{D} = U_r S_r V_r^{T}, \qquad \text{(Equation 1)}$$

where $U_r$ is M×r, $S_r$ is r×r diagonal, and $V_r$ is N×r.
The singular value decomposition (SVD) is a factorization of a data matrix that leads naturally to minimal (compact) descriptions of the data. Given a data matrix D of size M×N, the SVD factorization is given by D=U*S*V′ where U is an M×N column-orthogonal matrix of (left) singular vectors, S is an N×N diagonal matrix with singular values (s1, s2, . . . sN) along the diagonal, and V is an N×N orthogonal matrix of (right) singular vectors.
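A minimal NumPy sketch of the factorization and its rank-r reduction follows. The matrix sizes and the random data matrix are assumptions for illustration; np.linalg.svd with full_matrices=False returns the M×N, N, and N×N factors described above, with the singular values sorted in decreasing order.

```python
import numpy as np

rng = np.random.default_rng(0)
num_rows, num_cols, r = 8, 5, 2                  # M, N, and the reduced rank r (assumed)
D = rng.standard_normal((num_rows, num_cols))    # synthetic data matrix

# Thin SVD: U is M x N, s holds the N singular values, Vt is V transposed (N x N).
U, s, Vt = np.linalg.svd(D, full_matrices=False)

# Reduction: keep the r largest singular values and their singular vectors.
U_r = U[:, :r]            # M x r
S_r = np.diag(s[:r])      # r x r diagonal
V_r = Vt[:r, :].T         # N x r
D_hat = U_r @ S_r @ V_r.T

# The spectral-norm error of the best rank-r approximation equals the
# first discarded singular value.
err = np.linalg.norm(D - D_hat, 2)
```

Storing U_r, S_r, and V_r takes r(M+N+1) numbers instead of M×N, which is where the representational saving comes from when r is small.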
Compact Manifold Prediction
Matching pursuit (MP) is an iterative algorithm for deriving efficient signal representations. Given the problem of representing a signal vector s in terms of a dictionary D of basis functions (not necessarily orthogonal), MP selects functions for the representation via the iterative process described here. The first basis function in the representation (denoted as d1) is selected as the one having maximum correlation with the signal vector. Next, a residual vector r1 is computed by subtracting the projection of the signal onto d1 from the signal itself: r1=s−(d1′*s)*d1, taking d1 to have unit norm. Then, the next function in the representation (d2) is selected as the one having maximum correlation with the residual r1. The projection of r1 onto d2 is subtracted from r1 to form another residual r2. The same process is then repeated until the norm of the residual falls below a certain threshold.
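The iteration can be sketched as below, assuming the dictionary is stored as the columns of a matrix and that each column has unit norm. The function name matching_pursuit, the tolerance, the iteration cap, and the three-function dictionary are all hypothetical choices for the example.

```python
import numpy as np

def matching_pursuit(s, D, tol=1e-6, max_iter=100):
    """Greedy MP. Columns of D are assumed to be unit-norm basis functions."""
    residual = s.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(max_iter):
        if np.linalg.norm(residual) < tol:
            break
        corr = D.T @ residual                # correlation with every basis function
        k = int(np.argmax(np.abs(corr)))     # best-matching function
        coeffs[k] += corr[k]                 # accumulate its coefficient
        residual = residual - corr[k] * D[:, k]   # subtract the projection onto d_k
    return coeffs, residual

# Small non-orthogonal dictionary in two dimensions (assumed for illustration).
D = np.array([[1.0, 0.0, 1.0 / np.sqrt(2.0)],
              [0.0, 1.0, 1.0 / np.sqrt(2.0)]])
s = np.array([2.0, 1.0])

coeffs, residual = matching_pursuit(s, D)
```

Each pass removes only the component along the newly selected function, so a function chosen earlier can be chosen again later; this is the behavior that the orthogonalized variant avoids.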
Orthogonal matching pursuit (OMP) follows the same iterative procedure as MP, except that an extra step is taken to ensure that the residual is orthogonal to every function already in the representation ensemble. While the OMP recursion is more complicated than in MP, the extra computations ensure that OMP converges to a solution in no more than Nd steps, where Nd is the number of functions in the dictionary D.
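A minimal sketch of the orthogonalized variant follows. It assumes unit-norm dictionary columns and implements the extra step as a least-squares re-fit over all functions selected so far, which is one common way to keep the residual orthogonal to them; the function name, tolerance, and example dictionary are hypothetical.

```python
import numpy as np

def orthogonal_matching_pursuit(s, D, tol=1e-9):
    """Greedy OMP. Columns of D are assumed to be unit-norm basis functions."""
    residual = s.astype(float).copy()
    selected = []
    coeffs = np.zeros(0)
    for _ in range(D.shape[1]):              # at most Nd selections
        if np.linalg.norm(residual) < tol:
            break
        corr = D.T @ residual
        if selected:
            corr[selected] = 0.0             # never reselect a function
        k = int(np.argmax(np.abs(corr)))
        selected.append(k)
        # Extra step: re-fit every selected function jointly, which leaves the
        # residual orthogonal to all of them.
        coeffs, *_ = np.linalg.lstsq(D[:, selected], s, rcond=None)
        residual = s - D[:, selected] @ coeffs
    return selected, coeffs, residual

# Small non-orthogonal dictionary in two dimensions (assumed for illustration).
D = np.array([[1.0, 0.0, 1.0 / np.sqrt(2.0)],
              [0.0, 1.0, 1.0 / np.sqrt(2.0)]])
s = np.array([2.0, 1.0])

selected, coeffs, residual = orthogonal_matching_pursuit(s, D)
```

Since no function is selected twice, the loop cannot run for more than Nd iterations, and the per-iteration cost rises because of the least-squares solve.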