§2.1 Field of the Invention
The present invention concerns video coding and decoding. More specifically, the present invention concerns reducing the amount of data needed to code video in compressed form.
§2.2 Background Information
§2.2.1 Conventional Video Encoders and Decoders (Codecs)
Standard video coders are based on a prediction-plus-transform representation of an image block: the current block is predicted using various intra- and inter-prediction modes, and the prediction error is then represented using a fixed orthonormal transform. More specifically, FIG. 1 is a block diagram of a conventional encoder 100, such as an encoder that complies with the H.264 standard. The conventional encoder 100 operates on a current video frame 110 and one or more previously coded video frame(s) 125 and outputs a coded bitstream 155. The conventional encoder 100 includes an intra- and/or inter-frame based prediction unit 120, combiners 135 and 170, a transform and quantize unit 145, and an entropy encode unit 150. The corresponding decoder includes an entropy decode unit 150 and an inverse transform and inverse quantize (or more specifically, inverse transform and rescale) unit 160.
Still referring to FIG. 1, a current input frame 110 is received for encoding. The frame 110 is processed in units of a macroblock (“MB”) 115 (e.g., 16×16 pixels in the original image). Each macroblock is encoded in intra-mode (using information from the current frame 110) or inter-mode (using information from one or more previously coded frame(s) 125). In either case, the intra- and/or inter-frame based prediction unit 120 generates a prediction MB 130. The combiner 135 subtracts the prediction MB 130 from the current MB 115 to generate a residual or difference MB 140. The transform and quantize unit 145 transforms the residual MB 140 (e.g., using a block transform) and quantizes the result to generate a set of quantized transform coefficients. The entropy encode unit 150 entropy encodes these (e.g., re-ordered) coefficients. The entropy-encoded coefficients, together with ancillary information required to decode the MB (such as parameters defining the macroblock prediction mode, quantizer step size, motion vector information describing how the macroblock was motion-compensated, etc.), form the coded bitstream 155. The number of bits in the coded bitstream 155 is typically much smaller than the number of bits in the current frame 110 that was coded.
Recall that one or more previously coded frame(s) 125 may be used. Such frame(s) 125 may be generated as follows. The inverse transform and inverse quantize (rescale) unit 160 may be used to rescale and inverse transform the coefficients to generate a decoded residual MB 165. Note that the decoded residual MB 165 will not be identical to the original residual MB 140 because information is lost in the quantization process. Combiner 170 adds the prediction MB 130 to the decoded residual MB 165 to generate a reconstructed MB. A filter may be applied to reduce the effects of blocking distortion and a previously coded frame 125 is created from a series of the reconstructed MBs.
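By way of illustration only, the encoder-side loop described above (combiner 135, quantization, rescaling in unit 160, and combiner 170) may be sketched as follows. The sketch assumes NumPy, omits the block transform and the entropy coding for brevity, and uses a uniform quantizer with a hypothetical step size `step`. The function name `encode_block` and the synthetic test data are assumptions made for this example and are not taken from any standard.

```python
import numpy as np

def encode_block(current, prediction, step):
    """Return quantized indices and the encoder-side reconstruction of one block."""
    current = current.astype(np.int64)
    prediction = prediction.astype(np.int64)
    residual = current - prediction                       # combiner 135
    indices = np.round(residual / step).astype(np.int64)  # quantize (transform omitted)
    decoded_residual = indices * step                     # rescale, as in unit 160
    reconstructed = prediction + decoded_residual         # combiner 170
    return indices, reconstructed

# Synthetic 16x16 macroblock and an imperfect prediction of it (assumed data).
rng = np.random.default_rng(0)
current = rng.integers(0, 256, size=(16, 16))
prediction = np.clip(current + rng.integers(-6, 7, size=(16, 16)), 0, 255)

indices, reconstructed = encode_block(current, prediction, step=4)
# Quantization is lossy: the reconstruction differs from the original block,
# but with a uniform quantizer by no more than half a step per sample.
max_error = int(np.abs(reconstructed - current).max())
```

With this quantizer, the per-sample reconstruction error is bounded by half the step size, which is one way to see why the decoded residual MB 165 generally differs from the original residual MB 140.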
Referring back to unit 120, finding a suitable inter-frame prediction is often referred to as “motion estimation”. Subtracting an inter-frame prediction from the current macroblock is often referred to as “motion compensation.”
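As a non-limiting illustration, motion estimation may be sketched as an exhaustive block-matching search. The sketch assumes NumPy and uses the sum of absolute differences (“SAD”) as the matching criterion, which is an assumption made here since no particular criterion is specified above. The function name `motion_search` and its parameters are hypothetical.

```python
import numpy as np

def motion_search(current_block, reference, top, left, search_range):
    """Full-search block matching: return (dy, dx, sad) with the lowest SAD."""
    h, w = current_block.shape
    block = current_block.astype(np.int64)
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + h > reference.shape[0] or x + w > reference.shape[1]:
                continue
            candidate = reference[y:y + h, x:x + w].astype(np.int64)
            sad = int(np.abs(block - candidate).sum())
            if best is None or sad < best[2]:
                best = (dy, dx, sad)
    return best

rng = np.random.default_rng(1)
reference = rng.integers(0, 256, size=(32, 32))
# The current block holds reference content displaced by (2, -3) from (8, 8).
current_block = reference[10:18, 5:13].copy()

dy, dx, sad = motion_search(current_block, reference, top=8, left=8, search_range=4)
# Motion compensation: subtract the displaced reference block from the current block.
prediction = reference[8 + dy:16 + dy, 8 + dx:16 + dx]
residual = current_block.astype(np.int64) - prediction.astype(np.int64)
```

Practical encoders typically use faster search strategies than this exhaustive search; the full search is shown only because it is the simplest to verify.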
Referring back to unit 145, a block of residual samples may be transformed using the 4×4 or 8×8 Discrete Cosine Transform (“DCT”) or an integer transform that approximates the DCT. The transform outputs a set of coefficients, each of which is a weighting value for a standard basis pattern. When combined, the weighted basis patterns re-create the block of residual samples. When the transform coefficients are quantized, each coefficient is changed into an integer index that specifies which quantization bin the coefficient belongs to. Quantization reduces the precision of the transform coefficients according to a quantization parameter. Often, the result is a block in which most or all of the coefficients are zero, with a few non-zero coefficients. By tuning the quantization parameter, either (A) more coefficients are set to zero, resulting in higher compression but lower decoded image quality, or (B) more non-zero coefficients remain, resulting in higher decoded image quality but lower compression.
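The trade-off described above may be illustrated, under stated assumptions, with a floating-point orthonormal DCT and a uniform quantizer. The sketch assumes NumPy; the quantizer step size `step` stands in for the quantization parameter, and the function names are hypothetical. It is not the integer transform or the quantizer of any particular standard.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; row k holds the k-th basis vector."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def transform_and_quantize(residual, step):
    """2-D separable DCT followed by uniform quantization to integer indices."""
    c = dct_matrix(residual.shape[0])
    coefficients = c @ residual @ c.T
    return np.round(coefficients / step).astype(np.int64)

def rescale_and_inverse(indices, step):
    """Rescale the indices and apply the inverse 2-D DCT."""
    c = dct_matrix(indices.shape[0])
    return c.T @ (indices * step) @ c

rng = np.random.default_rng(2)
residual = rng.normal(0.0, 5.0, size=(4, 4))   # synthetic 4x4 residual block

fine = transform_and_quantize(residual, step=2.0)     # small step: more non-zeros
coarse = transform_and_quantize(residual, step=20.0)  # large step: more zeros
fine_error = np.linalg.norm(rescale_and_inverse(fine, 2.0) - residual)
coarse_error = np.linalg.norm(rescale_and_inverse(coarse, 20.0) - residual)
```

Any coefficient that quantizes to zero with the small step also quantizes to zero with the large step, so the coarse block never has more non-zero indices than the fine block.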
Referring back to unit 150, the video coding process produces a number of values that must be encoded to form the coded bitstream 155. These values may include, for example, (1) the quantized transform coefficient indices, (2) side information to enable the decoder to re-create the prediction, (3) information about the structure of the compressed data and the compression tools used during encoding, and (4) information about the complete video sequence. These values and parameters (also referred to as “syntax elements”) may be converted into binary codes using variable length coding and/or arithmetic coding. Each of these encoding methods produces an efficient, compact binary representation of the information. The coded bitstream 155 can then be stored and/or transmitted.
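As one example of variable length coding, the order-0 unsigned Exp-Golomb code assigns shorter codewords to smaller values. Codes of this family are used for many syntax elements in H.264, but the sketch below is offered only as an illustration and is not the entropy coder of any particular encoder. The function names are hypothetical, and bits are represented as a character string for readability.

```python
def exp_golomb_encode(value):
    """Order-0 unsigned Exp-Golomb codeword for a non-negative integer."""
    if value < 0:
        raise ValueError("value must be non-negative")
    binary = bin(value + 1)[2:]                 # binary form of value + 1
    return "0" * (len(binary) - 1) + binary     # zero prefix, then the binary form

def exp_golomb_decode(bits):
    """Decode a concatenation of codewords back into a list of integers."""
    values, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":                 # count the leading zeros
            zeros += 1
            pos += 1
        values.append(int(bits[pos:pos + zeros + 1], 2) - 1)
        pos += zeros + 1
    return values

# Mostly-zero values, as is typical of quantized coefficient indices.
symbols = [0, 1, 2, 0, 7, 0]
bitstream = "".join(exp_golomb_encode(s) for s in symbols)
decoded = exp_golomb_decode(bitstream)
```

Because the value zero maps to the single bit `1`, a block dominated by zeros is represented compactly; the six symbols above occupy 16 bits.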
FIG. 2 is a block diagram of a conventional decoder 200, such as a decoder that complies with the H.264 standard. The conventional decoder 200 receives a coded (i.e., compressed) bitstream 255 and outputs a current decoded frame 210. The conventional decoder 200 includes an intra- and/or inter-frame based prediction unit 220, a combiner 235, an inverse transform and inverse quantize (or rescale) unit 245, and an entropy decode unit 250. Elements of the coded bitstream 255 are entropy decoded by unit 250 to produce a set of quantized coefficient indices and other side information used for decoding. The inverse transform and inverse quantize (rescale) unit 245 receives the set of quantized coefficient indices, rescales them, and inverse transforms them to output a decoded residual MB 240. (The decoded residual MB 240 should be identical to its corresponding decoded residual MB 165 in the encoder 100.) The intra- and/or inter-frame based prediction unit 220 uses side information decoded from the coded bitstream 255 to create a prediction MB 230. (The prediction MB 230 should be identical to the corresponding prediction MB 130 in the encoder 100.) The combiner 235 adds the prediction MB 230 to the decoded residual MB 240 to generate a decoded MB 215. The current decoded frame 210 is formed from a series of decoded MBs.
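The requirement that the decoder reproduce the encoder's reconstruction exactly may be illustrated with the following minimal sketch. It assumes NumPy, omits the transform and the entropy coding, and uses a hypothetical uniform step size `STEP`; the function names and sample values are assumptions made for this example.

```python
import numpy as np

STEP = 4  # hypothetical quantizer step size shared by encoder and decoder

def encoder_side(current, prediction):
    """Quantize the residual and form the encoder's own reconstruction."""
    indices = np.round((current - prediction) / STEP).astype(np.int64)
    reconstruction = prediction + indices * STEP
    return indices, reconstruction

def decoder_side(indices, prediction):
    """Rescale the indices (as in unit 245) and add the prediction (combiner 235)."""
    return prediction + indices * STEP

current = np.array([[105, 101], [91, 100]])
prediction = np.array([[100, 102], [98, 101]])

indices, encoder_reconstruction = encoder_side(current, prediction)
decoded = decoder_side(indices, prediction)
```

Because both sides start from the same prediction and the same indices, the decoded block equals the encoder's reconstruction, although neither equals the original block.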
In currently prevalent block-based video coding standards, a single best prediction candidate is chosen from among all prediction candidates to predict the current MB, and the residual (also referred to as the “prediction error”) MB is then represented with a fixed orthogonal transform (e.g., the DCT).
§2.2.1.1 Using Dictionaries for Block-Based Image and Video Coding
In the following, as understood by those having ordinary skill in the art of information theory, “atoms” are elementary signals which can be used to decompose a signal. A “dictionary” is a set of atoms used to decompose a signal. A dictionary is said to be “orthonormal” if all atoms have norm one and are orthogonal to each other. With such a dictionary, representation coefficients of a signal can be computed as inner products of the signal and the atoms of the dictionary. An orthonormal dictionary is also called an orthonormal transform, and for a signal with dimension N, it will have N atoms. An “overcomplete” dictionary is one having more atoms than the dimensions of the signal.
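The definitions above may be illustrated numerically as follows. The sketch assumes NumPy and uses randomly generated atoms purely for demonstration; the variable names are hypothetical, and the atoms are stored as matrix columns.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 8  # signal dimension

# Orthonormal dictionary: the N columns of Q are unit-norm, mutually orthogonal atoms.
Q, _ = np.linalg.qr(rng.normal(size=(N, N)))
signal = rng.normal(size=N)

# Representation coefficients are inner products of the signal with each atom.
coefficients = Q.T @ signal
rebuilt = Q @ coefficients           # weighted sum of atoms re-creates the signal

# Overcomplete dictionary: more atoms (columns) than signal dimensions.
D = rng.normal(size=(N, 2 * N))
D /= np.linalg.norm(D, axis=0)       # normalize every atom to unit norm
```

With the overcomplete dictionary `D`, the representation of a signal is no longer unique, so the coefficients cannot simply be read off as inner products and a selection procedure is needed.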
Recent progress in the field of sparse representation has shown that signal representation using a redundant dictionary can be more efficient than using an orthonormal transform, because the redundant dictionary can be designed so that a typical signal can be approximated well by a sparse set of dictionary atoms. (See, e.g., the article, M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation,” IEEE Trans. on Signal Processing, vol. 54, no. 11, pp. 4311-4322 (2006), incorporated herein by reference.)
Several research groups have attempted using redundant dictionaries for block-based image and video coding. (See, e.g., the article, Karl Skretting and Kjersti Engan, “Image Compression Using Learned Dictionaries by RLS-DLA and Compared with K-SVD,” IEEE International Conference on Acoustics, Speech, and Signal Processing, (2011) (incorporated herein by reference), the article, Joaquin Zepeda, Christine Guillemot, and Ewa Kijak, “Image Compression Using Sparse Representations and the Iteration Tuned and Aligned Dictionary,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 5, pp. 1061-1073 (2011) (incorporated herein by reference), the article, Philippe Schmid-Saugeon and Avideh Zakhor, “Dictionary Design for Matching Pursuit and Application to Motion Compensated Video Coding,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 6, pp. 880-886 (2004) (incorporated herein by reference), the article, Je-Won Kang, C-CJ Kuo, Robert Cohen, and Anthony Vetro, “Efficient Dictionary Based Video Coding with Reduced Side Information,” Circuits and Systems (ISCAS), 2011 IEEE International Symposium on. IEEE, 2011, pp. 109-112 (incorporated herein by reference), and the article, Je-Won Kang, Moncef Gabbouj, and C.-C. Jay Kuo, “Sparse/DCT (S/DCT) Two-Layered Representation Of Prediction Residuals for Video Coding,” IEEE Trans. on Image Processing, vol. 22, no. 7, pp. 2711-2722 (July 2013) (incorporated herein by reference).) In these reported dictionary-based video coders, the dictionary atoms are used to represent the motion-compensation error block (that is, the residual MB) for inter-frame video coding.
Instead of using a single dictionary, the article, Je-Won Kang, C-CJ Kuo, Robert Cohen, and Anthony Vetro, “Efficient Dictionary Based Video Coding with Reduced Side Information,” Circuits and Systems (ISCAS), 2011 IEEE International Symposium on. IEEE, 2011, pp. 109-112 (incorporated herein by reference) uses multiple dictionaries, pre-designed for different residual energy levels. The work in the article, Je-Won Kang, Moncef Gabbouj, and C.-C. Jay Kuo, “Sparse/DCT (S/DCT) Two-Layered Representation Of Prediction Residuals for Video Coding,” IEEE Trans. on Image Processing, vol. 22, no. 7, pp. 2711-2722 (July 2013) (incorporated herein by reference) proposes a two-layered transform coding framework, which finds a sparse representation using orthogonal matching pursuit (“OMP”) with an overcomplete dictionary learned offline using the K-Singular Value Decomposition (“K-SVD”) algorithm, and codes the resultant approximation error with the fixed DCT. The work in the article, Yipeng Sun, Mai Xu, Xiaoming Tao, and Jianhua Lu, “Online Dictionary Learning Based Intra-Frame Video Coding Via Sparse Representation,” Wireless Personal Multimedia Communications (WPMC), 2012 15th International Symposium on, (2012) (incorporated herein by reference) codes each frame in the intra-frame mode, with a dictionary that is updated in real time based on the previously coded frames. Although such online adaptation can yield a dictionary that matches the video content very well, it is computationally very demanding.
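For orientation, a generic form of OMP may be sketched as follows. This is a textbook-style sketch assuming NumPy, not the implementation of any of the cited articles; the function name `omp`, its parameters, and the random dictionary (standing in for a dictionary learned offline) are assumptions made for this example.

```python
import numpy as np

def omp(dictionary, signal, sparsity):
    """Greedy OMP; dictionary columns are assumed to be unit-norm atoms."""
    signal = signal.astype(float)
    residual = signal.copy()
    support, solution = [], np.zeros(0)
    for _ in range(sparsity):
        if np.allclose(residual, 0.0):
            break                                  # signal already represented
        correlations = np.abs(dictionary.T @ residual)
        if support:
            correlations[support] = 0.0            # never reselect a chosen atom
        support.append(int(np.argmax(correlations)))
        chosen = dictionary[:, support]
        # Re-fit all chosen atoms jointly by least squares (the "orthogonal" step).
        solution, *_ = np.linalg.lstsq(chosen, signal, rcond=None)
        residual = signal - chosen @ solution
    coefficients = np.zeros(dictionary.shape[1])
    if support:
        coefficients[support] = solution
    return coefficients, residual

rng = np.random.default_rng(4)
dictionary = rng.normal(size=(16, 32))             # 32 atoms for 16-dimensional signals
dictionary /= np.linalg.norm(dictionary, axis=0)
signal = rng.normal(size=16)

coefficients, residual = omp(dictionary, signal, sparsity=3)
```

In a two-layered scheme of the kind described above, `residual` would correspond to the approximation error that is subsequently coded with the fixed DCT.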
It would be useful to provide better video coding and/or decoding techniques.