This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
A video codec comprises an encoder that transforms input video into a compressed representation suited for storage and/or transmission and a decoder that can uncompress the compressed video representation back into a viewable form. Typically, the encoder discards some information in the original video sequence in order to represent the video in a more compact form, i.e., at a lower bitrate.
Typical hybrid video codecs, for example ITU-T H.263 and H.264, encode video information in two phases. In the first phase, pixel values in a certain picture area or “block” are predicted. These pixel values can be predicted, for example, by motion compensation mechanisms, which involve finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded. Additionally, pixel values can be predicted by spatial mechanisms, which involve using the pixel values around the block to be coded in a specified manner. The second phase involves coding the prediction error, i.e., the difference between the predicted block of pixels and the original block of pixels. This is typically accomplished by transforming the difference in pixel values using a specified transform (e.g., a Discrete Cosine Transform (DCT) or a variant thereof), quantizing the coefficients, and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (i.e., the picture quality) and the size of the resulting coded video representation (i.e., the file size or transmission bitrate).
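The second phase described above can be illustrated with a minimal sketch. It assumes a square block, an orthonormal floating-point DCT-II, and a single uniform quantization step; the function names and the parameter qstep are hypothetical, entropy coding is omitted, and real codecs use integer transform approximations and more elaborate quantizers.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a sequence of samples."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(block):
    """Separable 2-D DCT: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major order

def encode_residual(original, predicted, qstep):
    """Form the prediction error, transform it, and quantize the coefficients.

    qstep is an illustrative uniform quantizer step size; a larger step
    lowers fidelity and produces smaller quantized levels.
    """
    residual = [[o - p for o, p in zip(ro, rp)]
                for ro, rp in zip(original, predicted)]
    coeffs = dct_2d(residual)
    return [[round(c / qstep) for c in row] for row in coeffs]

# A flat prediction error collapses into a single DC level.
original = [[104] * 4 for _ in range(4)]
predicted = [[100] * 4 for _ in range(4)]
levels = encode_residual(original, predicted, qstep=4)
```

Increasing qstep in this sketch trades picture quality for fewer and smaller nonzero levels, which is the balance between accuracy and bitrate that the paragraph describes.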
The decoder reconstructs output video by applying prediction mechanisms that are similar to those used by the encoder in order to form a predicted representation of the pixel blocks (using motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (the inverse operation of the prediction error coding, recovering the quantized prediction error signal in the spatial pixel domain). After applying prediction and prediction error decoding processes, the decoder sums up the prediction and prediction error signals (i.e., the pixel values) to form the output video frame. The decoder (and encoder) can also apply additional filtering processes in order to improve the quality of the output video before passing it for display and/or storing it as a prediction reference for the forthcoming frames in the video sequence.
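The reconstruction step can be sketched as follows, under the same illustrative assumptions as a floating-point orthonormal DCT with one uniform quantization step. The names reconstruct_block, qstep, and max_value are hypothetical, and the optional filtering stage is not shown.

```python
import math

def idct_1d(c):
    """Inverse of the orthonormal DCT-II (i.e., an orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n)
                 * math.cos(math.pi * (i + 0.5) * k / n) for k in range(1, n))
        out.append(s)
    return out

def idct_2d(coeffs):
    """Separable 2-D inverse DCT: invert the columns, then the rows."""
    cols = [idct_1d(list(c)) for c in zip(*coeffs)]
    return [idct_1d(list(r)) for r in zip(*cols)]

def reconstruct_block(levels, predicted, qstep, max_value=255):
    """Prediction error decoding followed by summation with the prediction."""
    coeffs = [[v * qstep for v in row] for row in levels]   # inverse quantization
    residual = idct_2d(coeffs)                              # back to the pixel domain
    return [[min(max_value, max(0, round(p + e)))           # sum and clip to pixel range
             for p, e in zip(rp, re)]
            for rp, re in zip(predicted, residual)]

# A single DC level of 4 with qstep 4 decodes to a flat residual of 4.
levels = [[4, 0, 0, 0]] + [[0, 0, 0, 0] for _ in range(3)]
predicted = [[100] * 4 for _ in range(4)]
decoded = reconstruct_block(levels, predicted, qstep=4)
```

The clipping to the range 0 to max_value is an assumption of this sketch for 8-bit pixel values.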
In typical video codecs, the motion information is indicated with motion vectors associated with each motion-compensated image block. Each of these motion vectors represents the displacement between the image block in the picture to be coded (on the encoder side) or decoded (on the decoder side) and the prediction source block in one of the previously coded or decoded pictures. In order to represent motion vectors efficiently, motion vectors are typically coded differentially with respect to block-specific predicted motion vectors. In a typical video codec, the predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks.
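Differential motion vector coding with a median predictor can be sketched as below. The sketch assumes a component-wise median over three neighboring motion vectors; which neighbors are used, and the function names, are illustrative assumptions rather than the rule of any particular standard.

```python
def median_mv(neighbors):
    """Component-wise median of the neighboring blocks' motion vectors."""
    xs = sorted(mv[0] for mv in neighbors)
    ys = sorted(mv[1] for mv in neighbors)
    mid = len(neighbors) // 2
    return (xs[mid], ys[mid])

def encode_mv(mv, neighbors):
    """Encoder side: only the difference from the predicted vector is coded."""
    px, py = median_mv(neighbors)
    return (mv[0] - px, mv[1] - py)

def decode_mv(mvd, neighbors):
    """Decoder side: form the same predictor and add the coded difference."""
    px, py = median_mv(neighbors)
    return (mvd[0] + px, mvd[1] + py)

# Hypothetical vectors of three adjacent blocks (e.g., left, above, above-right).
neighbors = [(4, -2), (6, 0), (5, 8)]
mvd = encode_mv((7, 1), neighbors)   # small difference instead of the full vector
restored = decode_mv(mvd, neighbors)
```

Because the decoder derives the predictor from motion vectors it has already decoded, no side information beyond the difference is needed.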
Typical video encoders utilize Lagrangian cost functions to find optimal coding modes, e.g., the desired macroblock mode and associated motion vectors. This kind of cost function uses a weighting factor λ to tie together the exact or estimated image distortion due to lossy coding methods and the exact or estimated amount of information that is required to represent the pixel values in an image area:
C = D + λR    (1)
In Eq. (1), C is the Lagrangian cost to be minimized, D is the image distortion (e.g., the mean squared error) with the mode and motion vectors considered, and R is the number of bits needed to represent the data required to reconstruct the image block in the decoder (including the amount of data needed to represent the candidate motion vectors).
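A mode decision based on Eq. (1) can be sketched as follows. The candidate mode names and their distortion and rate figures are made up for illustration, and the function names are hypothetical; an actual encoder would measure or estimate D and R for each candidate.

```python
def lagrangian_cost(distortion, rate_bits, lam):
    """Eq. (1): C = D + lambda * R."""
    return distortion + lam * rate_bits

def choose_mode(candidates, lam):
    """Return the name of the candidate with the lowest Lagrangian cost.

    candidates is a list of (name, distortion, rate_bits) tuples.
    """
    best = min(candidates, key=lambda c: lagrangian_cost(c[1], c[2], lam))
    return best[0]

# Illustrative (mode, D, R) triples; the values are not from any real encoder.
candidates = [
    ("skip", 900.0, 1),
    ("inter_16x16", 400.0, 30),
    ("intra_4x4", 150.0, 120),
]
balanced = choose_mode(candidates, lam=5.0)
quality_first = choose_mode(candidates, lam=0.5)
rate_first = choose_mode(candidates, lam=50.0)
```

A small λ favors the low-distortion candidate even when it is expensive in bits, while a large λ favors the cheapest candidate, which shows how the weighting factor steers the trade-off.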
Transform coding of the prediction error signal in a video or image compression system typically comprises a DCT-based linear transform, quantization of the transform coefficients, and context-based entropy coding of the quantized coefficients. However, the transform can efficiently pack the energy of the prediction error signal only under certain statistics, and the coding performance deteriorates when the prediction error to be transformed becomes less correlated. This causes suboptimal performance, especially in modern video and image coding systems employing advanced motion compensation and spatial prediction processes in order to achieve good quality predictions for the image blocks to be coded (thus minimizing and decorrelating the prediction error signal).
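The dependence of energy compaction on signal statistics can be illustrated with a small one-dimensional experiment. The two sample signals are hypothetical: a smooth ramp stands in for a highly correlated signal, and an isolated spike stands in for a poorly correlated prediction error. The helper names are likewise assumptions of this sketch.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II, which preserves total signal energy."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def energy_in_largest(x, count):
    """Fraction of the signal energy held by the `count` largest coefficients."""
    energies = sorted((c * c for c in dct_1d(x)), reverse=True)
    return sum(energies[:count]) / sum(energies)

smooth = [10, 11, 12, 13, 14, 15, 16, 17]   # neighboring samples are similar
spike = [0, 0, 0, 20, 0, 0, 0, 0]           # isolated error, little correlation

compact = energy_in_largest(smooth, 2)   # nearly all energy in two coefficients
spread = energy_in_largest(spike, 2)     # energy scattered over many coefficients
```

In this sketch the smooth signal keeps well over 95% of its energy in two coefficients, whereas the spike keeps less than half, so many more coefficients survive quantization and must be entropy coded.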
To address some of the above issues, a number of hybrid video coding schemes have been developed. These hybrid systems typically combine two redundancy reduction techniques: prediction and transformation. Prediction can take the form of inter-picture prediction, which is used to remove temporal redundancies in the signal. Intra-picture prediction may also be used, as in the H.264/Advanced Video Coding (AVC) standard, where spatial redundancies are removed by exploiting the similarities between neighboring regions within a picture frame. As a consequence of these inter-picture and intra-picture prediction techniques, a residual/error signal is formed by subtracting the predicted picture frame from the original. This prediction error signal is then typically block transform coded using an 8×8 DCT in order to reduce spatial redundancies in the signal.