This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
A video codec may comprise an encoder which transforms input video into a compressed representation suitable for storage and/or transmission and a decoder that can uncompress the compressed video representation back into a viewable form, or either one of them. The encoder may discard some information in the original video sequence in order to represent the video in a more compact form, for example at a lower bit rate.
Many hybrid video codecs, operating for example according to the International Telecommunication Union's ITU-T H.263 and H.264 coding standards, encode video information in two phases. In the first phase, pixel values in a certain picture area or “block” are predicted. These pixel values can be predicted, for example, by motion compensation mechanisms, which involve finding and indicating an area in one of the previously encoded video frames (or a later coded video frame) that corresponds closely to the block being coded. Additionally, pixel values can be predicted by spatial mechanisms which involve finding and indicating a spatial region relationship, for example by using pixel values around the block to be coded in a specified manner.
Prediction approaches using image information from a previous (or a later) image can also be called inter prediction methods, and prediction approaches using image information within the same image can also be called intra prediction methods.
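As an illustration of inter prediction by motion compensation, the following minimal sketch searches a reference frame for the area that most closely matches a block of the current frame. The exhaustive search, the sum of absolute differences (SAD) as the matching criterion, and all function names are assumptions chosen for this example; the text above does not prescribe a particular search strategy or criterion.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, top, left, size):
    """Extract a size x size block whose top-left sample is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def full_search(cur_frame, ref_frame, top, left, size, search_range):
    """Exhaustively test every displacement within +/- search_range.

    Returns (best_mv, best_sad), where best_mv is (dy, dx), the displacement
    from the current block position to the best match in ref_frame.
    """
    height, width = len(ref_frame), len(ref_frame[0])
    target = get_block(cur_frame, top, left, size)
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall partly outside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(target, get_block(ref_frame, y, x, size))
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost
```

The returned displacement plays the role of the motion vector that the encoder indicates to the decoder. Practical encoders typically use faster search patterns and sub-sample accuracy, which this sketch omits.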
The second phase is one of coding the error between the predicted block of pixels and the original block of pixels. This may be accomplished by transforming the difference in pixel values using a specified transform. This transform may be e.g. a Discrete Cosine Transform (DCT) or a variant thereof. After transforming the difference, the transformed difference may be quantized and entropy encoded.
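The second phase can be sketched as follows, assuming an orthonormal two-dimensional DCT-II and a uniform quantizer with a single step size. Both choices, along with the function names, are assumptions made for illustration; real codecs typically use integer approximations of the transform and more elaborate quantization, and the entropy coding stage is omitted here.

```python
import math


def residual(original, predicted):
    """Prediction error: original block minus predicted block, sample by sample."""
    return [[o - p for o, p in zip(row_o, row_p)]
            for row_o, row_p in zip(original, predicted)]


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of lists."""
    n = len(block)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for y in range(n):
                for x in range(n):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, step):
    """Uniform quantization: a larger step gives fewer levels and lower fidelity."""
    return [[int(round(c / step)) for c in row] for row in coeffs]
```

For a flat residual, all of the energy ends up in the single DC coefficient, which is why a transform often makes the residual cheaper to code.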
By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (in other words, the quality of the picture) and the size of the resulting encoded video representation (in other words, the file size or transmission bit rate).
The decoder reconstructs the output video by applying a prediction mechanism similar to that used by the encoder in order to form a predicted representation of the pixel blocks (using the motion or spatial information created by the encoder and stored in the compressed representation of the image) and prediction error decoding (the inverse operation of the prediction error coding to recover the quantized prediction error signal in the spatial domain).
After applying the pixel prediction and prediction error decoding processes, the decoder combines the prediction and the prediction error signals (the pixel values) to form the output video frame.
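The combination step can be sketched as a sample-wise addition of the predicted block and the decoded prediction error. The clipping to the valid sample range and the `bit_depth` parameter are assumptions added for this example and are not stated in the text above.

```python
def reconstruct(predicted, decoded_residual, bit_depth=8):
    """Add the decoded prediction error to the prediction, sample by sample.

    Results are clipped to [0, 2**bit_depth - 1]; the clipping is an
    assumption of this sketch.
    """
    max_val = (1 << bit_depth) - 1
    return [[min(max(p + r, 0), max_val) for p, r in zip(row_p, row_r)]
            for row_p, row_r in zip(predicted, decoded_residual)]
```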
The decoder (and encoder) may also apply additional filtering processes in order to improve the quality of the output video before passing it for display and/or storing as a prediction reference for the forthcoming frames in the video sequence.
In some video codecs, such as High Efficiency Video Coding Working Draft 4, video pictures may be divided into coding units (CU) covering the area of a picture. A coding unit consists of one or more prediction units (PU) defining the prediction process for the samples within the coding unit and one or more transform units (TU) defining the prediction error coding process for the samples in the coding unit. A coding unit may consist of a square block of samples with a size selectable from a predefined set of possible coding unit sizes. A coding unit with the maximum allowed size may be called a largest coding unit (LCU), and the video picture may be divided into non-overlapping largest coding units. A largest coding unit can further be split into a combination of smaller coding units, e.g. by recursively splitting the largest coding unit and the resultant coding units. Each resulting coding unit may have at least one prediction unit and at least one transform unit associated with it. Each prediction unit and transform unit can further be split into smaller prediction units and transform units in order to increase the granularity of the prediction and prediction error coding processes, respectively. Each prediction unit may have prediction information associated with it defining what kind of prediction is to be applied for the pixels within that prediction unit (e.g. motion vector information for inter predicted prediction units and intra prediction directionality information for intra predicted prediction units). Similarly, each transform unit may be associated with information describing the prediction error decoding process for samples within the transform unit (including e.g. discrete cosine transform (DCT) coefficient information). It may be signalled at coding unit level whether prediction error coding is applied or not for each coding unit.
In the case there is no prediction error residual associated with the coding unit, it can be considered there are no transform units for the coding unit. The division of the image into coding units, and division of coding units into prediction units and transform units may be signalled in the bitstream allowing the decoder to reproduce the intended structure of these units.
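The recursive splitting of a largest coding unit into smaller coding units can be sketched as a quadtree traversal. The function name, the `min_size` limit, and the `should_split` predicate are hypothetical stand-ins: in a real codec the encoder decides where to split and the decisions are signalled in the bitstream, and the division into prediction and transform units is not modelled here.

```python
def split_into_coding_units(x, y, size, min_size, should_split):
    """Recursively split a square block into four equal quadrants.

    should_split(x, y, size) is a hypothetical predicate standing in for the
    split decision. Returns the leaf coding units as (x, y, size) tuples.
    """
    if size > min_size and should_split(x, y, size):
        half = size // 2
        leaves = []
        for dy in (0, half):
            for dx in (0, half):
                leaves.extend(split_into_coding_units(
                    x + dx, y + dy, half, min_size, should_split))
        return leaves
    return [(x, y, size)]
```

Because each split replaces a block with four quadrants of half the size, the leaves never overlap and together cover the original largest coding unit.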
In some video codecs, motion information is indicated by motion vectors associated with each motion compensated image block. Each of these motion vectors represents the displacement between the image block in the picture to be coded (in the encoder) or decoded (at the decoder) and the prediction source block in one of the previously coded or decoded images (or pictures). In order to represent motion vectors efficiently, motion vectors may be coded differentially with respect to a block-specific predicted motion vector. In some video codecs, the predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks.
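Median prediction and differential coding of a motion vector can be sketched as follows. The component-wise median over an odd number of neighbouring vectors and the function names are assumptions for illustration; which adjacent blocks contribute is codec specific and is not modelled here.

```python
def median_mv(neighbours):
    """Component-wise median of an odd number of (mvx, mvy) vectors."""
    xs = sorted(mv[0] for mv in neighbours)
    ys = sorted(mv[1] for mv in neighbours)
    mid = len(neighbours) // 2
    return (xs[mid], ys[mid])


def encode_mv(mv, predictor):
    """Encoder side: only the difference to the predictor is coded."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])


def decode_mv(mvd, predictor):
    """Decoder side: add the decoded difference back to the same predictor."""
    return (mvd[0] + predictor[0], mvd[1] + predictor[1])
```

Since the encoder and the decoder derive the same predictor from already coded neighbours, only the typically small difference needs to be transmitted.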
Another way to create motion vector predictions is to generate a list or a set of candidate predictions from blocks in the current frame and/or co-located or other blocks in temporal reference pictures and to signal the chosen candidate as the motion vector prediction. A spatial motion vector prediction is a prediction obtained only on the basis of information of one or more blocks of the current frame, whereas a temporal motion vector prediction is a prediction obtained on the basis of information of one or more blocks of a frame different from the current frame. It may also be possible to obtain motion vector predictions by combining both spatial and temporal prediction information of one or more encoded blocks. These kinds of motion vector predictions are called spatio-temporal motion vector predictions.
In addition to predicting the motion vector values, the reference index in the reference picture list can be predicted. The reference index may be predicted from blocks in the current frame and/or co-located or other blocks in a temporal reference picture. Moreover, some high efficiency video codecs employ an additional motion information coding/decoding mechanism, often called merging/merge mode, where all the motion field information, which includes a motion vector and a corresponding reference picture index for each available reference picture list, may be predicted and used without any modification or correction. Similarly, the motion field information may be predicted using the motion field information of blocks in the current frame and/or co-located or other blocks in temporal reference pictures, and the motion field information that is used is signalled as a selection from a motion field candidate list filled with the motion field information of available blocks in the current frame and/or co-located or other blocks in temporal reference pictures.
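The candidate list mechanism can be sketched as follows, with each candidate represented as a (motion vector, reference index) pair. The ordering (spatial before temporal), the removal of duplicates, the `max_candidates` limit, and the function names are assumptions made for this example; actual codecs define their own derivation rules.

```python
def build_candidate_list(spatial, temporal, max_candidates):
    """Collect available, non-duplicate candidates, spatial ones first.

    Each candidate is a (motion_vector, reference_index) tuple, or None when
    the corresponding block is unavailable.
    """
    candidates = []
    for cand in list(spatial) + list(temporal):
        if cand is None:
            continue
        if cand not in candidates:
            candidates.append(cand)
        if len(candidates) == max_candidates:
            break
    return candidates


def merge_index(candidates, motion):
    """Encoder side: index to signal if motion matches a candidate, else None."""
    return candidates.index(motion) if motion in candidates else None


def merged_motion(candidates, index):
    """Decoder side: reuse the signalled candidate without any correction."""
    return candidates[index]
```

The decoder builds the same list from the same already decoded blocks, so signalling the index alone is enough to recover both the motion vector and the reference index.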
In some video codecs, the prediction residual after motion compensation is first transformed with a transform kernel (like DCT) and then coded. The reason for this is that some correlation often still exists within the residual, and the transform can in many cases help reduce this correlation and provide more efficient coding.
Some video encoders utilize Lagrangian cost functions to find optimal coding modes, e.g. the desired macroblock mode and associated motion vectors. This kind of cost function uses a weighting factor λ (lambda) to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area:
C = D + λR    (1)
where C is the Lagrangian cost to be minimized, D is the image distortion (e.g. Mean Squared Error) with the mode and motion vectors considered, and R is the number of bits needed to represent the required data to reconstruct the image block in the decoder (including the amount of data to represent the candidate motion vectors).
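Mode selection with the cost function (1) can be sketched as follows. The candidate tuples, the mode names, and the numbers in the usage note are hypothetical; in practice D and R would be measured or estimated by the encoder for each candidate mode.

```python
def lagrangian_cost(distortion, rate_bits, lam):
    """C = D + lambda * R, as in equation (1)."""
    return distortion + lam * rate_bits


def select_mode(candidates, lam):
    """Return the (name, distortion, rate_bits) candidate with the lowest cost."""
    return min(candidates, key=lambda c: lagrangian_cost(c[1], c[2], lam))
```

For example, with hypothetical candidates ("skip", 400.0, 1), ("inter_16x16", 120.0, 40) and ("intra_4x4", 60.0, 180), λ = 1 gives costs of 401, 160 and 240, so the inter mode is selected. A larger λ penalizes bits more heavily and shifts the choice toward cheaper modes, while a smaller λ favours lower distortion.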