A video codec may comprise an encoder which transforms input video into a compressed representation suitable for storage and/or transmission and a decoder that can uncompress the compressed video representation back into a viewable form, or either one of them. The encoder may discard some information in the original video sequence in order to represent the video in a more compact form, for example at a lower bit rate.
Many hybrid video codecs, operating for example according to the International Telecommunication Union's ITU-T H.263 and H.264 coding standards, encode video information in two phases. In the first phase, pixel values in a certain picture area or “block” are predicted. These pixel values can be predicted, for example, by motion compensation mechanisms, which involve finding and indicating an area in one of the previously encoded video frames (or a later coded video frame) that corresponds closely to the block being coded. Additionally, pixel values can be predicted by spatial mechanisms which involve finding and indicating a spatial region relationship, for example by using pixel values around the block to be coded in a specified manner.
Prediction approaches using image information from a previous (or a later) image can also be called Inter prediction methods, and prediction approaches using image information within the same image can also be called Intra prediction methods.
The second phase is one of coding the error between the predicted block of pixels and the original block of pixels. This is typically accomplished by transforming the difference in pixel values using a specified transform. This transform may be e.g. a Discrete Cosine Transform (DCT) or a variant thereof. After transforming the difference, the transformed difference may be quantized and entropy encoded.
By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (in other words, the quality of the picture) and the size of the resulting encoded video representation (in other words, the file size or transmission bit rate).
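The second phase and the effect of the quantization fidelity can be sketched as follows. This is a minimal illustration, not the method of any particular standard: it assumes a one-dimensional orthonormal DCT, a uniform quantizer with a single step size, and hypothetical function names (dct, idct, code_residual).

```python
import math

def dct(x):
    """Orthonormal 1-D DCT-II of a list of samples (illustrative transform)."""
    n = len(x)
    return [
        (math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n))
        * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]

def idct(c):
    """Inverse of dct() above."""
    n = len(c)
    return [
        sum(
            (math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n))
            * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(n)
        )
        for i in range(n)
    ]

def code_residual(original, predicted, step):
    """Transform and quantize the prediction error, then reconstruct the block."""
    residual = [o - p for o, p in zip(original, predicted)]
    # Quantization: a larger step gives smaller levels (fewer bits) but more error.
    levels = [round(c / step) for c in dct(residual)]
    # Decoder side: dequantize, inverse transform, add back the prediction.
    decoded_residual = idct([level * step for level in levels])
    reconstructed = [p + r for p, r in zip(predicted, decoded_residual)]
    return levels, reconstructed

original = [52, 55, 61, 66]
predicted = [50, 54, 60, 60]
fine_levels, fine_rec = code_residual(original, predicted, step=1)
coarse_levels, coarse_rec = code_residual(original, predicted, step=8)
```

With the coarser step, no coefficient that quantized to zero at the fine step becomes non-zero, so the representation can only become more compact, while the reconstruction error is allowed to grow.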
An example of the encoding process is illustrated in FIG. 1.
The decoder reconstructs the output video by applying a prediction mechanism similar to that used by the encoder in order to form a predicted representation of the pixel blocks (using the motion or spatial information created by the encoder and stored in the compressed representation of the image) and prediction error decoding (the inverse operation of the prediction error coding to recover the quantized prediction error signal in the spatial domain).
After applying pixel prediction and error decoding processes the decoder combines the prediction and the prediction error signals (the pixel values) to form the output video frame.
The decoder (and encoder) may also apply additional filtering processes in order to improve the quality of the output video before passing it for display and/or storing as a prediction reference for the forthcoming frames in the video sequence.
An example of the decoding process is illustrated in FIG. 2.
Motion Compensated Prediction (MCP) is a technique used by video compression standards to reduce the size of an encoded bitstream. In MCP, a prediction for a current frame is formed using one or more previously coded frames, and only the difference between the original and prediction signals, representative of the current and predicted frames, is encoded and sent to a decoder. A prediction signal, representative of a prediction frame, is formed by first dividing the current frame into blocks, e.g., macroblocks, and searching for a best match in a reference frame for each block. In this way, the motion of a block relative to the reference frame is determined, and this motion information is coded into a bitstream as motion vectors. A decoder is able to reconstruct the exact prediction frame by decoding the motion vector data encoded in the bitstream.
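The search for a best match can be sketched as an exhaustive block-matching search. The sketch below assumes full-pixel accuracy, the sum of absolute differences (SAD) as the matching criterion, and hypothetical function names (sad, full_search); practical encoders typically use faster search strategies and other cost measures.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between the current block and a displaced reference block."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(size)
        for x in range(size)
    )

def full_search(cur, ref, bx, by, size, search_range):
    """Return ((dx, dy), cost) of the best full-pixel match inside the search window."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates that would read outside the reference frame.
            if bx + dx < 0 or by + dy < 0:
                continue
            if bx + dx + size > width or by + dy + size > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost

# Synthetic example: the current frame is the reference frame with its content
# moved 2 pixels left and 1 pixel up (edge samples repeated).
ref = [[16 * y + x for x in range(16)] for y in range(16)]
cur = [[ref[min(15, y + 1)][min(15, x + 2)] for x in range(16)] for y in range(16)]
mv, cost = full_search(cur, ref, bx=4, by=4, size=4, search_range=3)
```

Here the motion vector points from the block position in the current frame to the matching block in the reference frame, so the decoder can fetch the same prediction block from the vector alone.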
An example of a prediction structure is presented in FIG. 8. Boxes indicate pictures, capital letters within boxes indicate coding types, numbers within boxes are picture numbers (in decoding order), and arrows indicate prediction dependencies. In this example I-pictures are intra pictures which do not use any reference pictures and thus can be decoded irrespective of the decoding of other pictures. P-pictures are so called uni-predicted pictures i.e. they refer to one reference picture, and B-pictures are bi-predicted pictures which use two other pictures as reference pictures, or two prediction blocks within one reference picture. In other words, the reference blocks relating to the B-picture may be in the same reference picture (as illustrated with the two arrows from picture P7 to picture B8 in FIG. 8) or in two different reference pictures (as illustrated e.g. with the arrows from picture P2 and from picture B3 to picture B4 in FIG. 8).
It should also be noted here that one picture may include different types of blocks, i.e. blocks of a picture may be intra-blocks, uni-predicted blocks, and/or bi-predicted blocks. Motion vectors often relate to blocks, so a plurality of motion vectors may exist for one picture.
In some systems the uni-predicted pictures are also called uni-directionally predicted pictures and the bi-predicted pictures are called bi-directionally predicted pictures.
The motion vectors are not limited to having full-pixel accuracy, but could have fractional-pixel accuracy as well. That is, motion vectors can point to fractional-pixel positions/locations of the reference frame, where the fractional-pixel locations can refer to, for example, locations “in between” image pixels. In order to obtain samples at fractional-pixel locations, interpolation filters may be used in the MCP process. Conventional video coding standards describe how a decoder can obtain samples at fractional-pixel accuracy by defining an interpolation filter. In MPEG-2, for example, motion vectors can have at most half-pixel accuracy, where the samples at half-pixel locations are obtained by a simple averaging of neighboring samples at full-pixel locations. The H.264/AVC video coding standard supports motion vectors with up to quarter-pixel accuracy. Furthermore, in the H.264/AVC video coding standard, half-pixel samples are obtained through the use of symmetric and separable 6-tap filters, while quarter-pixel samples are obtained by averaging the nearest half-pixel or full-pixel samples.
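The H.264/AVC style interpolation described above can be sketched in one dimension. The tap values (1, -5, 20, 20, -5, 1), the rounding offset of 16 and the right shift by 5 follow the H.264/AVC luma half-sample filter; the function names, the restriction to a single row of 8-bit samples, and the edge handling by index clamping are simplifying assumptions of this sketch.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # symmetric 6-tap filter, coefficients sum to 32

def clip8(value):
    """Clip to the 8-bit sample range."""
    return max(0, min(255, value))

def half_pel(row, i):
    """Half-pixel sample between row[i] and row[i + 1]."""
    last = len(row) - 1
    acc = sum(
        tap * row[max(0, min(last, i - 2 + k))]  # clamp indices at the row edges
        for k, tap in enumerate(TAPS)
    )
    return clip8((acc + 16) >> 5)  # add rounding offset, then divide by 32

def quarter_pel(row, i):
    """Quarter-pixel sample between row[i] and the half-pixel sample to its right."""
    return (row[i] + half_pel(row, i) + 1) >> 1  # average with upward rounding

ramp = [0, 10, 20, 30, 40, 50]
h = half_pel(ramp, 2)     # 25, halfway between the samples 20 and 30
q = quarter_pel(ramp, 2)  # 23, from (20 + 25 + 1) >> 1
```

Because the filter is separable, a two-dimensional implementation can apply the same taps first along rows and then along columns.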
In typical video codecs, the motion information is indicated by motion vectors associated with each motion compensated image block. Each of these motion vectors represents the displacement between the image block in the picture to be coded (in the encoder) or decoded (at the decoder) and the prediction source block in one of the previously coded or decoded images (or pictures). In order to represent motion vectors efficiently, motion vectors are typically coded differentially with respect to a block-specific predicted motion vector. In a typical video codec, the predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks.
In typical video codecs the prediction residual after motion compensation is first transformed with a transform kernel (like DCT) and then coded. The reason for this is that some correlation often still exists within the residual, and the transform can in many cases help reduce this correlation and provide more efficient coding.
Typical video encoders utilize the Lagrangian cost function to find optimal coding modes, for example the desired macroblock mode and associated motion vectors. This type of cost function uses a weighting factor λ to tie together the exact or estimated image distortion due to lossy coding methods and the exact or estimated amount of information required to represent the pixel values in an image area.
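Differential motion vector coding with a median predictor can be sketched as follows. The choice of three neighboring blocks and the function names (median_mv, encode_mvd, decode_mv) are assumptions made for illustration; the component-wise median is one example of a predefined prediction rule.

```python
def median_mv(a, b, c):
    """Component-wise median of three neighboring motion vectors (x, y)."""
    return (
        sorted((a[0], b[0], c[0]))[1],
        sorted((a[1], b[1], c[1]))[1],
    )

def encode_mvd(mv, neighbors):
    """Encoder side: only the difference from the predicted vector is coded."""
    pred = median_mv(*neighbors)
    return (mv[0] - pred[0], mv[1] - pred[1])

def decode_mv(mvd, neighbors):
    """Decoder side: form the same prediction and add the decoded difference."""
    pred = median_mv(*neighbors)
    return (mvd[0] + pred[0], mvd[1] + pred[1])

neighbors = [(4, -2), (6, 0), (5, 3)]  # vectors of adjacent, already coded blocks
mvd = encode_mvd((7, 1), neighbors)    # (2, 1), relative to the prediction (5, 0)
mv = decode_mv(mvd, neighbors)         # (7, 1)
```

Since neighboring blocks often move similarly, the differences tend to be small and therefore cheaper to entropy code than the vectors themselves.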
This may be represented by the equation:

C = D + λR  (1)
where C is the Lagrangian cost to be minimised, D is the image distortion (for example, the mean-squared error between the pixel values in original image block and in coded image block) with the mode and motion vectors currently considered, λ is a Lagrangian coefficient and R is the number of bits needed to represent the required data to reconstruct the image block in the decoder (including the amount of data to represent the candidate motion vectors).
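A mode decision based on equation (1) can be sketched as follows. The candidate modes and their distortion and rate figures are made-up numbers for illustration only, and the function name choose_mode is hypothetical.

```python
def choose_mode(candidates, lam):
    """Return the name of the candidate minimising C = D + lam * R."""
    best_name, best_cost = None, None
    for name, distortion, rate_bits in candidates:
        cost = distortion + lam * rate_bits
        if best_cost is None or cost < best_cost:
            best_name, best_cost = name, cost
    return best_name

# (mode name, distortion D, rate R in bits); illustrative values only
candidates = [
    ("intra", 100, 40),
    ("inter", 150, 10),
    ("skip", 400, 1),
]

low_lambda_choice = choose_mode(candidates, lam=1)     # "intra": costs 140, 160, 401
mid_lambda_choice = choose_mode(candidates, lam=5)     # "inter": costs 300, 200, 405
high_lambda_choice = choose_mode(candidates, lam=100)  # "skip": costs 4100, 1150, 500
```

A small λ favours low distortion, while a large λ favours candidates that need few bits, which is how the coefficient ties the two quantities together.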
Some hybrid video codecs, such as H.264/AVC, utilize bi-directional motion compensated prediction to improve the coding efficiency. In bi-directional prediction, the prediction signal of the block may be formed by combining, for example by averaging, two motion compensated prediction blocks. This averaging operation may further include either up or down rounding, which may introduce rounding errors.
The accumulation of rounding errors in bi-directional prediction may cause degradation in coding efficiency. This rounding error accumulation may be removed or decreased by signalling, for each frame, whether rounding up or rounding down has been used when the two prediction signals have been combined. Alternatively, the rounding error could be controlled by alternating the usage of rounding up and rounding down from frame to frame. For example, rounding up may be used for every other frame and, correspondingly, rounding down may be used for the remaining frames.
FIG. 9 illustrates an example of averaging two motion compensated prediction blocks using rounding. Sample values of the first prediction reference are input 902 to a first filter 904, in which values of two or more full pixels near the point to which the motion vector refers are used in the filtering. A rounding offset may be added 906 to the filtered value. The filtered value with the rounding offset added is right shifted 908 by x bits, i.e. divided by 2^x, to obtain a first prediction signal P1. A similar operation is performed on the second prediction reference, as illustrated with blocks 912, 914, 916 and 918, to obtain a second prediction signal P2. The first prediction signal P1 and the second prediction signal P2 are combined, e.g. by summing the prediction signals P1, P2. A rounding offset may be added 920 to the combined signal, after which the result is right shifted by y bits, i.e. divided by 2^y. The rounding may be upwards, if the rounding offset is positive, or downwards, if the rounding offset is negative. The direction of the rounding may always be the same, or it may change from time to time, e.g. for each frame. The direction of the rounding may be signaled in the bitstream so that the same rounding direction can be used in the decoding process.
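The final combining stage of FIG. 9 can be sketched as follows. The sketch assumes that P1 and P2 have already been filtered and shifted, that y equals 1, and that rounding down is realised with a zero offset (truncation) rather than a negative offset; the function names and the even/odd frame rule are hypothetical.

```python
def shift_with_rounding(value, shift, round_up):
    """Right shift by `shift` bits with either upward rounding or truncation."""
    offset = (1 << (shift - 1)) if round_up else 0  # assumption: zero offset rounds down
    return (value + offset) >> shift

def bi_predict(p1, p2, round_up):
    """Combine two prediction signals by summing and dividing by 2 (y = 1)."""
    return [shift_with_rounding(a + b, 1, round_up) for a, b in zip(p1, p2)]

def round_up_for_frame(frame_index):
    """Hypothetical alternation rule: round up on even frames, down on odd frames."""
    return frame_index % 2 == 0

p1 = [100, 50, 7]
p2 = [101, 50, 8]
rounded_up = bi_predict(p1, p2, round_up=True)     # [101, 50, 8]
rounded_down = bi_predict(p1, p2, round_up=False)  # [100, 50, 7]
```

The two results differ only where the sum of the two prediction samples is odd, which is where a fixed rounding direction would introduce a systematic bias.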
However, these methods somewhat increase the complexity, as two separate code branches need to be written for bi-directional averaging. In addition, the motion estimation routines in the encoder may need to be duplicated to cover both cases, rounding and truncation.