A video codec comprises an encoder, which transforms input video into a compressed representation suitable for storage and/or transmission, and a decoder that can uncompress the compressed video representation back into a viewable form. Typically, the encoder discards some information in the original video sequence in order to represent the video in a more compact form, for example at a lower bit rate.
Typical video codecs, operating for example according to the International Telecommunication Union's ITU-T H.263 and H.264 coding standards, encode video information in two phases. In the first phase, pixel values in a certain picture area or “block” are predicted. These pixel values can be predicted, for example, by motion compensation mechanisms, which involve finding and indicating an area in one of the previously encoded video frames (or a later coded video frame) that corresponds closely to the block being coded. Additionally, pixel values can be predicted by spatial mechanisms which involve finding and indicating a spatial region relationship.
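The motion compensation search described above can be illustrated with a minimal sketch. It assumes frames stored as lists of rows of integer sample values, an exhaustive search over a small window, and the sum of absolute differences (SAD) as the matching criterion. The names `sad` and `full_search` are hypothetical and are not taken from any standard; practical encoders typically use faster search strategies.

```python
def sad(frame_a, ax, ay, frame_b, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(frame_a[ay + dy][ax + dx] - frame_b[by + dy][bx + dx])
    return total


def full_search(current, reference, x, y, size, search_range):
    """Return (dx, dy, sad) of the best match for the block at (x, y).

    Every displacement within +/- search_range is tried, and candidates
    that would fall outside the reference frame are skipped.
    """
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = x + dx, y + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(current, x, y, reference, rx, ry, size)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


# Hypothetical 8x8 frames: the current frame is the reference frame
# shifted by 2 samples horizontally and 1 sample vertically.
reference = [[row * 8 + col for col in range(8)] for row in range(8)]
current = [[reference[min(row + 1, 7)][min(col + 2, 7)] for col in range(8)]
           for row in range(8)]
motion = full_search(current, reference, x=2, y=2, size=2, search_range=2)
```

In this example the search returns the displacement (2, 1) with a SAD of zero, since the block is found unchanged in the reference frame at that offset.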
The second phase is one of coding the error between the predicted block of pixels and the original block of pixels. This is typically accomplished by transforming the difference in pixel values using a specified transform. This transform is typically a Discrete Cosine Transform (DCT) or a variant thereof. After transforming the difference, the transformed difference is quantized and entropy encoded.
By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (in other words, the quality of the picture) and the size of the resulting encoded video representation (in other words, the file size or transmission bit rate). An example of the encoding process is depicted in FIG. 1.
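The transform and quantization steps, and the effect of the quantization fidelity, can be sketched as follows. This is an illustration only: it assumes a one-dimensional orthonormal DCT-II over a single row of residual samples and a uniform quantizer with a hypothetical `step` parameter, and it omits entropy coding. Real codecs use two-dimensional, usually integer, transforms.

```python
import math


def dct(values):
    """Orthonormal DCT-II of a list of samples."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, v in enumerate(values)))
    return out


def idct(coeffs):
    """Inverse of dct() (an orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def code_residual(original, predicted, step):
    """Transform and quantize the prediction error.

    Returns the integer levels that would be entropy coded and the
    residual the decoder would recover from them.
    """
    residual = [o - p for o, p in zip(original, predicted)]
    levels = [round(c / step) for c in dct(residual)]
    recovered = idct([level * step for level in levels])
    return levels, recovered


def mse(a, b):
    """Mean-squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


original = [52, 55, 61, 66, 70, 61, 64, 73]
predicted = [50] * 8
residual = [o - p for o, p in zip(original, predicted)]
fine_levels, fine_rec = code_residual(original, predicted, step=1.0)
coarse_levels, coarse_rec = code_residual(original, predicted, step=16.0)
```

A larger step leaves fewer non-zero levels to code, at the cost of a larger reconstruction error. Because the transform in this sketch is orthonormal, the mean-squared error cannot exceed `step ** 2 / 4`.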
The decoder reconstructs the output video by applying a prediction mechanism similar to that used by the encoder in order to form a predicted representation of the pixel blocks (using the motion or spatial information created by the encoder and stored in the compressed representation of the image) and prediction error decoding (the inverse operation of the prediction error coding to recover the quantized prediction error signal in the spatial domain).
After applying the pixel prediction and prediction error decoding processes, the decoder combines the prediction and the prediction error signals (the pixel values) to form the output video frame.
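The combining step can be sketched in a few lines. The example assumes blocks stored as lists of rows and assumes that the sum is clipped to the valid sample range for a given bit depth, which is common practice but is not stated above. The name `reconstruct_block` is hypothetical.

```python
def reconstruct_block(prediction, residual, bit_depth=8):
    """Add the decoded prediction error to the prediction, sample by sample.

    Clipping to [0, 2**bit_depth - 1] is an assumption of this sketch.
    """
    max_value = (1 << bit_depth) - 1
    return [
        [min(max(int(round(p + r)), 0), max_value) for p, r in zip(p_row, r_row)]
        for p_row, r_row in zip(prediction, residual)
    ]


prediction = [[250, 10], [128, 64]]
residual = [[10, -20], [3, -4]]
block = reconstruct_block(prediction, residual)  # [[255, 0], [131, 60]]
```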
The decoder (and encoder) may also apply additional filtering processes in order to improve the quality of the output video before passing it for display and/or storing as a prediction reference for the forthcoming frames in the video sequence. An example of the decoding process is depicted in FIG. 2.
In typical video codecs, the motion information is indicated by motion vectors associated with each motion compensated image block. Each of these motion vectors represents the displacement between the image block in the picture to be coded (in the encoder) or decoded (at the decoder) and the prediction source block in one of the previously coded or decoded images (or pictures). In order to represent motion vectors efficiently, motion vectors are typically coded differentially with respect to a block-specific predicted motion vector. In a typical video codec, the predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks.
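Differential motion vector coding with a median predictor can be sketched as follows. The sketch assumes motion vectors stored as (x, y) integer pairs, a component-wise median, and an odd number of neighbouring vectors. Which neighbours are used and how unavailable neighbours are handled differ between standards and are not modelled here; the function names are hypothetical.

```python
def median_predictor(neighbors):
    """Component-wise median of the neighbouring motion vectors."""
    xs = sorted(mv[0] for mv in neighbors)
    ys = sorted(mv[1] for mv in neighbors)
    mid = len(neighbors) // 2
    return (xs[mid], ys[mid])


def encode_mv(mv, neighbors):
    """Return the motion vector difference that is actually coded."""
    px, py = median_predictor(neighbors)
    return (mv[0] - px, mv[1] - py)


def decode_mv(mvd, neighbors):
    """Recover the motion vector from the coded difference."""
    px, py = median_predictor(neighbors)
    return (mvd[0] + px, mvd[1] + py)


neighbors = [(4, -2), (6, 0), (5, 3)]   # e.g. left, above and above-right blocks
mvd = encode_mv((6, 1), neighbors)      # predictor is (5, 0), so mvd is (1, 1)
```

Because the decoder has already decoded the same neighbouring vectors, it forms the same predictor and recovers the original vector exactly. The coded difference is usually small when motion is locally consistent.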
In typical video codecs, the prediction residual after motion compensation is first transformed with a transform kernel (such as the DCT) and then coded. The reason for this is that some correlation often still exists within the residual, and the transform can in many cases help reduce this correlation and provide more efficient coding.
Typical video encoders utilize a Lagrangian cost function to find optimal coding modes, for example the desired macroblock mode and associated motion vectors. This type of cost function uses a weighting factor λ to tie together the exact or estimated image distortion due to lossy coding methods and the exact or estimated amount of information required to represent the pixel values in an image area.
This may be represented by the equation:
C = D + λR  (1)
where C is the Lagrangian cost to be minimised, D is the image distortion (for example, the mean-squared error) with the mode and motion vectors currently considered, and R is the number of bits needed to represent the data required to reconstruct the image block in the decoder (including the amount of data needed to represent the candidate motion vectors).
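A mode decision based on equation (1) can be sketched as follows. The candidate modes and their distortion and rate figures are invented for illustration, and the names `lagrangian_cost` and `choose_mode` are hypothetical.

```python
def lagrangian_cost(distortion, rate_bits, lam):
    """C = D + lambda * R, as in equation (1)."""
    return distortion + lam * rate_bits


def choose_mode(candidates, lam):
    """Return the candidate with the smallest Lagrangian cost."""
    return min(
        candidates,
        key=lambda c: lagrangian_cost(c["distortion"], c["rate_bits"], lam),
    )


# Invented figures for one block: distortion as a squared-error sum, rate in bits.
candidates = [
    {"mode": "skip", "distortion": 900, "rate_bits": 1},
    {"mode": "inter", "distortion": 200, "rate_bits": 40},
    {"mode": "intra", "distortion": 50, "rate_bits": 120},
]

low = choose_mode(candidates, lam=1)["mode"]     # "intra": costs 901, 240, 170
mid = choose_mode(candidates, lam=10)["mode"]    # "inter": costs 910, 600, 1250
high = choose_mode(candidates, lam=100)["mode"]  # "skip": costs 1000, 4200, 12050
```

A small λ favours low distortion, while a large λ favours a low bit count, so the same candidates lead to different decisions as λ changes.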
When source symbols are coded using code words which may have different lengths, the source symbols are translated to unique code words. This kind of coding can be called variable length coding (VLC). The coding may be designed so that more probable symbols are represented with shorter code words and less probable symbols are represented with longer code words. Shorter code words can be represented with fewer bits than longer code words when the code words are transmitted. One aim of variable length coding is to reduce the amount of information needed to represent the symbols compared to the situation in which the symbols are encoded as such. In other words, when a set of symbols is translated to code words, the resulting coded representation should contain fewer bits than the source. The set of symbols may include many kinds of information. For example, a set of symbols can be a file consisting of bytes, an information stream such as a video stream or an audio stream, an image, etc.
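A minimal variable length coding example follows. It assumes a hypothetical four-symbol alphabet and a hand-picked prefix-free table (no code word is the start of another), which is what allows the decoder to split the bit string without separators. Bits are represented as a string of "0" and "1" characters for readability.

```python
# Hypothetical table: the most probable symbol gets the shortest code word.
CODE_TABLE = {"a": "0", "b": "10", "c": "110", "d": "111"}


def vlc_encode(symbols, table):
    """Concatenate the code words of the symbols."""
    return "".join(table[s] for s in symbols)


def vlc_decode(bits, table):
    """Split a bit string back into symbols using a prefix-free table."""
    inverse = {code: symbol for symbol, code in table.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code word")
    return symbols


message = list("aaaabacd")
bits = vlc_encode(message, CODE_TABLE)   # "0000100110111", 13 bits
fixed_length_bits = 2 * len(message)     # 16 bits with 2 bits per symbol
```

Here the message, in which "a" dominates, takes 13 bits instead of the 16 bits needed with a fixed two-bit code.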
The design of variable length code words can depend on the probability statistics of the source that the source symbols represent. To obtain a set of code words for variable length coding, probability statistics can be gathered from some representative source material, and the code words can be designed around those statistics. This may work quite well, but in many cases the statistics are not stationary and may vary over time, so a fixed set of code words may not produce good compression. To achieve better compression, the set of variable length code words could be constantly adapted locally to the observed statistics of the source.
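One way to design code words around gathered statistics is Huffman's algorithm. The text above does not name a particular design method, so its use here is an assumption made for illustration. The sketch builds a table from hypothetical training material and then shows the cost of applying that fixed table to a source with different statistics.

```python
import heapq
from collections import Counter


def build_huffman_table(training_symbols):
    """Build a prefix-free code table from symbol counts (Huffman's algorithm)."""
    counts = Counter(training_symbols)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries are (count, tie_breaker, {symbol: partial code word}).
    # The unique tie_breaker ensures the dictionaries are never compared.
    heap = [(count, index, {symbol: ""})
            for index, (symbol, count) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        count0, _, group0 = heapq.heappop(heap)
        count1, _, group1 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in group0.items()}
        merged.update({s: "1" + code for s, code in group1.items()})
        heapq.heappush(heap, (count0 + count1, next_id, merged))
        next_id += 1
    return heap[0][2]


def coded_length(symbols, table):
    """Number of bits needed to code the symbols with the given table."""
    return sum(len(table[s]) for s in symbols)


training = "aaaaaaaabbbbccd"          # hypothetical representative material
table = build_huffman_table(training)
matched = coded_length(training, table)             # 25 bits
mismatched = coded_length("ddddddddccccbba", table)  # 41 bits
```

The table codes material that matches the training statistics in 25 bits, but a source of the same length with the symbol probabilities reversed costs 41 bits, more than the 30 bits of a fixed two-bit code. This is the weakness of a fixed table when the statistics are not stationary.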
One way of performing adaptation is to keep track of symbol frequencies and use the frequencies to define the set of variable length code words on the fly as the symbols are coded. This kind of full adaptation is quite a complex operation, especially if the range of source symbols is large. In practical implementations, some form of suboptimal adaptation may be performed. For example, the encoder could use a number of predefined sets of variable length code words and select one of them based on an estimate of local statistics. In another implementation, the coder could gradually adapt the code words of the set so that only a few of the individual code words are changed at a time, which keeps the complexity per coded code word low.
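The gradual adaptation mentioned last can be sketched as follows. This is one possible low-complexity scheme, chosen here as an assumption rather than taken from the text: symbols are kept in a ranked list, the code word depends only on the rank, and after each coded symbol that symbol swaps places with its neighbour one rank above. The unary-style code words and all names are hypothetical. The alternative of selecting among predefined tables is not shown.

```python
def rank_code_word(rank):
    """Unary-style code word: rank 0 -> '1', rank 1 -> '01', rank 2 -> '001'."""
    return "0" * rank + "1"


class SwapAdaptiveCoder:
    """Keeps a symbol ranking that changes by at most one swap per coded symbol."""

    def __init__(self, alphabet):
        self.order = list(alphabet)

    def _promote(self, rank):
        # Move the just-coded symbol one rank up; only two code word
        # assignments change, so the update cost per symbol is constant.
        if rank > 0:
            self.order[rank - 1], self.order[rank] = (
                self.order[rank], self.order[rank - 1])

    def encode(self, symbol):
        rank = self.order.index(symbol)
        self._promote(rank)
        return rank_code_word(rank)

    def decode_rank(self, rank):
        symbol = self.order[rank]
        self._promote(rank)
        return symbol


def encode_sequence(symbols, alphabet):
    coder = SwapAdaptiveCoder(alphabet)
    return "".join(coder.encode(s) for s in symbols)


def decode_sequence(bits, alphabet):
    coder = SwapAdaptiveCoder(alphabet)
    symbols, zeros = [], 0
    for bit in bits:
        if bit == "0":
            zeros += 1
        else:
            # A '1' terminates the code word; the zero count is the rank.
            symbols.append(coder.decode_rank(zeros))
            zeros = 0
    return symbols


bits = encode_sequence("dddddddd", "abcd")   # "0001" "001" "01" then "1" five times
```

With the initial order a, b, c, d, a run of eight "d" symbols costs 4 + 3 + 2 + 5 = 14 bits, whereas leaving the ranking fixed would cost 32 bits. The decoder applies the same swaps after each decoded symbol, so it stays synchronised without any side information.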