This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
A digital video signal comprises a sequence of still images (also referred to as “pictures” or “frames”) in an uncompressed digital format. Each video frame is formed from an array of pixels. For example, in a digital image format known as the Quarter Common Intermediate Format (QCIF), an image or frame comprises 25,344 pixels arranged in an array of 176×144 pixels. The goal of video encoding (coding or compression) is to reduce the amount of data needed to represent a video signal. In general, there is a significant degree of correlation between neighboring pixel values within an image of a sequence of images. This correlation is referred to as spatial redundancy; in practical terms, it means that the value of any pixel within an image is substantially the same as the value of other pixels in its immediate vicinity. Additionally, consecutive images of an image sequence tend to be quite similar, so the overall change between one image and the next is rather small. This means that there is also considerable temporal redundancy within a typical digital image sequence. A video encoder transforms an input video into a compressed representation suitable for storage and/or transmission, and a video decoder decompresses the compressed representation back into a viewable form.
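As a rough illustration of why compression is needed, the following sketch computes the raw data rate of an uncompressed QCIF sequence. The figures of 12 bits per pixel (8-bit samples with 4:2:0 chroma subsampling) and 30 frames per second are assumptions chosen for the example and are not taken from the description above; the function name is likewise hypothetical.

```python
def raw_bits_per_second(width, height, bits_per_pixel, frames_per_second):
    """Return the data rate of uncompressed video in bits per second."""
    return width * height * bits_per_pixel * frames_per_second


# QCIF frame dimensions as given in the text.
QCIF_WIDTH, QCIF_HEIGHT = 176, 144
pixels_per_frame = QCIF_WIDTH * QCIF_HEIGHT

# Assumed values (not from the text): 12 bits per pixel, 30 frames per second.
qcif_rate = raw_bits_per_second(QCIF_WIDTH, QCIF_HEIGHT, 12, 30)

print(pixels_per_frame)  # pixels in one QCIF frame
print(qcif_rate)         # raw bits per second under the assumptions above
```

Under these assumptions even the small QCIF format needs roughly 9 Mbit/s uncompressed, which is the amount of data that exploiting spatial and temporal redundancy aims to reduce.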
State-of-the-art video coding systems reduce the amount of data used to represent the video signal by exploiting spatial and temporal redundancies within the sequence of images. Such “hybrid” video coding methods, used for example in ITU-T H.263 and H.264, encode the video information in two phases. First, pixel values in a certain picture area or “block” are predicted using, for example, motion compensation mechanisms or spatial mechanisms. Motion compensation mechanisms may include, for example, finding and indicating an area in a previously coded video frame that corresponds closely to the block being coded. Spatial mechanisms may include, for example, using the pixel values around the block to be coded in a specified manner. Second, the prediction error, i.e. the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically accomplished by transforming the difference in pixel values using a specified transform such as a Discrete Cosine Transform (DCT) or a variant thereof, quantizing the resulting coefficients, and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (i.e., the picture quality) and the size of the resulting coded video representation (e.g., the file size or transmission bitrate).
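The second phase described above (forming the prediction error, transforming it, and quantizing the coefficients) can be sketched as follows. This is a minimal illustration only: it assumes a square block, a floating-point orthonormal DCT-II, and a single uniform quantization step `qstep`. Real codecs such as H.264 use integer transform variants and more elaborate quantizers, and entropy coding is omitted. The function names are hypothetical.

```python
import math


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = scale(u) * scale(v) * s
    return out


def encode_residual(original, predicted, qstep):
    """Return quantized transform levels of the prediction error."""
    n = len(original)
    # Prediction error: original block minus predicted block.
    residual = [[original[i][j] - predicted[i][j] for j in range(n)]
                for i in range(n)]
    coeffs = dct_2d(residual)
    # Uniform quantization: a larger qstep gives coarser (smaller) levels.
    return [[round(c / qstep) for c in row] for row in coeffs]


# A flat 4x4 block predicted slightly too dark: only the DC level is non-zero.
original = [[104] * 4 for _ in range(4)]
predicted = [[100] * 4 for _ in range(4)]
levels = encode_residual(original, predicted, qstep=4)
```

Increasing `qstep` drives more levels to zero, which lowers the bitrate at the cost of picture quality; this is the trade-off the encoder controls through the quantization fidelity.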
The decoder reconstructs the output video by applying prediction mechanisms similar to those used by the encoder in order to form a predicted representation of the pixel blocks (using the motion or spatial information created by the encoder and stored in the compressed representation) and performing prediction error decoding. Prediction error decoding is the inverse operation of prediction error coding and is used to recover the quantized prediction error signal in a spatial pixel domain. After applying prediction and prediction error decoding mechanisms, the decoder sums up the prediction and prediction error signals (i.e., the pixel values) to form the output video frame. The decoder and encoder can also apply additional filtering mechanisms to improve the quality of the output video before passing it for display and/or storing it as prediction reference for the forthcoming frames in the video sequence.
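The decoder-side steps described above can be sketched in the same spirit: the quantized levels are rescaled, inverse transformed back to the pixel domain, and added to the prediction. The sketch assumes an orthonormal floating-point inverse DCT, a uniform quantization step `qstep`, and 8-bit output samples clipped to the range 0 to 255; these choices and the function names are illustrative assumptions, and any additional filtering is left out.

```python
import math


def idct_2d(coeffs):
    """Inverse of the orthonormal 2-D DCT-II for a square block."""
    n = len(coeffs)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            s = 0.0
            for u in range(n):
                for v in range(n):
                    s += (scale(u) * scale(v) * coeffs[u][v]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[x][y] = s
    return out


def reconstruct_block(levels, predicted, qstep, max_value=255):
    """Rebuild a pixel block from quantized levels and a predicted block."""
    n = len(levels)
    # Inverse quantization: scale the levels back to coefficient magnitudes.
    coeffs = [[level * qstep for level in row] for row in levels]
    # Prediction error decoding: back to the spatial pixel domain.
    residual = idct_2d(coeffs)
    # Sum prediction and decoded prediction error, then clip to valid range.
    return [[min(max_value, max(0, round(predicted[i][j] + residual[i][j])))
             for j in range(n)]
            for i in range(n)]


levels = [[4, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
predicted = [[100] * 4 for _ in range(4)]
decoded = reconstruct_block(levels, predicted, qstep=4)
```

Because the decoder only sees the quantized levels, the reconstructed block generally differs from the original by the quantization error; in this flat example the error happens to be zero.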
In some video codecs, the motion information is indicated with motion vectors associated with each motion-compensated image block. Each motion vector represents the displacement between the image block in the picture to be coded (at the encoder side) or decoded (at the decoder side) and the prediction source block in one of the previously coded or decoded pictures. In order to represent motion vectors efficiently, they are often coded differentially with respect to block-specific predicted motion vectors. In many video codecs, the predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks.
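Median prediction and differential coding of motion vectors can be illustrated with the short sketch below. It assumes that motion vectors are (x, y) integer pairs, that an odd number of adjacent-block vectors is available (three in the example), and that the median is taken separately for each component; the function names and the choice of neighbors are hypothetical.

```python
def median_mv(neighbors):
    """Component-wise median of an odd number of neighboring motion vectors."""
    xs = sorted(mv[0] for mv in neighbors)
    ys = sorted(mv[1] for mv in neighbors)
    mid = len(neighbors) // 2
    return (xs[mid], ys[mid])


def mv_difference(mv, predicted):
    """Encoder side: the difference that is actually coded."""
    return (mv[0] - predicted[0], mv[1] - predicted[1])


def reconstruct_mv(diff, predicted):
    """Decoder side: add the coded difference back to the same prediction."""
    return (diff[0] + predicted[0], diff[1] + predicted[1])


# Motion vectors of three already coded adjacent blocks (assumed values).
neighbors = [(2, -1), (4, 0), (3, 5)]
predicted = median_mv(neighbors)          # (3, 0)
diff = mv_difference((4, 1), predicted)   # (1, 1)
recovered = reconstruct_mv(diff, predicted)
```

Since neighboring blocks tend to move together, the difference is usually small and cheaper to code than the vector itself. The encoder and decoder derive the same prediction because both use only previously coded or decoded vectors.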
A number of video encoders utilize Lagrangian cost functions to determine optimal coding modes, e.g. the desired macroblock mode and associated motion vectors. This type of cost function uses a weighting factor λ to tie together the exact or estimated image distortion due to lossy coding methods and the exact or estimated amount of information that is required to represent the pixel values in an image area: C = D + λR, where C is the Lagrangian cost to be minimized, D is the image distortion (e.g., the mean squared error) with the mode and motion vectors considered, and R is the number of bits needed to represent the data required to reconstruct the image block in the decoder (including the amount of data used to represent the candidate motion vectors).
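A mode decision based on the cost C = D + λR can be sketched as follows. The candidate modes, their distortion and rate figures, and the function names are made up for illustration; in a real encoder D and R would be measured or estimated for each candidate.

```python
def lagrangian_cost(distortion, rate_bits, lam):
    """C = D + lambda * R."""
    return distortion + lam * rate_bits


def choose_mode(candidates, lam):
    """Return the name of the candidate with the lowest Lagrangian cost.

    candidates: iterable of (name, distortion, rate_bits) tuples.
    """
    best = min(candidates, key=lambda c: lagrangian_cost(c[1], c[2], lam))
    return best[0]


# Hypothetical candidates: (mode name, distortion D, rate R in bits).
candidates = [
    ("intra", 100.0, 10),
    ("inter", 60.0, 40),
    ("skip", 150.0, 1),
]

low_lambda_choice = choose_mode(candidates, lam=0.5)    # favors low distortion
mid_lambda_choice = choose_mode(candidates, lam=5.0)
high_lambda_choice = choose_mode(candidates, lam=60.0)  # favors low rate
```

With a small λ the cost is dominated by distortion, so the most accurate mode wins; as λ grows, bits become more expensive and cheaper modes are selected even though they distort the block more.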