A video sequence consists of a large number of video frames, each of which is formed of a large number of pixels, and each pixel is represented by a set of digital bits. Because of the large number of pixels in a video frame and the large number of video frames even in a typical video sequence, the amount of data required to represent the video sequence quickly becomes large. For instance, a video frame may include an array of 640 by 480 pixels, each pixel having an RGB (red, green, blue) color representation of eight bits per color component, totaling 7,372,800 bits per frame. Video sequences comprise a sequence of still images, which are recorded/displayed at a rate of typically 15-30 frames per second. The amount of data needed to transmit information about each pixel in each frame separately would thus be enormous.
Video coding tackles the problem of reducing the amount of information that needs to be transmitted in order to present the video sequence with an acceptable image quality. For example, in videotelephony the encoded video information is transmitted using conventional telephone networks, where transmission bit rates are typically multiples of 64 kilobits/s. In mobile videotelephony, where transmission takes place at least in part over a radio communications link, the available transmission bit rates can be as low as 20 kilobits/s.
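The arithmetic behind these figures can be checked with a short sketch. The function name and the choice of a single 64 kilobits/s channel as the point of comparison are illustrative assumptions, not part of any coding scheme.

```python
def raw_bits_per_frame(width, height, components=3, bits_per_component=8):
    """Uncompressed size of one frame in bits."""
    return width * height * components * bits_per_component


bits_per_frame = raw_bits_per_frame(640, 480)  # 640 x 480 RGB, 8 bits per component

for fps in (15, 30):
    raw_rate = bits_per_frame * fps      # uncompressed bits per second
    ratio = raw_rate / 64_000            # compared with one 64 kilobits/s channel
    print(f"{fps} frames/s: {raw_rate:,} bits/s, "
          f"{ratio:,.0f} times a 64 kilobits/s channel")
```

At 30 frames per second the uncompressed stream is 221,184,000 bits/s, which is 3,456 times the capacity of a single 64 kilobits/s channel.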
In typical video sequences the change of the content of successive frames is to a great extent the result of the motion in the scene. This motion may be due to camera motion or due to motion of the objects present in the scene. Therefore typical video sequences are characterized by significant temporal correlation, which is highest along the trajectory of the motion. Efficient compression of video sequences usually takes advantage of this property. Motion compensated prediction is a widely recognized technique for compression of video. It utilizes the fact that in a typical video sequence, the image intensity/chrominance values in a particular frame segment can be predicted using the image intensity/chrominance values of some other already coded and transmitted frame, given the motion trajectory between these two frames. Occasionally it is advisable to transmit a whole frame, to prevent the deterioration of image quality due to accumulation of errors and to provide additional functionalities (for example, random access to the video sequence).
A schematic diagram of an example video coding system using motion compensated prediction is shown in FIGS. 1 and 2 of the accompanying drawings. FIG. 1 illustrates an encoder 10 employing motion compensation and FIG. 2 illustrates a corresponding decoder 20. The operating principle of video coders using motion compensation is to minimize the prediction error frame En(x,y), which is the difference between the current frame In(x,y) being coded and a prediction frame Pn(x,y). The prediction error frame is thus

$$E_n(x,y) = I_n(x,y) - P_n(x,y) \tag{1}$$
The prediction frame is built using pixel values of a reference frame Rn(x,y), which is one of the previously coded and transmitted frames (for example, a frame preceding the current frame), and the motion of pixels between the current frame and the reference frame. The motion of the pixels may be presented as the values of horizontal and vertical displacements Δx(x,y) and Δy(x,y) of a pixel at location (x,y) in the current frame In(x,y). The pair of numbers [Δx(x,y),Δy(x,y)] is called the motion vector of this pixel. The motion vectors are typically represented using some known functions (called basis functions) and coefficients (this is discussed in more detail below), and an approximate motion vector field $(\tilde{\Delta}x(x,y), \tilde{\Delta}y(x,y))$ can be constructed using the coefficients and the basis functions.
The prediction frame is given by

$$P_n(x,y) = R_n[x + \tilde{\Delta}x(x,y),\; y + \tilde{\Delta}y(x,y)], \tag{2}$$

where the reference frame Rn(x,y) is available in the Frame Memory 17 of the encoder 10 and in the Frame Memory 24 of the decoder 20 at a given instant. An information stream 2 carrying information about the motion vectors is combined with information about the prediction error (1) in the multiplexer 16, and an information stream 3 containing typically at least those two types of information is sent to the decoder 20.
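Equations (1) and (2) can be illustrated with a minimal sketch. It assumes integer-valued displacements, frames stored as nested lists indexed as frame[y][x], and replication of border pixels when a displaced coordinate falls outside the reference frame; none of these choices is prescribed by the text, and the function names are hypothetical.

```python
def predict_frame(ref, dx, dy):
    """Equation (2): P(x,y) = R[x + dx(x,y), y + dy(x,y)].

    ref, dx and dy are nested lists indexed [y][x]. Coordinates outside
    the reference frame are clamped to the border (an assumption).
    """
    h, w = len(ref), len(ref[0])
    pred = []
    for y in range(h):
        row = []
        for x in range(w):
            sx = min(max(x + dx[y][x], 0), w - 1)
            sy = min(max(y + dy[y][x], 0), h - 1)
            row.append(ref[sy][sx])
        pred.append(row)
    return pred


def prediction_error(cur, pred):
    """Equation (1): E(x,y) = I(x,y) - P(x,y)."""
    return [[c - p for c, p in zip(crow, prow)] for crow, prow in zip(cur, pred)]


# The current frame is the reference shifted one pixel to the right.
ref = [[10, 20, 30, 40],
       [50, 60, 70, 80]]
cur = [[10, 10, 20, 30],
       [50, 50, 60, 70]]

zero = [[0] * 4 for _ in range(2)]
left = [[-1] * 4 for _ in range(2)]  # each pixel is fetched from x - 1

err_no_motion = prediction_error(cur, predict_frame(ref, zero, zero))
err_with_motion = prediction_error(cur, predict_frame(ref, left, zero))
print(err_no_motion)    # non-zero residual without motion compensation
print(err_with_motion)  # all zeros once the shift is compensated
```

With the correct motion field the prediction error frame vanishes, so only the motion information would need to be transmitted for this toy frame.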
In the Prediction Error Coding block 14, the prediction error frame En(x,y) is typically compressed by representing it as a finite series (transform) of some 2-dimensional functions. For example, a 2-dimensional Discrete Cosine Transform (DCT) can be used. The transform coefficients related to each function are quantized and entropy coded before they are transmitted to the decoder (information stream 1 in FIG. 1). Because of the error introduced by quantization, this operation usually produces some degradation in the prediction error frame En(x,y).
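The transform-and-quantize step can be sketched as follows, assuming an orthonormal 2-dimensional DCT-II applied to a small square block and a uniform quantizer with a single step size. Entropy coding is omitted, and the block size, step size and function names are illustrative assumptions only.

```python
import math


def dct_1d(v):
    """Orthonormal DCT-II of a list of samples."""
    n = len(v)
    return [math.sqrt((1 if k == 0 else 2) / n)
            * sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n))
            for k in range(n)]


def idct_1d(c):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(c)
    return [sum(math.sqrt((1 if k == 0 else 2) / n) * c[k]
                * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for k in range(n))
            for i in range(n)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct_2d(block):
    """Separable 2-D DCT: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in block]
    return transpose([dct_1d(c) for c in transpose(rows)])


def idct_2d(coeffs):
    rows = [idct_1d(r) for r in coeffs]
    return transpose([idct_1d(c) for c in transpose(rows)])


def quantize(coeffs, step):
    """Uniform quantizer: index of the nearest multiple of step."""
    return [[math.floor(c / step + 0.5) for c in row] for row in coeffs]


def dequantize(levels, step):
    return [[lv * step for lv in row] for row in levels]


# A flat 4x4 prediction error block: all energy lands in the DC coefficient.
block = [[8] * 4 for _ in range(4)]
coeffs = dct_2d(block)
levels = quantize(coeffs, step=10)
decoded = idct_2d(dequantize(levels, step=10))
```

In this example the DC coefficient of 32 is quantized to the level 3 and reconstructed as 30, so every decoded sample becomes 7.5 instead of 8. This is the kind of degradation that quantization introduces into the prediction error frame.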
In the Frame Memory 24 of the decoder 20 there is a previously reconstructed reference frame Rn(x,y). Using the decoded motion information $(\tilde{\Delta}x(x,y), \tilde{\Delta}y(x,y))$ and Rn(x,y), it is possible to reconstruct the prediction frame Pn(x,y) in the Motion Compensated Prediction block 21 of the decoder 20. The transmitted transform coefficients of the prediction error frame En(x,y) are used in the Prediction Error Decoding block 22 to construct the decoded prediction error frame $\tilde{E}_n(x,y)$. The pixels of the decoded current frame $\tilde{I}_n(x,y)$ are reconstructed by adding the prediction frame Pn(x,y) and the decoded prediction error frame $\tilde{E}_n(x,y)$:

$$\tilde{I}_n(x,y) = P_n(x,y) + \tilde{E}_n(x,y) = R_n[x + \tilde{\Delta}x(x,y),\; y + \tilde{\Delta}y(x,y)] + \tilde{E}_n(x,y). \tag{3}$$
This decoded current frame may be stored in the Frame Memory 24 as the next reference frame Rn+1(x,y).
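The decoder side of equation (3), including the update of the frame memory, might look like the sketch below. For brevity it assumes a single integer translational motion vector for the whole frame, border replication, and a decoded prediction error frame that is already available; the function and variable names are hypothetical.

```python
def decode_frame(ref, motion, decoded_err):
    """Equation (3): decoded frame = motion compensated prediction + decoded error.

    ref and decoded_err are nested lists indexed [y][x]; motion is one
    (dx, dy) pair applied to every pixel (a simplifying assumption).
    """
    dx, dy = motion
    h, w = len(ref), len(ref[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            sx = min(max(x + dx, 0), w - 1)  # clamp to the frame border
            sy = min(max(y + dy, 0), h - 1)
            row.append(ref[sy][sx] + decoded_err[y][x])
        out.append(row)
    return out


frame_memory = [[10, 20, 30],
                [40, 50, 60]]
decoded_err = [[1, 0, -2],
               [0, 3, 0]]

decoded = decode_frame(frame_memory, (1, 0), decoded_err)
frame_memory = decoded  # the decoded frame becomes the next reference frame
print(decoded)
```

Because the encoder keeps the same reconstructed frame in its own Frame Memory 17, both sides build their next prediction from identical reference data.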
Let us next discuss in more detail the motion compensation and transmission of motion information. The construction of the prediction frame Pn(x,y) in the Motion Compensated Prediction block 13 of the encoder 10 requires information about the motion in the current frame In(x,y). Motion vectors [Δx(x,y),Δy(x,y)] are calculated in the Motion Field Estimation block 11 in the encoder 10. The set of motion vectors of all pixels of the current frame [Δx(·),Δy(·)] is called the motion vector field. Due to the very large number of pixels in a frame it is not efficient to transmit a separate motion vector for each pixel to the decoder. Instead, in most video coding schemes the current frame is divided into larger image segments and information about the segments is transmitted to the decoder.
The motion vector field is coded in the Motion Field Coding block 12 of the encoder 10. Motion Field Coding refers to representing the motion in a frame using some predetermined functions or, in other words, representing it with a model. Almost all of the motion vector field models commonly used are additive motion models. Motion compensated video coding schemes may define the motion vectors of image segments by the following general formula:
$$\Delta x(x,y) = \sum_{i=0}^{N-1} a_i f_i(x,y) \tag{4}$$

$$\Delta y(x,y) = \sum_{i=0}^{M-1} b_i g_i(x,y) \tag{5}$$

where the coefficients $a_i$ and $b_i$ are called motion coefficients. They are transmitted to the decoder. The functions $f_i$ and $g_i$ are called motion field basis functions, and they are known both to the encoder and to the decoder.
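As a sketch of equations (4) and (5), a motion vector can be evaluated as a weighted sum of basis functions. The particular basis (1, x, y) used here is only an example, and the names are illustrative assumptions.

```python
def motion_vector(x, y, a, b, f, g):
    """Evaluate equations (4) and (5) at pixel location (x, y).

    a, b: motion coefficients (these would be transmitted to the decoder).
    f, g: basis functions known to both encoder and decoder.
    """
    dx = sum(ai * fi(x, y) for ai, fi in zip(a, f))
    dy = sum(bi * gi(x, y) for bi, gi in zip(b, g))
    return dx, dy


# Example basis shared by both components: 1, x, y.
basis = [lambda x, y: 1, lambda x, y: x, lambda x, y: y]

a = [1.0, 0.5, -0.25]
b = [0.0, 0.0, 0.5]
vector = motion_vector(2, 3, a, b, basis, basis)
print(vector)
```

Only the coefficient lists a and b differ from segment to segment; the basis functions are fixed in advance, which is why they do not need to be transmitted.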
In order to minimize the amount of information needed in sending the motion coefficients to the decoder, coefficients can be predicted from the coefficients of the neighboring segments. When this kind of motion field prediction is used, the motion field is expressed as a sum of a prediction motion field and refinement motion field. The prediction motion field uses the motion vectors associated with neighboring segments of the current frame. The prediction is performed using the same set of rules and possibly some auxiliary information in both encoder and decoder. The refinement motion field is coded, and the motion coefficients related to this refinement motion field are transmitted to the decoder. This approach typically results in savings in transmission bit rate. The dashed lines in FIG. 1 illustrate some examples of the possible information some motion estimation and coding schemes may require in the Motion Field Estimation block 11 and in the Motion Field Coding block 12.
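The split into a prediction motion field and a refinement motion field can be sketched for translational coefficients as follows. The component-wise median of three neighboring segments is used as the prediction rule purely as an assumption for illustration; the text above does not fix a particular rule, and the function names are hypothetical.

```python
from statistics import median


def predict_coefficients(neighbors):
    """Component-wise median of the neighbors' motion coefficients (assumed rule)."""
    return tuple(median(component) for component in zip(*neighbors))


def refinement(actual, predicted):
    """Encoder side: only this difference is coded and transmitted."""
    return tuple(a - p for a, p in zip(actual, predicted))


def reconstruct(predicted, refine):
    """Decoder side: same prediction rule, plus the received refinement."""
    return tuple(p + r for p, r in zip(predicted, refine))


neighbors = [(2, 0), (3, 1), (2, -1)]  # (a0, b0) of already coded segments
actual = (3, 0)                        # coefficients estimated for this segment

predicted = predict_coefficients(neighbors)
refine = refinement(actual, predicted)
restored = reconstruct(predicted, refine)
print(predicted, refine, restored)
```

When neighboring segments move similarly, the refinement values cluster around zero and can typically be coded with fewer bits than the coefficients themselves.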
Polynomial motion models are a widely used family of models. (See, for example H. Nguyen and E. Dubois, “Representation of motion information for image coding,” in Proc. Picture Coding Symposium '90, Cambridge, Mass., Mar. 26-18, 1990, pp. 841-845 and Centre de Morphologie Mathematique (CMM), “Segmentation algorithm by multicriteria region merging,” Document SIM(95)19, COST 211ter Project Meeting, May 1995). The values of motion vectors are described by functions which are linear combinations of two dimensional polynomial functions. The translational motion model is the simplest model and requires only two coefficients to describe the motion vectors of each segment. The values of motion vectors are given by the formulae:

$$\Delta x(x,y) = a_0, \qquad \Delta y(x,y) = b_0 \tag{6}$$
This model is widely used in various international standards (ISO MPEG-1, MPEG-2, MPEG-4, ITU-T Recommendations H.261 and H.263) to describe motion of 16×16 and 8×8 pixel blocks. Systems utilizing a translational motion model typically perform motion estimation at full pixel resolution or some integer fraction of full pixel resolution, for example with an accuracy of ½ or ⅓ pixel resolution.
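Two other widely used models are the affine motion model given by the equation:

$$\Delta x(x,y) = a_0 + a_1 x + a_2 y, \qquad \Delta y(x,y) = b_0 + b_1 x + b_2 y \tag{7}$$

and the quadratic motion model given by the equation:

$$\Delta x(x,y) = a_0 + a_1 x + a_2 y + a_3 xy + a_4 x^2 + a_5 y^2, \qquad \Delta y(x,y) = b_0 + b_1 x + b_2 y + b_3 xy + b_4 x^2 + b_5 y^2 \tag{8}$$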
The affine motion model presents a very convenient trade-off between the number of motion coefficients and prediction performance. It is capable of representing some of the common real-life motion types such as translation, rotation, zoom and shear with only a few coefficients. The quadratic motion model provides good prediction performance, but it is less popular in coding than the affine model, since it uses more motion coefficients, while the prediction performance is not substantially better. Furthermore, it is computationally more costly to estimate the quadratic motion than to estimate the affine motion.
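To make the motion types concrete, the sketch below lists one possible choice of affine coefficients for translation, zoom, rotation and shear, each taken about the coordinate origin. The sign conventions and helper names are assumptions for illustration; an actual codec may define the origin and the direction of displacement differently.

```python
import math


def affine_vector(x, y, a, b):
    """Equation (7): displacement of the pixel at (x, y)."""
    return (a[0] + a[1] * x + a[2] * y,
            b[0] + b[1] * x + b[2] * y)


def translation(tx, ty):
    return (tx, 0, 0), (ty, 0, 0)


def zoom(s):
    # (x, y) maps to (s*x, s*y), so the displacement is ((s-1)*x, (s-1)*y).
    return (0, s - 1, 0), (0, 0, s - 1)


def rotation(theta):
    # (x, y) maps to (x*cos - y*sin, x*sin + y*cos).
    c, s = math.cos(theta), math.sin(theta)
    return (0, c - 1, -s), (0, s, c - 1)


def shear(k):
    # (x, y) maps to (x + k*y, y).
    return (0, 0, k), (0, 0, 0)


print(affine_vector(10, 20, *translation(3, -2)))
print(affine_vector(10, 20, *zoom(1.5)))
print(affine_vector(1, 0, *rotation(math.pi / 2)))
print(affine_vector(4, 2, *shear(0.5)))
```

Each of these motions needs the six coefficients of equation (7), compared with two for the translational model of equation (6) and twelve for the quadratic model of equation (8).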
When the motion field is estimated using higher order motion models (such as presented, for example, in equations 7 and 8), the motion field estimation results in a motion field represented by real numbers. In this case the motion coefficients need to be quantized to a discrete accuracy before they are transmitted to the decoder.
The Motion Field Estimation block 11 calculates motion vectors [Δx(x,y),Δy(x,y)] of the pixels of a given segment Sk which minimize some measure of prediction error in the segment. In the simplest case the motion field estimation uses the current frame In(x,y) and the reference frame Rn(x,y) as input values. Typically the Motion Field Estimation block outputs the motion field [Δx(x,y),Δy(x,y)] to the Motion Field Coding block 12. The Motion Field Coding block makes the final decisions on what kind of motion vector field is transmitted to the decoder and how the motion vector field is coded. It can modify the motion model and motion coefficients in order to minimize the amount of information needed to describe a satisfactory motion vector field.
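One simple way to realize such an estimation is exhaustive block matching. The sketch below assumes a translational model, integer-pixel accuracy, a square segment, the sum of absolute differences (SAD) as the measure of prediction error, and border replication. These are illustrative choices rather than the method of any particular encoder, and the function names are hypothetical.

```python
def sad(cur, ref, x0, y0, size, dx, dy):
    """Sum of absolute differences for one candidate vector over a square segment."""
    h, w = len(ref), len(ref[0])
    total = 0
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            sx = min(max(x + dx, 0), w - 1)  # clamp to the frame border
            sy = min(max(y + dy, 0), h - 1)
            total += abs(cur[y][x] - ref[sy][sx])
    return total


def estimate_translation(cur, ref, x0, y0, size, search):
    """Try every integer vector in [-search, search]^2 and keep the best one."""
    best_key, best_vec = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cost = sad(cur, ref, x0, y0, size, dx, dy)
            key = (cost, abs(dx) + abs(dy))  # prefer shorter vectors on ties
            if best_key is None or key < best_key:
                best_key, best_vec = key, (dx, dy)
    return best_vec, best_key[0]


# Synthetic 8x8 reference frame and a current frame with known motion (2, 1).
ref = [[x * x + 10 * y * y for x in range(8)] for y in range(8)]
cur = [[ref[min(y + 1, 7)][min(x + 2, 7)] for x in range(8)] for y in range(8)]

vector, cost = estimate_translation(cur, ref, x0=2, y0=2, size=4, search=2)
print(vector, cost)
```

The search recovers the vector (2, 1) with zero residual for the 4×4 segment at (2, 2). Its cost grows with the square of the search range, which is one reason practical encoders often restrict or guide the search.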
The image quality of transmitted video frames depends on the accuracy with which the prediction frame can be constructed, in other words on the accuracy of the transmitted motion information, and on the accuracy with which the prediction error information is transmitted. Here the term accuracy refers not only to the ability of the motion field model to represent the motion within the frame but also to the numerical precision with which the motion information and the prediction error information are represented. Motion information transmitted with high accuracy may be canceled out in the decoding phase due to low accuracy of the prediction error frame, or vice versa.
Current video coding systems employ various motion estimation and coding techniques, as discussed above. The accuracy of the motion information and the transmission bit rate needed to transmit the motion information are typically dictated by the choice of the motion estimation and coding technique, and a chosen technique is usually applied to a whole video sequence. Generally, as the accuracy of the transmitted motion information increases, the amount of transmitted information increases.
In general, better image quality requires larger amounts of transmitted information. Typically, if the available transmission bit rate is limited, this limitation dictates the best possible image quality of transmitted video frames. It is also possible to aim for a certain target image quality, and the transmission bit rate then depends on the target image quality. In current video coding and decoding systems, the trade-offs between the required transmission bit rate and image quality are mainly made by adjusting the accuracy of the information presenting the prediction error frame. This accuracy may change, for example, from frame to frame, or even between different segments of a frame.
The problem in changing the accuracy of the transmitted prediction error frame is that it may cause unexpected degradation of the overall performance of the video encoding, for example, when conforming to a new available transmission bit rate. In other words, the image quality achieved is not as good as that expected considering the transmission bit rate. The image quality may deteriorate drastically, when a lower transmission bit rate is available, or the image quality may not be enhanced even though a higher transmission bit rate is used.