A video sequence typically consists of a large number of video frames, each formed of a large number of pixels, each of which is represented by a set of digital bits. Because of the large number of pixels in a video frame and the large number of video frames even in a typical video sequence, the amount of data required to represent the video sequence quickly becomes large. For instance, a video frame may include an array of 640 by 480 pixels, each pixel having an RGB (red, green, blue) color representation of eight bits per color component, totaling 7,372,800 bits per frame. Another example is a QCIF (quarter common intermediate format) video frame including 176×144 pixels. QCIF provides an acceptably sharp image on small (a few square centimeters) LCD displays, which are typically available in mobile communication devices. Again, if the color of each pixel is represented using eight bits per color component, the total number of bits per frame is 608,256.
Alternatively, a video frame can be represented using a related luminance/chrominance model, known as the YUV color model. The human visual system is more sensitive to intensity (luminance) variations than it is to color (chrominance) variations. The YUV color model exploits this property by representing an image in terms of a luminance component Y and two chrominance components U, V, and by using a lower resolution for the chrominance components than for the luminance component. In this way the amount of information needed to code the color information in an image can be reduced with an acceptable reduction in image quality. The lower resolution of the chrominance components is usually attained by spatial sub-sampling. Typically a block of 16×16 pixels in the image is coded by one block of 16×16 pixels representing the luminance information and by one block of 8×8 pixels for each chrominance component. The chrominance components are thus sub-sampled by a factor of 2 in the x and y directions. The resulting assembly of one 16×16 pixel luminance block and two 8×8 pixel chrominance blocks is here referred to as a YUV macroblock. A QCIF image comprises 11×9 YUV macroblocks. The luminance blocks and chrominance blocks are represented with 8 bit resolution, and the total number of bits required per YUV macroblock is (16×16×8)+2×(8×8×8)=3072 bits. The number of bits needed to represent a QCIF video frame is thus 99×3072=304,128 bits.
In a video sequence comprising frames in YUV coded QCIF format recorded/displayed at a rate of 15-30 frames per second, the amount of data needed to transmit information about each pixel in each frame separately would thus be more than 4 Mbps (million bits per second). In conventional videotelephony, where the encoded video information is transmitted using fixed-line telephone networks, the transmission bit rates are typically multiples of 64 kilobits/s. In mobile videotelephony, where transmission takes place at least in part over a radio communications link, the available transmission bit rates can be as low as 20 kilobits/s. It is therefore evident that methods are required whereby the amount of information used to represent a video sequence can be reduced. Video coding tackles the problem of reducing the amount of information that needs to be transmitted in order to present the video sequence with an acceptable image quality.
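The arithmetic above can be checked with a short Python sketch. The function names are hypothetical and chosen only for illustration; the frame sizes, bit depths and the 20 kilobits/s channel figure come from the text.

```python
# Raw (uncompressed) sizes for the frame formats discussed above.

def rgb_bits_per_frame(width, height, bits_per_component=8):
    """Bits for an RGB frame with three full-resolution color components."""
    return width * height * 3 * bits_per_component

def yuv_macroblock_bits(bits_per_sample=8):
    """One 16x16 luminance block plus two 8x8 chrominance blocks."""
    return (16 * 16 + 2 * 8 * 8) * bits_per_sample

def yuv_bits_per_frame(width, height, bits_per_sample=8):
    """Bits for a frame made of whole YUV macroblocks."""
    macroblocks = (width // 16) * (height // 16)
    return macroblocks * yuv_macroblock_bits(bits_per_sample)

vga_rgb = rgb_bits_per_frame(640, 480)    # 7,372,800 bits
qcif_rgb = rgb_bits_per_frame(176, 144)   # 608,256 bits
qcif_yuv = yuv_bits_per_frame(176, 144)   # 99 * 3072 = 304,128 bits

for fps in (15, 30):
    raw_rate = qcif_yuv * fps
    # Reduction factor needed to fit a 20 kilobits/s mobile channel.
    print(fps, "frames/s:", raw_rate, "bits/s, factor", round(raw_rate / 20_000))
```

At 15 frames per second the raw rate is 4,561,920 bits/s, so a 20 kilobits/s channel calls for a reduction by a factor of roughly 228.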
In typical video sequences the change of the content of successive frames is to a great extent the result of the motion in the scene. This motion may be due to camera motion or due to motion of the objects present in the scene. Therefore, typical video sequences are characterized by significant temporal correlation, which is highest along the trajectory of the motion. Efficient compression of video sequences usually takes advantage of this property of video sequences. Motion compensated prediction is a widely recognized technique for compression of video. It utilizes the fact that in a typical video sequence, image intensity/chrominance values in a particular frame segment can be predicted using image intensity/chrominance values of a segment in some other already coded and transmitted frame, given the motion trajectory between these two frames. Occasionally, it is advisable to transmit a frame that is coded without reference to any other frames, to prevent deterioration of image quality due to accumulation of errors and to provide additional functionality such as random access to the video sequence. Such a frame is called an INTRA frame.
A schematic diagram of an example video coding system using motion compensated prediction is shown in FIGS. 1 and 2 of the accompanying drawings. FIG. 1 illustrates an encoder 10 employing motion compensation and FIG. 2 illustrates a corresponding decoder 20. The operating principle of video coders using motion compensation is to minimize the prediction error frame En(x,y), which is the difference between the current frame In(x,y) being coded and a prediction frame Pn(x,y). The prediction error frame is thus

$$E_n(x,y) = I_n(x,y) - P_n(x,y). \tag{1}$$
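Equation (1) is a pixel-by-pixel subtraction. A minimal Python sketch follows, assuming that frames are stored as lists of rows indexed as frame[y][x]; that layout and the function name are illustrative assumptions, not part of the described system.

```python
def prediction_error(current, prediction):
    """E_n(x, y) = I_n(x, y) - P_n(x, y), computed pixel by pixel."""
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(current, prediction)]

# Tiny 3x2 example frames with made-up pixel values.
current = [[10, 12, 14],
           [20, 22, 24]]
prediction = [[10, 11, 15],
              [20, 22, 20]]

error = prediction_error(current, prediction)
print(error)  # [[0, 1, -1], [0, 0, 4]]
```

The better the prediction, the closer the error frame is to zero and the fewer bits are needed to code it.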
The prediction frame Pn(x,y) is built using pixel values of a reference frame Rn(x,y), which is one of the previously coded and transmitted frames (for example, a frame preceding the current frame), and the motion of pixels between the current frame and the reference frame. More precisely, the prediction frame is constructed by finding the prediction pixels in the reference frame Rn(x,y) and moving the prediction pixels as the motion information specifies. The motion of the pixels may be presented as the values of horizontal and vertical displacements Δx(x,y) and Δy(x,y) of a pixel at location (x,y) in the current frame In(x,y). The pair of numbers [Δx(x,y),Δy(x,y)] is called the motion vector of this pixel.
The motion vectors [Δx(x,y),Δy(x,y)] are calculated in the Motion Field Estimation block 11 in the encoder 10. The set of motion vectors of all pixels of the current frame [Δx(.),Δy(.)] is called the motion vector field. Due to the very large number of pixels in a frame it is not efficient to transmit a separate motion vector for each pixel to the decoder. Instead, in most video coding schemes the current frame is divided into larger image segments Sk and information about the segments is transmitted to the decoder.
The motion vector field is coded in the Motion Field Coding block 12 of the encoder 10. Motion Field Coding refers to the process of representing the motion in a frame using some predetermined functions or, in other words, representing it with a model. Almost all of the motion vector field models commonly used are additive motion models. Motion compensated video coding schemes may define the motion vectors of image segments by the following general formula:

$$\Delta x(x,y) = \sum_{i=0}^{N-1} a_i f_i(x,y) \tag{2}$$

$$\Delta y(x,y) = \sum_{i=0}^{M-1} b_i g_i(x,y) \tag{3}$$

where coefficients ai and bi are called motion coefficients. They are transmitted to the decoder (information stream 2 in FIGS. 1 and 2). Functions ƒi and gi are called motion field basis functions, and they are known both to the encoder and decoder. An approximate motion vector field (Δ̃x(x,y),Δ̃y(x,y)) can be constructed using the coefficients and the basis functions.
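Equations (2) and (3) can be evaluated directly once the basis functions are fixed. The Python sketch below assumes, purely for illustration, the basis functions 1, x and y for both components; the function name and the coefficient values are made up.

```python
def motion_vector(x, y, a, b, f, g):
    """Evaluate equations (2) and (3) at pixel (x, y).

    a, b: motion coefficients (transmitted to the decoder).
    f, g: basis functions (known to both encoder and decoder).
    """
    dx = sum(a_i * f_i(x, y) for a_i, f_i in zip(a, f))
    dy = sum(b_i * g_i(x, y) for b_i, g_i in zip(b, g))
    return dx, dy

# Assumed basis: 1, x, y. Other choices are equally possible.
basis = [lambda x, y: 1.0, lambda x, y: x, lambda x, y: y]

a = [1.0, 0.0, -0.25]   # illustrative values only
b = [0.5, 0.125, 0.0]

print(motion_vector(8, 16, a, b, basis, basis))  # (-3.0, 1.5)

# With a single constant basis function every pixel gets the same vector.
constant = [lambda x, y: 1.0]
print(motion_vector(8, 16, [2.0], [-1.0], constant, constant))  # (2.0, -1.0)
```

Only the coefficient lists need to be transmitted, since the basis functions are shared in advance.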
The prediction frame Pn(x,y) is constructed in the Motion Compensated Prediction block 13 in the encoder 10, and it is given by

$$P_n(x,y) = R_n[x + \tilde{\Delta}x(x,y),\; y + \tilde{\Delta}y(x,y)], \tag{4}$$

where the reference frame Rn(x,y) is available in the Frame Memory 17 of the encoder 10 at a given instant.
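Equation (4) can be sketched in Python as follows. The sketch assumes integer-valued motion vectors and clamps coordinates that fall outside the reference frame; both choices, as well as the function name, are simplifying assumptions rather than part of the described system.

```python
def predict_frame(reference, motion_field):
    """Build P_n from R_n per equation (4).

    reference: frame stored as rows, indexed reference[y][x].
    motion_field: function (x, y) -> (dx, dy) with integer displacements.
    """
    height, width = len(reference), len(reference[0])
    prediction = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = motion_field(x, y)
            # Clamp to the frame; this border handling is an assumption.
            sx = min(max(x + dx, 0), width - 1)
            sy = min(max(y + dy, 0), height - 1)
            row.append(reference[sy][sx])
        prediction.append(row)
    return prediction

reference = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12]]

# A uniform displacement of one pixel in x for every pixel.
prediction = predict_frame(reference, lambda x, y: (1, 0))
print(prediction)  # [[2, 3, 4, 4], [6, 7, 8, 8], [10, 11, 12, 12]]
```

A practical coder working at sub-pixel accuracy would also need to interpolate the reference frame, which this sketch leaves out.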
In the Prediction Error Coding block 14, the prediction error frame En(x,y) is typically compressed by representing it as a finite series (transform) of some 2-dimensional functions. For example, a 2-dimensional Discrete Cosine Transform (DCT) can be used. The transform coefficients related to each function are quantized and entropy coded before they are transmitted to the decoder (information stream 1 in FIGS. 1 and 2). Because of the error introduced by quantization, this operation usually produces some degradation in the prediction error frame En(x,y). To take this degradation into account, so that the encoder and the decoder build their predictions from the same reference frame, a motion compensated encoder comprises a Prediction Error Decoding block 15, where a decoded prediction error frame Ẽn(x,y) is constructed using the transform coefficients. This decoded prediction error frame is added to the prediction frame Pn(x,y) and the resulting decoded current frame Ĩn(x,y) is stored to the Frame Memory 17 for further use as the next reference frame Rn+1(x,y).
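The transform, quantization and decoding steps can be illustrated with a small pure-Python sketch. It assumes an orthonormal DCT-II, a single uniform quantization step for all coefficients, and no entropy coding; the function names, block size and sample values are illustrative assumptions.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis: row k holds the k-th cosine function."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def code_block(block, step):
    """Transform, quantize and decode one square prediction error block."""
    c = dct_matrix(len(block))
    coeffs = matmul(matmul(c, block), transpose(c))       # 2-D DCT
    levels = [[round(v / step) for v in row] for row in coeffs]  # quantize
    dequant = [[q * step for q in row] for row in levels]
    decoded = matmul(matmul(transpose(c), dequant), c)    # inverse 2-D DCT
    return levels, decoded

# A made-up 4x4 prediction error block.
block = [[5, -3, 0, 2],
         [1, 4, -2, 0],
         [0, 0, 3, -1],
         [-4, 2, 1, 0]]

levels, decoded = code_block(block, step=4)
max_error = max(abs(d - o)
                for d_row, o_row in zip(decoded, block)
                for d, o in zip(d_row, o_row))
print(levels)
print(max_error)  # nonzero: quantization discards information
```

The quantized levels are what would be entropy coded and transmitted, and the decoded block is what both encoder and decoder add to the prediction frame.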
The information stream 2 carrying information about the motion vectors is combined with information about the prediction error in the multiplexer 16, and an information stream 3, typically containing at least those two types of information, is sent to the decoder 20.
In the Frame Memory 24 of the decoder 20 there is a previously reconstructed reference frame Rn(x,y). The prediction frame Pn(x,y) is constructed in the Motion Compensated Prediction block 21 in the decoder 20 in the same way as in the Motion Compensated Prediction block 13 in the encoder 10. The transmitted transform coefficients of the prediction error frame En(x,y) are used in the Prediction Error Decoding block 22 to construct the decoded prediction error frame Ẽn(x,y). The pixels of the decoded current frame Ĩn(x,y) are reconstructed by adding the prediction frame Pn(x,y) and the decoded prediction error frame Ẽn(x,y):

$$\tilde{I}_n(x,y) = P_n(x,y) + \tilde{E}_n(x,y) = R_n[x + \tilde{\Delta}x(x,y),\; y + \tilde{\Delta}y(x,y)] + \tilde{E}_n(x,y). \tag{5}$$
This decoded current frame may be stored in the Frame Memory 24 as the next reference frame Rn+1(x,y).
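The reconstruction step can be sketched in Python as follows. As a stand-in for the transform-based Prediction Error Coding and Decoding blocks, the sketch applies plain scalar quantization to the error values; that simplification, the function names and the pixel values are assumptions made for illustration.

```python
def quantize_error(error, step):
    """Simplified stand-in for prediction error coding plus decoding."""
    return [[round(e / step) * step for e in row] for row in error]

def reconstruct(prediction, decoded_error):
    """Decoded frame = prediction frame + decoded prediction error frame."""
    return [[p + e for p, e in zip(p_row, e_row)]
            for p_row, e_row in zip(prediction, decoded_error)]

current = [[10, 13],
           [21, 31]]
prediction = [[9, 8],
              [20, 20]]

error = [[c - p for c, p in zip(c_row, p_row)]
         for c_row, p_row in zip(current, prediction)]
decoded_error = quantize_error(error, step=4)

# Encoder (Frame Memory 17) and decoder (Frame Memory 24) perform the
# same addition, so they hold the same next reference frame.
encoder_reference = reconstruct(prediction, decoded_error)
decoder_reference = reconstruct(prediction, decoded_error)

print(decoded_error)       # [[0, 4], [0, 12]]
print(decoder_reference)   # [[9, 12], [20, 32]]
```

The reconstructed frame differs slightly from the original current frame, but because both sides store the same reconstruction, the quantization error does not cause the encoder and decoder predictions to drift apart.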
Let us next discuss in more detail the motion compensation and transmission of motion information. In order to minimize the amount of information needed in sending the motion coefficients to the decoder, coefficients can be predicted from the coefficients of the neighboring segments. When this kind of motion field prediction is used, the motion field is expressed as a sum of a prediction motion field and a refinement motion field. The prediction motion field is constructed using the motion vectors associated with neighboring segments of the current frame. The prediction is performed using the same set of rules and possibly some auxiliary information in both encoder and decoder. The refinement motion field is coded, and the motion coefficients related to this refinement motion field are transmitted to the decoder. This approach typically results in savings in transmission bit rate. The dashed lines in FIG. 1 illustrate some examples of the possible information some motion estimation and coding schemes may require in the Motion Field Estimation block 11 and in the Motion Field Coding block 12.
Polynomial motion models are a widely used family of models. (See, for example, H. Nguyen and E. Dubois, “Representation of motion information for image coding,” in Proc. Picture Coding Symposium '90, Cambridge, Mass., Mar. 26-28, 1990, pp. 841-845, and Centre de Morphologie Mathematique (CMM), “Segmentation algorithm by multicriteria region merging,” Document SIM(95)19, COST 211 ter Project Meeting, May 1995.)
The values of motion vectors are described by functions which are linear combinations of two dimensional polynomial functions. The translational motion model is the simplest model and requires only two coefficients to describe the motion vectors of each segment. The values of motion vectors are given by the formulae:

$$\Delta x(x,y) = a_0, \qquad \Delta y(x,y) = b_0 \tag{6}$$
This model is widely used in various international standards (ISO MPEG-1, MPEG-2, MPEG-4, ITU-T Recommendations H.261 and H.263) to describe motion of 16×16 and 8×8 pixel blocks. Systems utilizing a translational motion model typically perform motion estimation at full pixel resolution or some integer fraction of full pixel resolution, for example with an accuracy of ½ or ⅓ pixel resolution.
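Two other widely used models are the affine motion model given by the equation:

$$\Delta x(x,y) = a_0 + a_1 x + a_2 y, \qquad \Delta y(x,y) = b_0 + b_1 x + b_2 y \tag{7}$$

and the quadratic motion model given by the equation:

$$\Delta x(x,y) = a_0 + a_1 x + a_2 y + a_3 xy + a_4 x^2 + a_5 y^2, \qquad \Delta y(x,y) = b_0 + b_1 x + b_2 y + b_3 xy + b_4 x^2 + b_5 y^2 \tag{8}$$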
The affine motion model presents a very convenient trade-off between the number of motion coefficients and prediction performance. It is capable of representing some of the common real-life motion types such as translation, rotation, zoom and shear with only a few coefficients. The quadratic motion model provides good prediction performance, but it is less popular in coding than the affine model since it uses more motion coefficients, while the prediction performance is not substantially better than, for example, that of the affine motion model. Furthermore, it is computationally more costly to estimate the quadratic motion than to estimate the affine motion.
The Motion Field Estimation block 11 calculates initial motion coefficients a0i, ..., ani, b0i, ..., bni for the motion vector field [Δx(x,y),Δy(x,y)] of a given segment Sk, which initial motion coefficients minimize some measure of prediction error in the segment. In the simplest case, the motion field estimation uses the current frame In(x,y) and the reference frame Rn(x,y) as input values. Typically the Motion Field Estimation block outputs the initial motion coefficients for [Δx(x,y),Δy(x,y)] to the Motion Field Coding block 12.
The segmentation of the current frame into segments Sk can, for example, be carried out in such a way that each segment corresponds to a certain object moving in the video sequence, but this kind of segmentation is a very complex procedure. A typical and computationally less complex way to segment a video frame is to divide it into macroblocks and to further divide the macroblocks into rectangular blocks. In this description the term macroblock refers generally to a part of a video frame. An example of a macroblock is the previously described YUV macroblock. FIG. 3 presents an example, where a video frame 30 is divided into macroblocks 31 having a certain number of pixels. Depending on the encoding method, there may be many possible macroblock segmentations. FIG. 3 presents a case where there are four possible ways to segment a macroblock: macroblock 31A is segmented into blocks 32, macroblock 31B is segmented with a horizontal dividing line into blocks 33, and macroblock 31C is segmented with a vertical dividing line into blocks 34. The fourth possible segmentation is to treat a macroblock as a single block. The macroblock segmentations presented in FIG. 3 are given as examples; they are by no means an exhaustive listing of possible or feasible macroblock segmentations.
The Motion Field Coding block 12 makes the final decisions on what kind of motion vector field is transmitted to the decoder and how the motion vector field is coded. It can modify the segmentation of the current frame, the motion model and motion coefficients in order to minimize the amount of information needed to describe a satisfactory motion vector field. The decision on segmentation is typically carried out by estimating a cost of each alternative macroblock segmentation and by choosing the one yielding the smallest cost. The most commonly used measure of cost is a Lagrangian cost function

$$L(S_k) = D(S_k) + \lambda R(S_k),$$

which links a measure of the reconstruction error D(Sk) with a measure of bits needed for transmission R(Sk) using a Lagrangian multiplier λ. The Lagrangian cost represents a trade-off between the quality of transmitted video information and the bandwidth needed in transmission. In general a better image quality, i.e. small D(Sk), requires a larger amount of transmitted information, i.e. large R(Sk).
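The cost-based choice between macroblock segmentations can be sketched in Python as follows. The distortion and rate figures are invented for illustration, and the segmentation labels and function names are hypothetical; only the cost function itself comes from the text.

```python
def lagrangian_cost(distortion, rate, lam):
    """L = D + lambda * R for one candidate segmentation."""
    return distortion + lam * rate

def choose_segmentation(candidates, lam):
    """candidates maps a segmentation label to (distortion, rate_in_bits)."""
    return min(candidates,
               key=lambda name: lagrangian_cost(*candidates[name], lam))

# Made-up (distortion, rate) pairs: finer segmentations predict better
# but cost more bits for their motion coefficients.
candidates = {
    "single block": (900.0, 12),
    "horizontal split": (620.0, 26),
    "vertical split": (700.0, 24),
    "four blocks": (400.0, 55),
}

for lam in (5, 10, 30):
    print(lam, choose_segmentation(candidates, lam))
# 5 four blocks
# 10 horizontal split
# 30 single block
```

A small multiplier favors low distortion and hence finer segmentation, while a large multiplier penalizes rate and pushes the choice toward a single block.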
In present systems, which utilize a translational motion model, prediction motion coefficients are typically formed by calculating the median of surrounding, already transmitted motion coefficients. This method achieves fairly good performance in terms of efficient use of transmission bandwidth and image quality. The main advantage of this method is that the prediction of motion coefficients is straightforward.
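Median prediction for the translational model can be sketched in Python as follows. The sketch assumes three neighboring blocks and a component-wise median; which neighbors are used, the function names and the coefficient values are illustrative assumptions.

```python
from statistics import median

def predict_motion(neighbors):
    """Component-wise median of neighboring (a0, b0) coefficient pairs."""
    return (median(v[0] for v in neighbors),
            median(v[1] for v in neighbors))

def refinement(actual, predicted):
    """Refinement coefficients that the encoder would transmit."""
    return (actual[0] - predicted[0], actual[1] - predicted[1])

def decode_motion(predicted, refine):
    """Decoder side: the same prediction plus the received refinement."""
    return (predicted[0] + refine[0], predicted[1] + refine[1])

# Already transmitted coefficients of three neighboring blocks (made up).
neighbors = [(2, -1), (3, 0), (8, -1)]
actual = (4, -1)

predicted = predict_motion(neighbors)   # (3, -1)
refine = refinement(actual, predicted)  # (1, 0)

print(predicted, refine, decode_motion(predicted, refine))
```

The median discards the outlying neighbor (8, -1), so the refinement stays small, and no side information about the prediction needs to be sent because the decoder applies the same rule.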
The more accurately the prediction motion coefficients correspond to the motion coefficients of the segment being predicted, the fewer bits are needed to transmit information about the refinement motion field. It is possible to select, for example among the neighboring blocks, the block whose motion coefficients are closest to the motion coefficients of the block being predicted. The segment selected for the prediction is signaled to the decoder. The main drawback of this method is that finding the best prediction candidate among the already transmitted image segments is a complex task: the encoder has to perform exhaustive calculations to evaluate all the possible prediction candidates and then select the best prediction block. This procedure has to be carried out separately for each block.
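The candidate selection described above can be sketched in Python as follows. The sketch uses the sum of absolute coefficient differences as a rough proxy for the bits needed by the refinement; that proxy, the function name and the values are assumptions made for illustration.

```python
def best_candidate(actual, candidates):
    """Return (index, refinement) for the closest candidate block.

    The index is what would be signaled to the decoder. Every candidate
    is evaluated, which is the exhaustive search noted in the text.
    """
    def refinement_size(c):
        # Assumed proxy for the refinement bit cost.
        return abs(actual[0] - c[0]) + abs(actual[1] - c[1])

    index = min(range(len(candidates)),
                key=lambda i: refinement_size(candidates[i]))
    chosen = candidates[index]
    return index, (actual[0] - chosen[0], actual[1] - chosen[1])

# Coefficients of already transmitted neighboring blocks (made up).
candidates = [(2, -1), (4, 0), (8, -1)]
actual = (4, -1)

index, refine = best_candidate(actual, candidates)
print(index, refine)  # 1 (0, -1)
```

Unlike a fixed prediction rule, this approach must transmit the selected index for every block in addition to the refinement.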
There are systems where the transmission capacity for the compressed video stream is very limited and where the encoding of video information should not be too complicated. For example, wireless mobile terminals have limited space for additional components and, as they are battery-powered, they typically cannot provide computing capacity comparable to that of desktop computers. In radio access networks of cellular systems, the available transmission capacity for a video stream can be as low as 20 kbps. Consequently, there is a need for a video encoding method which is computationally simple, provides good image quality and achieves good performance in terms of required transmission bandwidth. Furthermore, to keep the encoding method computationally simple, the encoding method should provide satisfactory results using simple motion models.