The present invention relates to a video coding system. In particular, it relates to a system for the compression of video sequences using motion compensated prediction.
The schematic diagram of a system using motion compensated prediction is shown in FIG. 1 and FIG. 2 of the accompanying drawings. FIG. 1 illustrates an encoder having a motion estimation block and FIG. 2 illustrates a corresponding decoder. Motion compensated prediction in such a system is outlined below.
In typical video sequences, changes in the content of successive frames are to a great extent the result of motion in the scene. This motion may be due to camera motion or due to motion of the objects depicted in the scene. Therefore typical video sequences are characterized by significant temporal correlation, which is highest along the trajectory of the motion. Efficient compression of video sequences requires exploitation of this property.
Motion Compensated (MC) prediction is a widely recognized technique for compression of video. It utilizes the fact that in a typical video sequence, the image intensity values in a particular frame can be predicted using the image intensities of some other already coded and transmitted frame, given the motion trajectory between these two frames.
The operating principle of motion compensated video coders is to minimize the prediction error En(x,y), i.e., the difference between the frame being coded In(x,y) called the current frame and the prediction frame Pn(x,y) (FIG. 1):
En(x,y)=In(x,y)−Pn(x,y)  (1)
The prediction error En(x,y) is compressed, and the compression process typically introduces some loss of information. The compressed prediction error, denoted {overscore (E)}n(x,y), is sent to the decoder. The prediction frame Pn(x,y) is constructed by the motion compensated prediction block in FIG. 1 and FIG. 2. The prediction frame is built using pixel values of the reference frame, denoted Rn(x,y), and the motion vectors of pixels between the current frame and the reference frame, using the formula
Pn(x,y)=Rn[x+Δx(x,y), y+Δy(x,y)].  (2)
The reference frame is one of the previously coded and transmitted frames (e.g. the frame preceding the one being coded) which at a given instant is available in the Frame Memory of the encoder and of the decoder. The pair of numbers [Δx(x,y), Δy(x,y)] is called the motion vector of the pixel in location (x,y) in the current frame. Δx(x,y) and Δy(x,y) are the values of the horizontal and vertical displacements of this pixel, respectively. Motion vectors are calculated by the motion estimation block in the encoder shown in FIG. 1. The set of motion vectors of all pixels of the current frame, [Δx(·), Δy(·)], is called the motion vector field and is transmitted to the decoder.
In the decoder, pixels of the coded current frame {overscore (I)}n(x,y) are reconstructed by finding the prediction pixels in the reference frame Rn(·) using the received motion vectors and by adding the received prediction error {overscore (E)}n(x,y), i.e.,
{overscore (I)}n(x,y)=Rn[x+Δx(x,y), y+Δy(x,y)]+{overscore (E)}n(x,y)  (3)
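Equations (1) to (3) can be illustrated with the following minimal sketch (in Python with NumPy). It assumes integer motion vectors and clipping at the frame border, conventions chosen here for simplicity rather than mandated above:

```python
import numpy as np

def mc_predict(ref, dx, dy):
    # Equation (2): each prediction pixel (x, y) is fetched from the
    # reference frame at (x + dx(x, y), y + dy(x, y)).
    # Coordinates are clipped at the frame border for simplicity.
    h, w = ref.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sx = np.clip(xs + dx, 0, w - 1)
    sy = np.clip(ys + dy, 0, h - 1)
    return ref[sy, sx]

# Hypothetical 4x4 frames: the current frame is the reference shifted
# right by one pixel, so the true motion vector is (-1, 0) everywhere.
ref = np.arange(16.0).reshape(4, 4)
cur = np.roll(ref, 1, axis=1)
dx = np.full((4, 4), -1)        # points back to the source pixel
dy = np.zeros((4, 4), dtype=int)

pred = mc_predict(ref, dx, dy)  # prediction frame P_n, equation (2)
err = cur - pred                # prediction error E_n, equation (1)
rec = pred + err                # decoder reconstruction, equation (3)
```

With a lossless prediction error, the reconstruction equals the current frame exactly; only the border column, where the shift wraps, carries any prediction error.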
For example, if the transmission channel available for the compressed video bit stream is very narrow, it is possible to discard the prediction error altogether. In that case it is not necessary to compress and transmit the prediction error, and the spare bits from the transmission channel and the spare calculation power can be used for other purposes, e.g., to improve the frame rate of the video signal. Discarding the prediction error leads to defective pixels in the visible video picture, but depending on the demands of the application in use this may be acceptable.
Due to the very large number of pixels in a frame it is not efficient to transmit a separate motion vector for each pixel. Instead, in most video coding schemes the current frame is divided into larger image segments, so that all motion vectors of a segment can be described by a few coefficients. Depending on the way the current frame is divided into segments, two types of motion compensated coders can be distinguished:
1. Block-based coders, where the current frame is divided into fixed and a priori known blocks, e.g., 16×16 pixel blocks in the international standard ISO/IEC MPEG-1 or ITU-T H.261 codecs (FIG. 3a).
2. Segmentation-based (region-based) coders, where the current frame is divided into arbitrarily shaped segments, e.g., obtained by a segmentation algorithm (FIG. 3b). (For examples refer to Centre de Morphologie Mathematique (CMM), “Segmentation algorithm by multicriteria region merging,” Document SIM(95)19, COST 211ter Project Meeting, May 1995 and P. Cicconi and H. Nicolas, “Efficient region-based motion estimation and symmetry oriented segmentation for image sequence coding,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 4, No. 3, June 1994, pp. 357-364.)
In practice segments include at least a few tens of pixels. In order to represent the motion vectors of these pixels compactly, it is desirable that their values be described by a function of a few coefficients. Such a function is called a motion vector field model.
Motion compensated video coding schemes may define the motion vectors of image segments by the following general formulae:

Δx(x,y)=a0f0(x,y)+a1f1(x,y)+ . . . +aN−1fN−1(x,y)  (4)

Δy(x,y)=b0g0(x,y)+b1g1(x,y)+ . . . +bM−1gM−1(x,y)  (5)
where coefficients ai and bi are called motion coefficients and are transmitted to the decoder. Functions fi and gi are called motion field basis functions and have to be known both to the encoder and decoder.
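A minimal sketch of equations (4) and (5), assuming for illustration that both displacement components share the same basis functions; the function names and calling convention are hypothetical, not from the source:

```python
def motion_vector(x, y, a, b, f, g):
    # Equations (4) and (5): the motion vector at (x, y) is a linear
    # combination of the basis functions f_i and g_i, weighted by the
    # motion coefficients a_i and b_i.
    dx = sum(ai * fi(x, y) for ai, fi in zip(a, f))
    dy = sum(bi * gi(x, y) for bi, gi in zip(b, g))
    return dx, dy

# Affine basis (1, x, y), shared by both components in this illustration.
basis = [lambda x, y: 1.0, lambda x, y: float(x), lambda x, y: float(y)]
dx, dy = motion_vector(2, 3, a=[1.0, 0.5, 0.0], b=[0.0, 0.0, 2.0],
                       f=basis, g=basis)
# dx = 1.0 + 0.5*2 = 2.0, dy = 2.0*3 = 6.0
```

Only the coefficients a and b need to be transmitted; the basis functions are fixed and known to both encoder and decoder.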
Polynomial motion models are a widely used family of models. (See, for example, H. Nguyen and E. Dubois, “Representation of motion information for image coding,” in Proc. Picture Coding Symposium ’90, Cambridge, Mass., Mar. 26-28, 1990, pp. 841-845 and Centre de Morphologie Mathematique (CMM), “Segmentation algorithm by multicriteria region merging,” Document SIM(95)19, COST 211ter Project Meeting, May 1995.) The values of motion vectors are described by functions which are linear combinations of 2D polynomial functions. The translational motion model is the simplest of these and requires only two coefficients to describe the motion vectors of each segment. The values of the motion vectors are given by the formula
Δx(x,y)=a0

Δy(x,y)=b0  (6)
This model is used in international standards (ISO MPEG-1, ITU-T Recommendation H.261) to describe the motion of fixed 16×16 blocks. Two other widely used models are the affine motion model, given by the equation:
Δx(x,y)=a0+a1x+a2y

Δy(x,y)=b0+b1x+b2y  (7)
and the quadratic motion model, given by the equation:
Δx(x,y)=a0+a1x+a2y+a3xy+a4x2+a5y2

Δy(x,y)=b0+b1x+b2y+b3xy+b4x2+b5y2  (8)
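The three polynomial models can be viewed as truncations of a single basis. The following sketch (illustrative naming, not from the source) evaluates equation (6), (7) or (8) depending on how many coefficients are supplied:

```python
# Basis functions for the polynomial models: equation (6) uses only the
# first term, equation (7) the first three, and equation (8) all six.
QUADRATIC_BASIS = [
    lambda x, y: 1.0,      # translational term, equation (6)
    lambda x, y: x,        # |
    lambda x, y: y,        # | affine terms, equation (7)
    lambda x, y: x * y,    # |
    lambda x, y: x * x,    # | additional quadratic terms, equation (8)
    lambda x, y: y * y,
]

def displacement(coeffs, x, y):
    # Works for any of the three models: pass 1, 3 or 6 coefficients
    # for the translational, affine or quadratic model respectively.
    return sum(c * f(x, y) for c, f in zip(coeffs, QUADRATIC_BASIS))
```

For example, `displacement([a0], x, y)` is constant over the segment, while six coefficients reproduce the full quadratic field of equation (8).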
The Motion Estimation block calculates the motion vectors [Δx(x,y), Δy(x,y)] of the pixels of a given segment Sk which minimize some measure of prediction error in this segment. A meaningful additive measure of prediction error has the form

Σ(xi,yi)∈Sk pih(|In(xi,yi)−Rn(xi+Δx(xi,yi), yi+Δy(xi,yi))|)  (9)
where the pi's are scalar constants, |.| denotes absolute value, and h is a non-decreasing function. A very popular measure is the square prediction error, in which case pi=1 and h(.)=(.)2:

Σ(xi,yi)∈Sk (In(xi,yi)−Rn(xi+Δx(xi,yi), yi+Δy(xi,yi)))2  (10)
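A direct transcription of the square prediction error (10) for a segment with integer translational motion; the border clipping is a convention chosen here for illustration:

```python
import numpy as np

def square_prediction_error(cur, ref, dx, dy, segment):
    # Equation (10): sum of squared differences between the segment's
    # pixels in the current frame and their motion-compensated reference
    # pixels. 'segment' is a list of integer (x, y) coordinates; dx, dy
    # here are a single translational vector for the whole segment.
    total = 0.0
    h, w = ref.shape
    for x, y in segment:
        sx = min(max(x + dx, 0), w - 1)   # clip at frame borders
        sy = min(max(y + dy, 0), h - 1)
        total += (cur[y, x] - ref[sy, sx]) ** 2
    return total

# Toy data: the current frame is the reference shifted right by one pixel.
ref = np.arange(16.0).reshape(4, 4)
cur = np.roll(ref, 1, axis=1)
segment = [(x, y) for x in range(1, 4) for y in range(4)]
```

The cost is zero for the true motion vector (−1, 0) and positive for any other displacement, which is exactly what the minimization seeks out.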
The cost function (9) is highly nonlinear, and there is thus no practical technique capable of always finding its absolute minimum in finite time. Accordingly, motion estimation techniques differ depending on the algorithm used to minimize the chosen measure of prediction error.
Previously known techniques for motion estimation are discussed below.
One technique is the full search. In this technique the value of the cost function is calculated for all the possible combinations of allowed values of the motion coefficients (restricted by the range and precision with which motion coefficients can be represented). The values of motion coefficients for which the cost function is minimized are chosen to represent the motion vector field.
The full search technique is usually used only to estimate the motion coefficients of the translational motion model and cannot be straightforwardly generalized to other motion models, due to the computational burden: in a straightforward generalization, the computational complexity of the algorithm grows exponentially with the number of motion coefficients used to represent the motion vector field.
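The full search for the two-coefficient translational model (6) can be sketched as follows; the segment, cost and search range conventions are illustrative:

```python
import numpy as np

def full_search(cur, ref, segment, search_range):
    # Evaluate the cost for every allowed integer (a0, b0) of the
    # translational model, equation (6), and keep the minimiser.
    # With N coefficients and R allowed values each, this loop would
    # run R**N times - hence the exponential complexity noted above.
    def cost(dx, dy):
        h, w = ref.shape
        return sum((cur[y, x] - ref[min(max(y + dy, 0), h - 1),
                                    min(max(x + dx, 0), w - 1)]) ** 2
                   for x, y in segment)
    return min(((cost(a0, b0), a0, b0)
                for a0 in range(-search_range, search_range + 1)
                for b0 in range(-search_range, search_range + 1)),
               key=lambda t: t[0])

# Toy data: the current frame is the reference shifted right by one pixel,
# so the search should find the displacement (-1, 0) with zero cost.
ref = np.arange(16.0).reshape(4, 4)
cur = np.roll(ref, 1, axis=1)
segment = [(x, y) for x in range(1, 4) for y in range(4)]
cost_min, a0, b0 = full_search(cur, ref, segment, 2)
```

Even this tiny example evaluates 25 candidate vectors; an affine model searched the same way over 6 coefficients would be computationally prohibitive.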
Motion estimation using Gauss-Newton iterations (or differential optimization schemes) is an alternative. These are outlined in H. Sanson, “Region based motion estimation and compensation for digital TV sequence coding,” in Proc. Picture Coding Symposium ’93, Lausanne, Switzerland, Mar. 17-19, 1993 and C. A. Papadopoulos and T. G. Clarkson, “Motion Compensation Using Second-Order Geometric Transformations”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5, No. 4, August 1995, pp. 319-331. Such techniques use the well-known Gauss-Newton function minimization algorithm to minimize the cost function (9), i.e., the chosen measure of prediction error, as a function of the motion coefficients.
This algorithm assumes that the function to be minimized can be locally approximated by a quadratic function of its arguments. The nth iteration step then consists of:

1. computing the approximate quadratic function using first and second derivatives of the actual function, evaluated at the motion coefficients resulting from the (n−1)th step,

2. computing the coefficient vector minimizing the approximate function, and assigning the result to the motion coefficients of the nth step.
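The two-step iteration above can be sketched on a toy least-squares problem. Note that in the Gauss-Newton variant the second-derivative term of the cost is approximated from the first derivatives (the Jacobian) of the residuals alone:

```python
import numpy as np

def gauss_newton(residual, jacobian, theta, n_iter=10):
    # Each iteration builds the quadratic approximation of the
    # squared-error cost around the current coefficients (step 1) and
    # jumps to the minimiser of that approximation (step 2).
    for _ in range(n_iter):
        r = residual(theta)                 # residuals at step n-1
        J = jacobian(theta)                 # their first derivatives
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)
        theta = theta + step                # coefficients of step n
    return theta

# Toy problem with a known answer: the residuals are linear in theta,
# so a single Gauss-Newton step reaches the minimum exactly.
A = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
b = np.array([2.0, 3.0, 2.0])
theta = gauss_newton(lambda t: A @ t - b, lambda t: A, np.zeros(2))
```

In motion estimation the residuals are the per-pixel prediction errors of (9) and are nonlinear in the motion coefficients, which is exactly why the convergence issues discussed next arise.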
The main problem associated with the Gauss-Newton algorithm is that it converges only towards local minima, unless the initial motion coefficients lie in the attraction domain of the global minimum. Thus it is necessary to feed the Gauss-Newton algorithm with a sufficiently good initial guess of the actual optimum. Two different techniques are usually used to improve the convergence of the Gauss-Newton algorithm:
1. motion estimation using multiresolution image pyramids,
2. motion estimation using hierarchically increasing levels of the motion model.
The technique of motion estimation using multiresolution image pyramids is based on the assumption that low-pass filtering the current frame and the reference frame will erase the local minima and help the algorithm to converge to the global minimum. Motion estimation is performed first on the low-pass filtered (smoothed) versions of the reference and current frames, and the result is fed as input to the motion estimation stages using less smoothed images. The final estimation is performed on non-smoothed images. Some variants of this class additionally down-sample the low-pass filtered images to reduce the amount of computation. (For examples of this technique, see H. Sanson, “Region based motion estimation and compensation for digital TV sequence coding,” in Proc. Picture Coding Symposium ’93, Lausanne, Switzerland, Mar. 17-19, 1993 and P. J. Burt, “The Pyramid as a Structure for Efficient Computation”, in: Multiresolution Image Processing and Analysis, ed. Rosenfeld, Springer Verlag, 1984, pp. 6-35.)
However, low-pass filtering of the images does not necessarily erase local minima. Furthermore, it may shift the location of the global minimum. Down-sampling of the filtered images can additionally cause aliasing. Moreover, convergence becomes more difficult due to the reduction in the number of pixels in the region.
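A minimal sketch of such a pyramid, assuming a 3×3 box filter and down-sampling by 2; the actual filters used in the cited works differ:

```python
import numpy as np

def smooth(frame):
    # 3x3 box low-pass filter with replicated borders - a simple
    # stand-in for the smoothing filters used in practice.
    p = np.pad(frame, 1, mode='edge')
    h, w = frame.shape
    return sum(p[dy:dy + h, dx:dx + w]
               for dy in range(3) for dx in range(3)) / 9.0

def pyramid(frame, levels):
    # Level 0 is the original frame; each coarser level is smoothed and
    # down-sampled by 2. Estimation runs coarse-to-fine, each level's
    # motion result seeding the estimation at the next finer level.
    out = [frame]
    for _ in range(levels - 1):
        out.append(smooth(out[-1])[::2, ::2])
    return out

levels = pyramid(np.arange(64.0).reshape(8, 8), 3)
```

The estimator starts on `levels[-1]` (coarsest) and refines down to `levels[0]`, the original frame.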
By contrast, the technique of motion estimation using hierarchically increasing levels of the motion model makes use of the following assumptions:
1. A complex motion model can be approximated by a lower order motion model.
2. This approximation is a good initial guess for the iterative search of more complex motion model coefficients.
The most common hierarchy starts with the translational model (2 coefficients), then continues with a simplified linear model (corresponding to the physical motion of translation, rotation and zoom, having 4 coefficients), and then proceeds to the complete linear model (6 coefficients), etc. (Such a hierarchy can be seen in P. Cicconi and H. Nicolas, “Efficient region-based motion estimation and symmetry oriented segmentation for image sequence coding,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 4, No. 3, June 1994, pp. 357-364 and H. Nicolas and C. Labit, “Region-based motion estimation using deterministic relaxation schemes for image sequence coding,” in Proc. 1994 International Conference on Acoustics, Speech and Signal Processing, pp. III-265-268.)
These assumptions can work very well under certain conditions. However, convergence to a local minimum is often a problem, especially in the case when the approximation turns out to be a poor one.
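One simple way to seed a higher-order model with a lower-order result, assuming nested polynomial bases as in equations (6) to (8), is to zero-pad the coefficient vector; this convention is illustrative, not prescribed by the sources cited above:

```python
def promote(coeffs, target_count):
    # The extra basis functions start with zero weight, so the promoted
    # coefficient vector reproduces exactly the motion field already
    # found by the lower-order model - a valid initial guess for the
    # iterative search at the next model level.
    return list(coeffs) + [0.0] * (target_count - len(coeffs))

# Translational result (a0,) -> initial guess for the affine model (7),
# and on to the quadratic model (8).
affine_guess = promote([1.5], 3)
quadratic_guess = promote(affine_guess, 6)
```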
Present systems, such as those outlined above, suffer from disadvantages resulting from the trade-off between computational complexity and video compression performance. On the one hand, an encoder requires a motion estimation block of high computational complexity in order to determine motion coefficients which minimize the chosen measure of prediction error (9) and thus achieve the appropriate video compression performance. In this case such a motion estimation block usually becomes the computational bottleneck of the overall encoder, due to the huge number of computations required to reach the solution.
On the other hand, reducing the computational complexity results in a reduction in prediction performance, and thus in video compression performance. Since the prediction performance of motion estimation heavily affects the overall compression performance of the encoder, it is crucial for a motion estimation algorithm to achieve high prediction performance (i.e., low prediction error) with relatively low complexity.
The aforementioned disadvantage is overcome, and other benefits are provided, by the provision of a system according to the present invention in which the encoder comprises at least two different operation modes, and the transmitter further comprises result verification means to select the operation mode of the encoding means.
To keep the complexity low, motion estimation algorithms have to make assumptions about the image data, motion, and prediction error. The more these assumptions hold statistically, and the stronger the assumptions are, the better the algorithm is. Different sets of assumptions usually result in different minimization techniques.
The system according to the present invention achieves statistically low prediction error with relatively little complexity by dynamically switching between statistically valid assumptions varying in strength.
The present invention relates to a system where a digitized video signal is transmitted from a transmitter to a receiver, said transmitter comprises encoding means to compress the video signal to be transmitted, said receiver comprises decoding means to decode the compressed video signal, said encoding means comprise at least two different operation modes, and said transmitter further comprises result verification means to select the operation mode of said encoding means.
The present invention further relates to a corresponding system where the encoding means comprise motion estimation means having at least two different operation modes and the result verification means have been arranged to select the operation mode of said motion estimation means.
According to one aspect of the present invention, there is provided a motion estimation system for a video coder comprising: means for receiving a video frame to be coded; a series of motion estimators of varying complexity, for estimating a motion vector field between the said received frame and a reference frame; and control means for selecting the subsequent motion estimator in the series only if a prediction error associated with the motion vector field estimated by the currently selected motion estimator exceeds a predetermined threshold.
The series of motion estimators may comprise a series of motion models, a series of minimisers, or a combination of both. For example, it may comprise a hierarchy of motion models in order of complexity (e.g. zero motion model, translational motion model, affine motion model, and quadratic motion model). Alternatively, it may comprise a series of minimisers for one particular model (e.g. for a linear model, the series of minimisers could be Quasi-Newton and Newton (proper and/or Taylor expanded)). Another option is a combination of motion models and minimisers (e.g. such as that shown in FIG. 8 of the accompanying drawings).
The predetermined threshold may differ, at different stages of minimisation and/or for different models. (For example see FIG. 8 of the accompanying drawings).
The motion estimation system may further comprise means for smoothing the received frame. Such smoothing means may take the form of a series of low pass filters. Preferably, the control means selects the level of smoothing during minimisation depending on the change in prediction error. For example, if the change is below a threshold for a certain level of smoothing, then at least the next level of smoothing may be skipped. Alternatively, if the change is at, or above, the threshold, then the next level of smoothing may be selected.
There may also be provided a video coder comprising a motion estimation system according to the present invention.
According to another aspect of the present invention, there is provided a motion estimation method for coding a video frame, comprising: receiving a video frame to be coded; estimating a motion vector field between the said received frame and a reference frame, using a motion estimator from a series of motion estimators of varying complexity; determining whether a prediction error associated with the motion vector field estimated by the currently selected motion estimator exceeds a predetermined threshold and, only if so, selecting the subsequent motion estimator in the series.
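The control flow of this method can be sketched as a simple loop; the estimator names and the signature (frame, reference, previous result) are hypothetical, chosen here only for illustration:

```python
def estimate_motion(frame, reference, estimators, thresholds):
    # Try estimators of increasing complexity; move on to the next one
    # only while the current prediction error exceeds that stage's
    # threshold, so cheap estimators terminate the search when they
    # are already good enough.
    result, error = None, float('inf')
    for estimate, threshold in zip(estimators, thresholds):
        result, error = estimate(frame, reference, result)
        if error <= threshold:
            break             # good enough - skip the costlier stages
    return result, error

# Two hypothetical stages: a cheap one that is not accurate enough on
# this (dummy) input, and a costlier one that is.
cheap = lambda f, r, prev: ('translational', 10.0)
costly = lambda f, r, prev: ('affine', 1.0)
result, error = estimate_motion(None, None, [cheap, costly], [5.0, 5.0])
```

Passing each stage's result as the next stage's initial guess is what lets the statistically weaker, cheaper assumptions be abandoned only when the data demands it.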