The invention relates to a method of estimating motion between images forming a sequence P(t-n), P(t-n+1), . . . , P(t-2), P(t-1), P(t), . . . , and available in the form of a sequence S(t-n), S(t-n+1), . . . , S(t-2), S(t-1), S(t), . . . , of segmented images or partitions composed of I regions R.sub.i identified by labels, said method comprising, for supplying for each region R.sub.i an information M.sub.i (t) representative of the motion of the current image P(t) with respect to the previous image P(t-1), the following three operations, performed for each region of said current image:
(1) a first step of initializing motion parameters of each region R.sub.i of P(t) as a function of the images P(t-1), P(t) before segmentation and S(t-1), S(t) after segmentation, and of the motion information M.sub.i (t-1) estimated for the previous image P(t-1) in a previous performance of the method;
(2) a second step for an intermediate processing of the images on which the estimation of the motion is performed, and a third refining step for the definitive determination of said motion parameters in the form of a vector (Dx, Dy) for all the pixels of each of the regions R.sub.i, in such a way that, for each coordinate point (x,y) of the region, L(x,y,t)=L(x-Dx, y-Dy, t-1), L(.) designating the luminance or a more complex video signal and Dx, Dy being polynomials the degree of which is related to the type of motion of the region;
(3) the iterative repetition of said second and third steps of intermediate processing and refinement, until the end of this iterative process as a function of at least a given criterion so as to finally obtain the motion information M.sub.i (t).

(a) two parameters are sufficient for defining the translation of a plane facet parallel to the image in a plane parallel to the image:
  Dx=a.sub.1
  Dy=a.sub.2
(b) for a motion of the zoom and/or panning type, four parameters are necessary for modelling the motion of translation of a plane facet parallel to the plane of the image if the facet has an arbitrary orientation, or has an arbitrary translation motion of this facet if it is parallel to the plane of the image:
  Dx=a.sub.1 +a.sub.2 x+a.sub.3 y
  Dy=a.sub.4 -a.sub.3 x+a.sub.2 y
(c) for a related (i.e. affine) transform, six parameters are necessary for modelling a translation motion as indicated under (b) above or a rotational motion of a plane facet around an axis perpendicular to the plane of the image:
  Dx=a.sub.1 +a.sub.2 x+a.sub.3 y
  Dy=a.sub.4 +a.sub.5 x+a.sub.6 y
(d) for a quadratic motion, twelve parameters are necessary for modelling arbitrary rotations and translations of the curved facets:
  Dx=a.sub.1 +a.sub.2 x+a.sub.3 y+a.sub.4 x.sup.2 +a.sub.5 xy+a.sub.6 y.sup.2
  Dy=a.sub.7 +a.sub.8 x+a.sub.9 y+a.sub.10 x.sup.2 +a.sub.11 xy+a.sub.12 y.sup.2

(A) exploiting at best the information which is initially available at the input, and possibly gathering new information for deducing probable motion hypotheses;
(B) for each of said motion hypotheses (expressed hereinafter) and based on known data for the image P(t-1), predicting the region concerned in P(t) and computing the corresponding prediction error;
(C) selecting, as initial values of the motion parameters, those values which generate the smallest prediction error (which simultaneously validates one specific motion hypothesis).

(a) the original images P(t-1) and P(t);
(b) the images of the labels S(t-1) and S(t);
(c) the data M.sub.i (t-1), i.e. the motion information (type of motion and values of corresponding parameters), including the case where the motion is zero, permitting, during the previous cycle (i.e. based on the image P(t-2)) to know the motions which lead to the image P(t-1);
(d) the data resulting from performing the BMA technique (described above) between the images P(t-1) and P(t), namely a displacement vector defined for each pixel of P(t), generally with a precision of about one pixel (or possibly about half a pixel).

(1) the motion of the region R.sub.i is zero;
(2) the label i considered already existed in S(t-1) and the motion of the region is only a translation parallel to the image plane: it is then sufficient to compute the coordinates of the center of gravity of the label i in S(t-1) and S(t) and then the difference between these coordinates, which yields the displacement vector;
(3) the label i considered already existed in S(t-1): j designating each label near i in S(t-1) and the data M.sub.i (t-1) and M.sub.j (t-1) being expressed in the local reference associated with i, all the labels j neighboring i in S(t-1) are searched and then the corresponding data M.sub.j (t) are read (type of motion and values of corresponding parameters converted in this local reference), and the best motion between the label i and its neighboring labels is chosen;
(4) the considered motion is that which corresponds to the best approximation of the field of displacement vectors resulting from an adaptation of the BMA technique per region (only the displacements computed for blocks of more than half the number included in the region i considered being taken into account), said adaptation consisting of a sequence of translational motion estimations of blocks having a variable size and of relaxations so as to subsequently determine an approximation of the field of vectors by means of a more complex model and with the aid of a regression technique (this method of adapting the polynomial from a set of values is similar to the polynomial approximation method described, for example for encoding textured images, in the article "Coding of arbitrarily shaped image segments based on a generalized orthogonal transform", by M. Gilge, T. Engelhardt and R. Mehlan, "Signal Processing: Image Communication", Vol. 1, No. 2, October 1989, pp. 153-180, which example is not limitative).

(a) computation of the non-integral coordinates of this pixel at the instant (t-1):
(b) computation of the luminance and its coordinates in P(t-1): the luminance is interpolated in the present case by using a bicubic monodimensional filter having a length 5 with a precision of 1/16th of a pixel (the interpolation is effected horizontally and vertically with the same filter, by means of an operation referred to as mirroring at the edges of the image in the case of pixels at the edge of the image so as to obtain two luminance values, the mean value of which is preserved).
The invention also relates to a device for implementing said method.
The invention is particularly suitable for encoding video signals in the field of very low bitrates and in the field of low bitrates up to approximately 1 Mbit/second. This range of bitrates notably corresponds to consumer applications, often termed multimedia applications.
European patent application EP 0771115 (PHF96534) describes a method and device, the main characteristics of which are recalled below. Before doing that, the notations used throughout the description are first indicated. The images considered here form part of an original sequence of textured images, denoted P(t-n), P(t-n+1), . . . , P(t-2), P(t-1), P(t), etc. In the description which follows, the focus is principally on P(t), the current image, and on P(t-1), which is the previous original image (or the previous encoded image, in accordance with the characteristics of the encoding process performed after the motion estimation). From a practical point of view, these two textured images P(t-1) and P(t), between which the motion estimation is carried out at the time t, are the images of the luminance signal in this case, but may also correspond either to a combination of luminance and chrominance signals, in case the color contains much specific information, or to any other transform of the original images that conveys the information of the signal. The value of the luminance at an arbitrary point (x,y) in one of the images P(t-2), P(t-1), P(t), etc. is denoted L(x,y,t-2), L(x,y,t-1), L(x,y,t), etc. Concerning the operation of segmenting the images, a sequence of images of labels (these images are also referred to as partitions) corresponds to the sequence of original images: the segmented images are referred to as S(t-2), S(t-1), S(t), etc., correspond to the original textured images P(t-2), P(t-1), P(t), etc., and form a sequence of a certain type of images resulting from a pre-analysis which is required to carry out the motion estimation method. FIG. 1 illustrates an example of segmenting an image into seven regions R.sub.i, with i=0 to 6.
The information relating to the motion of the current image P(t) with respect to the previous image P(t-1) is arranged under the reference M.sub.i (t) for the region R.sub.i of the image P(t). This reference M.sub.i (t) includes the data constituted by the type of motion retained (i.e. the order or degree of the polynomial representing this motion) and the values of the corresponding parameters (i.e. the values of the coefficients of the polynomials). For example, as illustrated in FIG. 2, a displacement vector (Dx,Dy) from P(t-1) to P(t), with Dx and Dy being polynomials whose coefficients are the motion parameters, corresponds to a type of motion determined between the images P(t-1) and P(t) at a point (x,y) of a region R.sub.i of P(t). This can be written as L(x,y,t)=L(x-Dx,y-Dy,t-1). The degree of these polynomials (0, 1 or 2) and the number of coefficients defining them (from 2 to 12 parameters) depend on the type of motion considered:
These polynomial models have been adopted because it is possible to show that they represent the motion of objects in a satisfactory manner. However, they cannot be interpreted as a strict description of the real three-dimensional motion of these objects. For this purpose, it is necessary to have the certainty that the objects are rigid and that their form is also known, which is not the case. The models in question are thus simply a representation of the deformation of the projection of the objects in the image plane (for example, in the case of two parameters, the model effectively describes a translation in the image plane, supposing that the object concerned is rigid and has a plane surface which is parallel to the image plane). A detailed representation of these motion models is given, for example in the article "Differential methods for the identification of 2D and 3D motion models in image sequences", J. L. Dugelay and H. Sanson, Signal Processing: Image Communication, Vol. 7, No. 1, March 1995, pp. 105-127.
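As an illustration, the four polynomial models (2, 4, 6 and 12 parameters) can be evaluated at a point as sketched below. The function name `displacement` and the convention of selecting the model by the number of parameters are assumptions made for this sketch and do not come from the cited application.

```python
def displacement(params, x, y):
    """Return (Dx, Dy) at point (x, y) for a 2-, 4-, 6- or 12-parameter model.

    params holds a1..aN in order; the model is chosen from N (an assumption
    of this sketch, the description stores the motion type explicitly).
    """
    a = list(params)
    n = len(a)
    if n == 2:    # translation
        return a[0], a[1]
    if n == 4:    # zoom and/or panning
        return a[0] + a[1] * x + a[2] * y, a[3] - a[2] * x + a[1] * y
    if n == 6:    # six-parameter (affine) model
        return a[0] + a[1] * x + a[2] * y, a[3] + a[4] * x + a[5] * y
    if n == 12:   # quadratic model
        dx = a[0] + a[1] * x + a[2] * y + a[3] * x * x + a[4] * x * y + a[5] * y * y
        dy = a[6] + a[7] * x + a[8] * y + a[9] * x * x + a[10] * x * y + a[11] * y * y
        return dx, dy
    raise ValueError("expected 2, 4, 6 or 12 parameters")
```

The pixel at (x,y) in P(t) is then matched with the position (x-Dx, y-Dy) in P(t-1).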
The coordinates of a point in an image are denoted throughout the description in capitals (X,Y) or not (x,y), dependent on whether they are expressed with respect to a global reference related only to the image or with respect to a local reference related to a given region of the image.
It is here useful also to recall that, in this case, the objective of motion estimation is to provide the possibility of subsequently restoring a predicted image R(t) constituting an approximation of P(t), based on the segmented image S(t), on the previously restored image R(t-1) corresponding to P(t-1), and on information M.sub.i (t) obtained during the motion estimation. Such a determination of R(t) provides, inter alia, the subsequent possibility of encoding the prediction error only, i.e. the difference between P(t) and R(t).
The method presented in the document previously cited can now be described in detail with reference to FIG. 3. It comprises successively a first step 10 of initializing the motion parameters (INIT), a second pre-processing step 20 (PPRO), and a third step 30 of refining the motion parameters (REFT), each one being performed for each region of the current image.
The first step 10 of initializing the motion parameters serves to start the motion estimation process, for each region R.sub.i of the image P(t) considered, with motion parameter values which are as close as possible to the final real values of these parameters, so that the motion parameters can be assumed to vary only slightly throughout the processing operations envisaged. Moreover, the execution time required for obtaining a given estimation quality is, on average, shorter if the estimation process starts with parameter values close to the real values sought, and the additional execution time of said first initialization step is negligible with respect to that of the estimation itself. A good initialization also guards against convergence to a local minimum in the course of these processing operations, which, as will be seen, are implemented in an iterative manner; such a convergence would be more likely if the initial values were too remote from said real values.
In three sub-steps referenced (A), (B) and (C) and performed for each region R.sub.i of the image P(t), the first step thus consists of:
The first sub-step (A) of the step 10 of initialization INIT consists of exploiting the initial data, which are:
For each region R.sub.i of P(t), four motion hypotheses with respect to the previous image have successively been taken into consideration in this case, taking their complementarity and simplicity of formulation into account in view of the available information:
The second sub-step (B) of the step 10 for initialization INIT consists of predicting, on the basis of P(t-1), the corresponding region in P(t), taking into account each motion hypothesis effected during the sub-step (A), and of subsequently computing each time the prediction error for the region. The following prediction principle is used: with P(t-1), S(t) and M.sub.i (t) being known, the predicted luminance value in the image P(t) is determined at a position (X,Y) of a pixel. The detailed description of the prediction will be given hereinafter in the part dealing with the refinement, for the definitive estimation of the motion parameters.
Finally, the third sub-step (C) of the step 10 for initialization INIT consists of comparing the computed prediction errors in a region and of preserving, as initial motion of the region, that one to which the smallest prediction error corresponds. The process is repeated for each region and the first initialization step 10 is ended when the motion parameters have thus been adjusted for all the regions of P(t). The set of initial parameters thus determined for a region R.sub.i is designated by the reference ##EQU1## in FIG. 3.
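The three sub-steps can be sketched as follows for two of the four hypotheses only (zero motion and center-of-gravity translation). This is a simplified illustration: images are lists of rows, displacements are rounded to whole pixels, the label is assumed to exist in both partitions, and the names `centroid`, `region_error` and `init_motion` are hypothetical.

```python
def centroid(labels, i):
    """Center of gravity (x, y) of label i in a partition given as a list of rows."""
    pts = [(x, y) for y, row in enumerate(labels) for x, v in enumerate(row) if v == i]
    n = len(pts)  # assumed > 0: the label must exist in this partition
    return sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n


def region_error(prev, cur, labels, i, dx, dy):
    """Sum of squared prediction errors over region i for an integer translation."""
    h, w = len(cur), len(cur[0])
    err = 0
    for y in range(h):
        for x in range(w):
            if labels[y][x] == i:
                px = min(max(x - dx, 0), w - 1)  # clamp to the closest image point
                py = min(max(y - dy, 0), h - 1)
                err += (cur[y][x] - prev[py][px]) ** 2
    return err


def init_motion(prev, cur, s_prev, s_cur, i):
    """Sub-steps (A)-(C): build hypotheses, predict, keep the smallest error."""
    cx0, cy0 = centroid(s_prev, i)
    cx1, cy1 = centroid(s_cur, i)
    hypotheses = [(0, 0), (round(cx1 - cx0), round(cy1 - cy0))]
    return min(hypotheses, key=lambda d: region_error(prev, cur, s_cur, i, *d))
```

With a region that has moved one pixel to the right between P(t-1) and P(t), the center-of-gravity hypothesis gives a zero prediction error and is retained.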
The second intermediate processing step 20 has for its object to facilitate the estimation of definitive motion parameters obtained at the end of the third and last step. Without this being the only possibility, an essential processing operation for obtaining this objective is to modify the luminance signal so as to bring it closer to the theoretical ideal (a first-order function) i.e. so as to verify the mathematical hypothesis required by the theory in order to obtain a convergence of the estimation process. This processing operation consists of a filtering of P(t-1) and of P(t), for example by using an isotropic Gaussian filter in the four directions of the plane (S(t-1), S(t), M.sub.i (t) are not modified). This choice of filter ensures a very good compromise between a smoothing of the contours, which is useful for simplifying the luminance signal in the image and facilitating the convergence by avoiding the local minima as much as possible, and the maintenance of a sufficient localization of these contours in the image (it is desirable to preserve enough details of the image in order that the precision of the estimated motion is sufficient). The filtered images are designated by the references P'(t-1) and P'(t) in FIG. 3. The set of motion parameters available after this preprocessing operation realized during step 20 is designated in FIG. 3 by the reference M.sub.23.sup.i (t).
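A minimal sketch of such a smoothing is given below. The description does not specify the filter parameters, so the standard deviation `sigma`, the kernel `radius`, the separable row-then-column implementation and the mirroring at the image borders are all assumptions of this example (the radius is assumed smaller than the image dimensions).

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    k = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]


def mirror(i, n):
    """Reflect index i into [0, n-1] (valid while the overshoot is below n)."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def smooth(img, sigma=1.0, radius=2):
    """Filter the rows, then the columns, of an image given as a list of rows."""
    k = gaussian_kernel(sigma, radius)
    h, w = len(img), len(img[0])
    offsets = range(-radius, radius + 1)
    rows = [[sum(k[j + radius] * img[y][mirror(x + j, w)] for j in offsets)
             for x in range(w)] for y in range(h)]
    return [[sum(k[j + radius] * rows[mirror(y + j, h)][x] for j in offsets)
             for x in range(w)] for y in range(h)]
```

Applying `smooth` to both P(t-1) and P(t) would give the images denoted P'(t-1) and P'(t); the partitions and the motion information are left untouched.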
The third step 30 of refining the motion parameters, which step is iterative, has for its object to effect the definitive estimation of the motion parameters for the region concerned. The iterative process performed during this step ends at a given criterion, for example when a number of iterations fixed in advance is reached (other criteria may also be proposed, such as stopping the iterations when the motion compensation based on the current motion estimate reaches a sufficient quality or when the improvement brought by a new iteration becomes negligible, or even a combination of several criteria).
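A combination of two such criteria (a fixed maximum number of iterations and a negligible improvement) might be organized as in the following sketch. The callables `step` and `error`, and the thresholds `max_iter` and `min_gain`, are hypothetical placeholders for the refinement and prediction-error computations.

```python
def refine(step, error, params, max_iter=10, min_gain=1e-3):
    """Iterate until max_iter is reached or the error gain becomes negligible.

    step(params)  -> refined parameters (one refinement iteration)
    error(params) -> prediction error for these parameters
    """
    best, best_err = params, error(params)
    for _ in range(max_iter):
        cand = step(best)
        e = error(cand)
        gain = best_err - e
        if gain > 0:              # keep the candidate only if it is better
            best, best_err = cand, e
        if gain < min_gain:       # negligible (or negative) improvement: stop
            break
    return best, best_err
```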
It should be here recalled that, for each region of P(t), a vector (Dx,Dy) is searched so that, for each point in the region, L(x,y,t)=L(x-Dx,y-Dy,t-1), in which Dx and Dy are polynomials of a degree related to the type of motion for the region considered. The equality between these two terms L(.) can only be realized in a more or less approximative manner in accordance with the degree of quality of the motion estimation. In order that this approximation is as satisfactory as possible, the criterion used is the one for determining the smallest quadratic error: the sum of the square values of the prediction errors in the pixels of the region is minimized, i.e. the following expression:
  .SIGMA.(L(x,y,t)-L(x-Dx,y-Dy,t-1)).sup.2   (1)
for x and y taking all the values corresponding to the coordinates of the pixels in the region R.sub.i considered. This expression (1) is denoted in an abbreviated form in the following manner (DFD=Displaced Frame Difference):
  .SIGMA..sub.x,y (DFD(x,y,Dx,Dy)).sup.2   (2)
It is known that such a minimization of expression (2) may notably be effected by means of the Gauss-Newton method, for which Dx=(Dx.sub.o +dx) and Dy=(Dy.sub.o +dy), with dx and dy being very small with respect to Dx.sub.o and Dy.sub.o respectively. By first-order development, we obtain: ##EQU2## The expression (1) to be minimized thus becomes: ##EQU3## In this expression (4), the derivatives with respect to each of the coefficients of dx and dy are set equal to 0 in order to characterize the minimum of this expression with respect to the motion parameters, which leads to a set of n equations with n unknown quantities.
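The displayed expressions (3) and (4) are not reproduced in this text. Under the standard first-order (Gauss-Newton) linearization, they would plausibly take the form below; this is a reconstruction, not the original wording, and the symbols G_x and G_y (the horizontal and vertical luminance gradients of the previous image, taken at the displaced position) are names introduced here.

```latex
\begin{align}
L(x - Dx,\, y - Dy,\, t-1) &\approx L(x - Dx_0,\, y - Dy_0,\, t-1) - dx\, G_x - dy\, G_y \tag{3} \\
\sum_{x,y} \bigl( DFD(x, y, Dx, Dy) \bigr)^2 &\approx \sum_{x,y} \bigl( DFD(x, y, Dx_0, Dy_0) + dx\, G_x + dy\, G_y \bigr)^2 \tag{4}
\end{align}
% with G_x = \partial L / \partial x and G_y = \partial L / \partial y,
% both evaluated at (x - Dx_0, y - Dy_0, t-1)
```

Since dx and dy are themselves polynomials in (x,y), expression (4) is quadratic in their coefficients, so setting its partial derivatives to zero yields a linear system in those coefficients.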
The solutions of this set of equations are the variations of the motion parameters leading to the smallest quadratic error. To solve it, the set is expressed in matrix form:
  [A].[x]=[B]   (5)
The vector x represents the searched parameters and the terms of the matrices depend on the coordinates of the pixels of the current image, on the horizontal and vertical gradients at the previous positions of the pixels (in the previous image) and on the luminance values at these positions in the current and the previous image. For each region R.sub.i of S(t) and at each iteration, the matrices A and B must be constructed, the matrix A must be inverted and the inverse matrix thus obtained must be multiplied by the matrix B to obtain the vector solution x: the motion information (for the region R.sub.i) may then be updated by adding components of this vector solution x to the previous expression of this motion information.
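For the simplest case of a two-parameter translation, one such update can be sketched as follows. The per-pixel tuples `(e, gx, gy)` (displaced frame difference at the current estimate, horizontal gradient, vertical gradient) are a hypothetical input format, and the 2x2 system is solved in closed form here rather than by a general matrix inversion.

```python
def gauss_newton_step(samples):
    """Solve [A].[x]=[B] for a translation update (dx, dy).

    samples: iterable of (e, gx, gy), where e is the displaced frame
    difference at the current estimate and (gx, gy) are the gradients of
    the previous image at the displaced position.
    Returns None when A is (numerically) singular.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for e, gx, gy in samples:
        a11 += gx * gx
        a12 += gx * gy
        a22 += gy * gy
        b1 -= e * gx
        b2 -= e * gy
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        return None               # singular: the motion is not refined
    dx = (a22 * b1 - a12 * b2) / det
    dy = (a11 * b2 - a12 * b1) / det
    return dx, dy
```

The returned (dx, dy) would be added to the current estimate (Dx.sub.o, Dy.sub.o). When all gradients in the region point in the same direction, A is singular and no update is produced.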
For constructing the matrix A, the luminance values must be computed at the points in the previous image corresponding to previous positions of the points in the current image, of which the motion and the positions in this current image are known--it is the prediction operation described below--and the values of the horizontal gradient and the vertical gradient at similar points must be computed. The computations must be performed on values of parameters expressed in their local reference (i.e. related to the region). For each region, two sets of parameters are stored, on the one hand, the parameters during estimation, while converging, which is denoted M.sub.i.sup.cv (t), and, on the other hand, the parameters giving the best results for the region, which is denoted M.sub.i.sup.f (t). At the start of the refinement step 30, these two motions are equal to the initial motion ##EQU4## originating from the initialization of the processed parameters. Subsequently, M.sub.i.sup.cv (t) is refined in an iterative manner and substituted at the end of each iteration for M.sub.i.sup.f (t), which corresponds to the motion giving the best results among the following motions for the region under study: either the parameters M.sub.i.sup.cv (t) correctly computed for the current region, or the parameters M.sub.i.sup.cv (t) of the regions j neighboring i in S(t), which motions are reconverted in the local reference corresponding to the region i. Finally, this retained motion may give rise to a propagation towards the neighboring regions: for these neighboring regions, the search for the smallest prediction error on the basis of this motion is resumed, which is selected if it effectively leads to a smaller prediction error, and so forth. At the output of each iteration of the refinement step, the motion information determined for each region i of S(t) is sent towards the input of step 20 (parameters designated by ##EQU5## in FIG. 3).
The prediction operation which is necessary for the construction of the matrix A is now described. Given a pixel of coordinates (X,Y), the prediction enables the determination of the predicted luminance value at this position at the instant t, denoted L(X,Y,t), based on S(t), P(t-1) and M(t). This operation, which is performed at each point of the image, consists of two steps:
search of the label i of the region which the pixel belongs to by reading the image of the labels S(t) at the position (X,Y);
for this pixel, selection of its motion information (type of motion and value of the parameters) by reading M.sub.i (t) for this label;
computation of the displacement (Dx,Dy) of the pixel as a function of its coordinates, of the type of motion and the value of the parameters of its region (for example, in the case of a related (affine) motion, for which there are 6 parameters, one has: (Dx,Dy)=(a.sub.1 +a.sub.2.X+a.sub.3.Y,a.sub.4 +a.sub.5.X+a.sub.6.Y) if the motion parameters are expressed with respect to the global reference, or (dx,dy)=(a.sub.1 +a.sub.2.x+a.sub.3.y,a.sub.4 +a.sub.5.x+a.sub.6.y) if the motion parameters are expressed with respect to a local reference at their region), said displacement thus providing the possibility for this pixel of deducing its coordinates at (t-1): (X-Dx,Y-Dy) (if these coordinates are outside the image, one takes the coordinates of the closest point associated with the image, but the coordinates found are not necessarily integral values because the parameters are not, and thus an interpolation must be carried out for deducing the luminance at this point).
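The prediction of one pixel can be sketched as follows for a six-parameter motion expressed in the global reference. For brevity, a bilinear interpolation is substituted for the length-5 bicubic filter of the description; the function name `predict_pixel` and the dictionary `motion` mapping each label to its parameters are assumptions of this sketch.

```python
import math


def predict_pixel(X, Y, s_cur, prev, motion):
    """Predicted luminance at (X, Y) from S(t), P(t-1) and M(t).

    motion: dict label -> [a1..a6], six-parameter model in the global reference.
    """
    label = s_cur[Y][X]                      # read the label in S(t)
    a = motion[label]                        # motion information of the region
    dx = a[0] + a[1] * X + a[2] * Y          # displacement of the pixel
    dy = a[3] + a[4] * X + a[5] * Y
    h, w = len(prev), len(prev[0])
    px = min(max(X - dx, 0.0), w - 1.0)      # closest point inside the image
    py = min(max(Y - dy, 0.0), h - 1.0)
    x0, y0 = int(math.floor(px)), int(math.floor(py))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = px - x0, py - y0
    # bilinear interpolation (stand-in for the bicubic filter)
    top = (1 - fx) * prev[y0][x0] + fx * prev[y0][x1]
    bot = (1 - fx) * prev[y1][x0] + fx * prev[y1][x1]
    return (1 - fy) * top + fy * bot
```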
For the computations of the gradient, it is necessary to interpolate their values, similarly as for the luminance. To ensure the coherence of this operation with the interpolation used during prediction, the filter used is derived from that used for the luminance, of length 5, with a precision at 1/32nd of a pixel. The interpolation technique is the same as for the luminance, except that the values resulting from the horizontal and vertical filtering operations are used independently and are thus not averaged (the same mirroring operation as for the luminance may be performed).
In theory, the matrix A must be constructed with the aid of a sum of terms at all the points of the region. Certain points may be error factors (association with a small object or parasitic motion, with an uncovered zone . . . ). A simple restriction of selecting the points is to preserve only the points (x,y) whose motion actually estimated is such that S(x,y,t)=S(x-Dx,y-Dy,t-1). This restriction is the more efficient as the segmentation is more coherent with the contents of the image.
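This selection of points can be sketched as below, under the simplifying assumption of an integer translation (dx, dy) for the whole region; the function name `reliable_points` is hypothetical, and points whose previous position falls outside the image are simply discarded here.

```python
def reliable_points(s_cur, s_prev, i, dx, dy):
    """Points (x, y) of region i with S(x,y,t) == S(x-dx, y-dy, t-1)."""
    h, w = len(s_cur), len(s_cur[0])
    keep = []
    for y in range(h):
        for x in range(w):
            if s_cur[y][x] != i:
                continue
            px, py = x - dx, y - dy          # previous position of the point
            if 0 <= px < w and 0 <= py < h and s_prev[py][px] == i:
                keep.append((x, y))
    return keep
```

Only the points returned would contribute to the sums that build the matrix A.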
Once the matrix A is constructed, it is tested whether it is singular or not. If it is not, an inversion by means of the method referred to as the Householder method is performed. If it is, the motion is not refined: the motion parameters (the information M.sub.i.sup.cv) keep the values they had at the start of this iteration and one proceeds directly to the selection of the better motion. The region is predicted by using both the information M.sub.i.sup.cv (t) which has been supplied in the refinement step and the information M.sub.j (t) corresponding to every region j neighboring i in S(t) and expressed in the local reference at i. The prediction error in the region is computed each time. If all the errors are higher than those corresponding to the previous information M.sub.i.sup.f (t), this better motion remains unchanged. If not, the motion generating the smallest prediction error is definitively retained, and the corresponding new information is denoted M.sub.i.sup.f (t). As has been seen above, a controlled propagation of the retained motion is possible. For each region R.sub.j adjacent to R.sub.i, the prediction error in this region is computed on the basis of parameters M.sub.i.sup.f (t), and M.sub.j.sup.f (t) is replaced by M.sub.i.sup.f (t) if this error is smaller than the prediction error obtained from M.sub.j (t).
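The selection of the better motion and its propagation to adjacent regions can be sketched as follows. The dictionaries `m_cv`, `m_f` and `neighbors` and the callable `error(region, motion)` are hypothetical; the conversion of parameters between local references is omitted (all motions are assumed to be expressed in a common reference), and the neighbors' converging parameters are used as candidates.

```python
def select_and_propagate(i, neighbors, m_cv, m_f, error):
    """Update m_f[i] with the best candidate, then propagate it to neighbors.

    error(region, motion) -> prediction error of `motion` applied to `region`.
    """
    best, best_err = m_f[i], error(i, m_f[i])
    for cand in [m_cv[i]] + [m_cv[j] for j in neighbors[i]]:
        e = error(i, cand)
        if e < best_err:                     # strictly smaller error only
            best, best_err = cand, e
    m_f[i] = best
    for j in neighbors[i]:                   # controlled propagation
        if error(j, best) < error(j, m_f[j]):
            m_f[j] = best
    return m_f
```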
With such an estimation scheme, the initialization step makes it possible to use all the available information in order to start the estimation not too far from the motion to be found, while the preprocessing step simplifies the luminance signal (so that the convergence avoids local minima: an image pyramid is built by successive filtering operations, the first estimation occurring between a pair of strongly filtered images) and the refinement step works on the more detailed images.