The invention relates to a method of estimating motion between pictures composed of two interlaced fields divided into blocks of pixels, which method comprises a step of searching the motion vector which is most representative of the motion between two pictures, referred to as optimum frame vector, and a step of searching the motion vector which is most representative of the motion between two fields, referred to as optimum field vector, which comprises, in parallel, two sub-steps each comprising, in series, a search for the field vector by way of block matching and a refinement of the search to within half a pixel.
The invention also relates to a circuit for estimating motion between pictures composed of two interlaced fields divided into blocks of pixels, which circuit comprises means for searching the motion vector which is most representative of the motion between two pictures, referred to as optimum frame vector, and means for searching the motion vector which is most representative of the motion between two fields, referred to as optimum field vector, said means for searching the optimum field vector comprising means for searching a field vector by way of block matching and means for refining such a search to within half a pixel.
The invention finally relates to a coding device comprising such a motion estimation circuit.
This invention is widely used in the field of digital data compression before transmission and/or storage of these data, and particularly for realising a device for coding digital television signals which are compatible with the MPEG2 standard.
It is known that the digital transmission of a television picture at the current definition requires bitrates (and would require still higher bitrates at a higher definition) that call for channels having a very wide band (for example, a sequence of colour pictures at 25 frames per second generates a digital data stream of just over 165 million bits per second). The direct transmission of such a quantity of information cannot be realised economically and, in order that this information can make use of the existing data transmission networks and storage media, the quantity of data is compressed. This compression is obtained essentially by utilizing the strong spatial and temporal correlation between neighbouring pixels, as will hereinafter be described in greater detail.
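The order of magnitude quoted above can be checked with a short calculation. The sketch below assumes an active picture of 720 by 576 pixels, 4:2:2 chrominance sampling and 8 bits per sample; these figures are illustrative assumptions and are not stated in the text.

```python
# Assumed parameters (not given in the text): 720 by 576 active picture,
# 4:2:2 sampling (two samples per pixel on average), 8 bits per sample.
WIDTH, HEIGHT = 720, 576
SAMPLES_PER_PIXEL = 2   # one luminance sample plus U and V at half horizontal rate
BITS_PER_SAMPLE = 8
FRAMES_PER_SECOND = 25


def raw_bitrate(width, height, samples_per_pixel, bits, fps):
    """Return the uncompressed bitrate in bits per second."""
    return width * height * samples_per_pixel * bits * fps


bitrate = raw_bitrate(WIDTH, HEIGHT, SAMPLES_PER_PIXEL,
                      BITS_PER_SAMPLE, FRAMES_PER_SECOND)
print(bitrate)  # 165888000
```

Under these assumptions the result is 165,888,000 bits per second, which agrees with the figure of just over 165 million bits per second.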
First, it will be useful to recall the main characteristics of the MPEG2 standard. A sequence of digital signals in conformity with this standard comprises information relating to the luminance component Y and the chrominance components, or colour difference signals U and V (the grey levels for the luminance Y and the colour levels for the signals U and V are expressed in digital words of 8 bits). In accordance with the input format provided by the MPEG standard, the chrominance is subjected to a sub-sampling by four with respect to the luminance. There are thus two values which are related to the colour (one for U, the other for V) for four luminance values. As the matrices of words of 8 bits are arranged in blocks of 8.times.8 pixels, four adjacent blocks of the matrix Y correspond to one block of the matrix U and one block of the matrix V, and these six blocks combined constitute a macroblock (these blocks and macroblocks are the picture subdivision units which are subjected to coding, as in the example cited hereinbefore). Finally, a regrouped series of macroblocks constitutes a slice and each picture is composed of a given number of slices, for example 36 in the example described.
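The composition of a macroblock described above can be sketched as follows. The function name `split_macroblock` and the raster ordering of the four luminance blocks (top-left, top-right, bottom-left, bottom-right) are assumptions made for illustration.

```python
def split_macroblock(y16, u8, v8):
    """Return the six 8x8 blocks [Y1, Y2, Y3, Y4, U, V] of one macroblock.

    y16 is a 16x16 luminance area; u8 and v8 are the co-located 8x8
    chrominance areas (chrominance sub-sampled by four).
    """
    def block(src, top, left):
        # Cut an 8x8 block out of a larger two-dimensional list.
        return [row[left:left + 8] for row in src[top:top + 8]]

    # Assumed raster order for the luminance blocks.
    y1, y2 = block(y16, 0, 0), block(y16, 0, 8)
    y3, y4 = block(y16, 8, 0), block(y16, 8, 8)
    return [y1, y2, y3, y4, u8, v8]


y_area = [[16 * r + c for c in range(16)] for r in range(16)]
u_area = [[128] * 8 for _ in range(8)]
v_area = [[128] * 8 for _ in range(8)]
macroblock = split_macroblock(y_area, u_area, v_area)
```

The macroblock thus carries 256 luminance values for 128 chrominance values, i.e. two colour values for four luminance values.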
The pictures (the assembly of digital signals corresponding thereto is also concisely referred to as such) are of three types in a stream of MPEG data, dependent on the coding mode applied. The simplest are the pictures I (Intraframe coded pictures) in which all the macroblocks are coded independently of any other picture. If the transmission channel is modified or the bitrate is switched, the decoding end must wait for such a picture of the type I in order to reconstruct the new information resulting from said modification. The pictures P (Predictive coded pictures) constitute a second type of pictures which are predicted by unidirectional motion compensation based on a preceding picture (of the type I or of the type P itself) and which thus only contain macroblocks of the type P or of the type I. Finally, the pictures B (Bidirectionally predictive coded pictures), predicted by bidirectional motion compensation based on a previous picture and a subsequent picture (which are themselves of the type I and/or P), may contain macroblocks of the type I, P or B without distinction.
A data stream comprises six information levels (each level including the lower level and some additional information components). The sequence of pictures corresponds to the highest level. This sequence is composed of a succession of groups of pictures (or GOP) each comprising a given number of pictures. A picture such as is shown in FIG. 1 includes a given number n of picture slices S.sub.1 to S.sub.n, each picture slice includes a given number of macroblocks MB and each macroblock MB comprises a given number r of blocks, here 6 (Y.sub.1, Y.sub.2, Y.sub.3, Y.sub.4, U, V) which constitute the last level.
After the description of the MPEG standard features and before the description of the invention, an example of a state-of-the-art coding device will now be described. Such a device, shown in FIG. 2, comprises a coding channel which includes, in series, a discrete cosine transform circuit 1 (DCT), a quantizing circuit 2 and a variable-length coding circuit 3 (VLC). This DCT transform circuit 1 receives, via a subtracter 19 mentioned below, digital signals which correspond to the input video signals of the coding device (obtained, more specifically, by the difference between these input signals and the predicted signals which are present at the other input of the subtracter) and which are available in the form of blocks, here having the format of 8.times.8 pixels. This circuit 1 converts these signal blocks into blocks of 8.times.8 coefficients whose first coefficient represents the average value of the grey levels of the pixels of the block considered and whose sixty-three other coefficients represent the different spatial frequencies in this block.
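The behaviour of the transform circuit 1 can be illustrated by a direct, unoptimised sketch of a two-dimensional DCT on an 8 by 8 block. The orthonormal scaling used here is an assumption; with it, the first coefficient equals eight times the average grey level of the block, so that it represents this average up to a constant factor.

```python
import math

N = 8  # block size


def dct_8x8(block):
    """Return the 8x8 DCT coefficients of an 8x8 block (orthonormal scaling)."""
    def c(k):
        # Normalisation factor for the zero-frequency term.
        return 1 / math.sqrt(2) if k == 0 else 1.0

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = 0.25 * c(u) * c(v) * total
    return coeffs


# A uniform block has no spatial variation: only the first coefficient is non-zero.
flat = [[100] * N for _ in range(N)]
flat_coeffs = dct_8x8(flat)
```

For the uniform block above, the first coefficient is 800 (eight times the grey level 100) and the sixty-three others vanish to within rounding error.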
The quantizing circuit 2 quantizes each of these output coefficients of the circuit 1. On the one hand, this quantization is related to the position of the coefficient considered in the 8.times.8 block (the high spatial frequencies are less perceptible to the human eye and corresponding coefficients may thus be quantized with a quantization step which is larger and gives a less precise quantization) and on the other hand to a quantization factor which is related to the bitrate. The values resulting from this quantization are then applied to the variable-length coding circuit 3, the output of which is connected in a bitrate control stage 15 to a buffer memory 4 for storing the coded words, whose output constitutes that of the coding device. As a function of the fullness of this memory 4, a bitrate control circuit 5 arranged at the output of said memory applies the quantization factor mentioned above to the quantizing circuit 2, and the value of this factor, related to this fullness, provides the possibility of modifying the quantization step in such a way that said memory 4 neither overflows nor underflows. Such a coding chain with bitrate control is known and will not be further described.
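The two dependencies of the quantization described above (position of the coefficient and bitrate-related factor) can be sketched as follows. The linear weighting `1 + u + v`, the linear mapping from buffer fullness to factor and the range 1 to 31 are hypothetical choices made for illustration; they are not taken from the text or from the MPEG2 standard.

```python
def quantization_factor(fullness, q_min=1, q_max=31):
    """Map buffer fullness in [0, 1] to a quantization factor (assumed linear law)."""
    fullness = min(max(fullness, 0.0), 1.0)
    return q_min + round(fullness * (q_max - q_min))


def quantize(coeffs, factor):
    """Quantize an 8x8 coefficient block.

    The step is the product of a position-dependent weight (larger for the
    high spatial frequencies) and the bitrate-related factor.
    """
    out = []
    for u, row in enumerate(coeffs):
        # int() truncates towards zero.
        out.append([int(c / ((1 + u + v) * factor)) for v, c in enumerate(row)])
    return out


block = [[64.0] * 8 for _ in range(8)]
coarse = quantize(block, quantization_factor(0.9))  # nearly full buffer
fine = quantize(block, quantization_factor(0.1))    # nearly empty buffer
```

A fuller buffer yields a larger factor and therefore a coarser quantization, which reduces the number of bits produced and prevents overflow; an emptier buffer has the opposite effect.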
The values resulting from the quantization are also applied to a prediction channel including, in series, an inverse quantizing circuit 6 (denoted Q.sup.-1), an inverse discrete cosine transform circuit 7 (denoted DCT.sup.-1), an adder 8, a picture memory 9, a circuit 18 for motion estimation and compensation based on the original picture and the non-compensated picture which is stored in said memory, and a subtracter 19 forming the difference between the input signals and the predicted signals which are available at the output of the circuit 18 (these predicted signals are also applied to a second input of the adder 8) so as to apply only the difference between these signals to the coding channel and to treat in this channel only these differences while taking the motion between the predicted picture (based on the preceding picture) and the input picture (or current picture) into account.
The pictures of the type P or B are predicted by estimating the shift(s) within the pictures with respect to the comparison picture(s) by means of motion vectors which are also transmitted in the data stream (most frequently in a differential manner with respect to a motion vector which has already been transmitted). This motion estimation, which is based on the luminance information components, consists of projecting the current block (or macroblock) for which the motion must be evaluated into the comparison picture(s) and comparing it with all the possible blocks (macroblocks) within a search window which limits the maximum range of the motion estimator. The block (macroblock) estimated in the current picture is that of the comparison picture(s) which, in a certain neighbourhood, is the most similar by correlation, the best possible correspondence generally being defined in accordance with a criterion of minimum distortion such as, for example, the search for the lowest possible sum of the absolute differences between the luminances of the pixels of the two blocks or macroblocks (the criterion being referred to as MAE--Mean Absolute Error--criterion).
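The block matching described above can be sketched as an exhaustive search over a square window, using the MAE criterion. The function names, the square window of half-width `search_range` and the handling of the picture borders (candidates outside the comparison picture are skipped) are assumptions made for illustration.

```python
def mean_absolute_error(cur, ref, bx, by, dx, dy, size):
    """MAE between the current block at (bx, by) and the block of the
    comparison picture displaced by (dx, dy). Pictures are indexed [y][x]."""
    total = 0
    for j in range(size):
        for i in range(size):
            total += abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
    return total / (size * size)


def full_search(cur, ref, bx, by, size, search_range):
    """Return (best_vector, best_error) over all integer displacements."""
    height, width = len(ref), len(ref[0])
    best_vector, best_error = None, float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates lying partly outside the comparison picture.
            if not (0 <= bx + dx and bx + dx + size <= width
                    and 0 <= by + dy and by + dy + size <= height):
                continue
            err = mean_absolute_error(cur, ref, bx, by, dx, dy, size)
            if err < best_error:
                best_vector, best_error = (dx, dy), err
    return best_vector, best_error


# Synthetic test pictures: the current picture is the comparison picture
# displaced by 2 pixels horizontally and 1 pixel vertically.
ref = [[(7 * x + 13 * y) % 251 for x in range(16)] for y in range(16)]
cur = [[(7 * (x + 2) + 13 * (y + 1)) % 251 for x in range(16)] for y in range(16)]
vector, error = full_search(cur, ref, bx=4, by=4, size=4, search_range=3)
print(vector, error)
```

The cost of such a search grows with the square of the window width, since every candidate displacement requires a full block comparison.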
Various documents such as, for example, patent application EP-0 560 577 or the article "Adaptive frame/field motion compensated video coding" by A. Puri, R. Aravind and B. Haskell, which appeared in "Signal Processing", vol. 5, nos. 1-2, February 1993, pp. 39-58, published by the European Association for Signal Processing (EURASIP), describe the principles of such a motion estimation and also of the motion compensation which is effected at the decoding end on the basis of the motion vectors thus defined, and then transmitted and/or stored. As indicated in these documents, the presence of lines from two television picture fields is exploited in the blocks (or macroblocks) under consideration, and two motion estimation modes are defined: a frame mode and a field mode, which leads to a realisation of the motion estimation circuit as shown in FIG. 3.
In this Figure, the motion estimation is realised in two channels, a channel 10 for searching the optimum frame vector and a channel 20 for searching the possible optimum field vectors. In each channel the motion estimation is realised in two stages: first, a search of the optimum motion vector(s) with integer-pixel accuracy, then a more local search of an improved vector to within half a pixel. Finally, a comparison of the motion vectors thus obtained provides the possibility of selecting one of them as being the most representative of the motion of the current block (macroblock).
More particularly, the channel 10 comprises to this end a stage 100 for searching the frame vector. Such a stage comprises, for example, memories which are arranged in series and a block matching correlator. Such a correlator, which is of the conventional type and is described, for example, in the document "Displacement measurement and its application in interframe image coding", by J. R. Jain and A. K. Jain, published in "IEEE Transactions on Communications", vol. COM-29, no. 12, December 1981, pp. 1799-1808, serves to select the motion vector to which a minimum approximation error corresponds after a search of all the possible vectors in the adopted search range. This block matching search differs slightly depending on whether the picture is of the type P or of the type B. If a picture of the type P is used, the estimated block (macroblock) is obtained from another block (macroblock) of the comparison picture, taking into account the motion which is defined by the optimum frame vector. If a picture of the type B is used, the estimated block (macroblock) is the average of the two blocks (macroblocks) obtained in a similar manner from a block (macroblock) of the previous picture and a block (macroblock) of the subsequent picture.
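The difference between the two cases can be sketched as follows, the displaced blocks being assumed to have already been fetched from the comparison picture(s). The function names and the rounding of the average to the nearest integer (halves rounded up) are assumptions made for illustration.

```python
def predict_p(prev_block):
    """Type P: the estimated block is the displaced block of the comparison picture."""
    return [row[:] for row in prev_block]


def predict_b(prev_block, next_block):
    """Type B: the estimated block is the average of the displaced blocks of the
    previous and subsequent pictures (assumed rounding: halves rounded up)."""
    return [[(a + b + 1) // 2 for a, b in zip(row_p, row_n)]
            for row_p, row_n in zip(prev_block, next_block)]


previous = [[10, 20], [30, 40]]
subsequent = [[20, 21], [30, 50]]
estimate = predict_b(previous, subsequent)
print(estimate)  # [[15, 21], [30, 45]]
```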
The stage 100 for searching the frame vector is followed by a circuit 130 for refining the search to within half a pixel, for example a symmetrical spatial interpolation filter as indicated in the cited document in "Signal Processing", February 1993. With respect to the position of a pixel resulting from the estimation before refinement, this refinement to within half a pixel takes into account the eight neighbouring positions to within half the pixel surrounding this pixel. If S is the luminance and (x,y) are the horizontal and vertical coordinates of said pixel, we have, for example:

$$S(x+0.5;\, y) = \frac{S(x;\, y) + S(x+1;\, y)}{2} \qquad (1)$$

$$S(x;\, y+0.5) = \frac{S(x;\, y) + S(x;\, y+1)}{2} \qquad (2)$$

$$S(x+0.5;\, y+0.5) = \frac{S(x;\, y) + S(x+1;\, y) + S(x;\, y+1) + S(x+1;\, y+1)}{4} \qquad (3)$$
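Equations (1) to (3) and the eight neighbouring half-pixel positions can be sketched as follows. Coordinates are expressed in half-pixel units (so that the pixel (x, y) has coordinates (2x, 2y)); this convention and the function names are assumptions made for illustration, and the caller is assumed to keep all positions inside the picture.

```python
def half_pixel_sample(s, x2, y2):
    """Luminance at position (x2 / 2, y2 / 2); s is indexed s[y][x].

    x2 and y2 are non-negative coordinates in half-pixel units.
    """
    x, y = x2 // 2, y2 // 2
    half_x, half_y = x2 % 2, y2 % 2
    if half_x and half_y:                      # equation (3)
        return (s[y][x] + s[y][x + 1] + s[y + 1][x] + s[y + 1][x + 1]) / 4
    if half_x:                                 # equation (1)
        return (s[y][x] + s[y][x + 1]) / 2
    if half_y:                                 # equation (2)
        return (s[y][x] + s[y + 1][x]) / 2
    return s[y][x]                             # integer position


def half_pixel_neighbours(x2, y2):
    """The eight half-pixel positions surrounding (x2, y2)."""
    return [(x2 + dx, y2 + dy)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dx, dy) != (0, 0)]


luminance = [[10, 20], [30, 40]]
print(half_pixel_sample(luminance, 1, 0))  # 15.0, equation (1)
print(half_pixel_sample(luminance, 0, 1))  # 20.0, equation (2)
print(half_pixel_sample(luminance, 1, 1))  # 25.0, equation (3)
```

The refinement then amounts to evaluating the distortion criterion at the eight positions returned by `half_pixel_neighbours` around the integer-pixel vector and retaining the best one.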
The reference V.sub.0 denotes this selected frame vector after refinement of the search, and the nature of the output signal of circuit 130 will hereinafter be indicated.
The channel 20 for searching the possible optimum field vectors comprises a stage 200 for searching field vectors including two sub-stages 210 and 220 for searching the field vector, each having a structure which is similar to that of the stage 100 (memories and block or macroblock-matching correlator). For this search of the optimum field vectors, two block-matching searches are performed, one for the block (macroblock) comprising the lines of a field having a given parity (i.e. half as many lines as in the case of stage 100), and the other for the block (macroblock) comprising the lines of the field of opposite parity. For each field, the reference block (macroblock) may be found either in the even field or in the odd field of the reference frame. As in the case of channel 10, the sub-stages 210 and 220 of the stage 200 are followed by circuits 230 and 240 for refining the search to within half a pixel. These circuits 230 and 240 have a structure which is identical to that of the circuit 130 and the references V.sub.1 and V.sub.2 denote the field vectors which can be selected (for the odd and even fields, respectively).
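The separation of a frame block into its two field blocks can be sketched as follows. The assumption that lines are numbered from 1, so that the odd field holds the lines of index 0, 2, 4, ... in the list, is made for illustration only.

```python
def split_fields(frame_block):
    """Split an interlaced frame block into (odd_field, even_field).

    Lines are assumed to be numbered from 1: the odd field holds list
    indices 0, 2, 4, ... and the even field holds indices 1, 3, 5, ...
    """
    odd_field = frame_block[0::2]
    even_field = frame_block[1::2]
    return odd_field, even_field


# A 16-line macroblock in which every pixel carries its line number.
frame_macroblock = [[line] * 16 for line in range(1, 17)]
odd, even = split_fields(frame_macroblock)
print(len(odd), len(even))  # 8 8
```

Each field block has half as many lines as the frame block, and a block-matching search is then carried out separately for each of the two.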
Finally, a decision circuit 250, which receives in parallel the three outputs of the circuits 130, 230, 240 for refining the search to within half a pixel, provides the possibility of choosing between frame coding and field coding, for example by comparing, in these different cases and in accordance with the same criterion as hereinbefore (criterion of minimum distortion), the sums of the absolute differences between the luminances of the pixels of the blocks (macroblocks) concerned, and by choosing that vector from the three vectors V.sub.0, V.sub.1, V.sub.2 for which this sum is the smallest. In the present case, the signals present at said three outputs of the circuits 130, 230, 240 are thus said sums corresponding to the three vectors V.sub.0, V.sub.1, V.sub.2, respectively, and they are indicated in FIG. 3 by the references S(V.sub.0), S(V.sub.1), S(V.sub.2), respectively. Similarly, the outputs of the stage 100 and the sub-stages 210 and 220 are denoted by S(U.sub.0), S(U.sub.1), S(U.sub.2), respectively, in which U.sub.0, U.sub.1, U.sub.2 are the frame and field vectors selected before refinement of the search and respectively corresponding to these sums.
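The selection performed by the decision circuit 250 can be sketched as a minimum search over the three sums. The function name, the returned tuple and the tie-breaking rule (the frame vector is preferred when sums are equal) are assumptions made for illustration.

```python
def decide(s_v0, s_v1, s_v2):
    """Return (mode, vector_name, distortion) for the smallest of the three sums.

    s_v0 is the sum for the frame vector V0; s_v1 and s_v2 are the sums for
    the field vectors V1 and V2. On a tie, the first candidate in the list
    (the frame vector) is retained.
    """
    candidates = [("frame", "V0", s_v0),
                  ("field", "V1", s_v1),
                  ("field", "V2", s_v2)]
    return min(candidates, key=lambda candidate: candidate[2])


print(decide(120, 90, 150))  # ('field', 'V1', 90)
```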
In the motion estimation circuit of FIG. 3, the stages 100 and 200 are however the most expensive because of their complexity.