The invention relates to a method of estimating motion between images forming a sequence P(t-n), P(t-n+1), . . . , P(t-2), P(t-1), P(t), . . . , corresponding to a sequence S(t-n), S(t-n+1), . . . , S(t-2), S(t-1), S(t), . . . , of segmented images, or partitions, composed of I regions R_i identified by labels, and to a corresponding device for motion estimation. The invention also relates to a system of encoding segmented images by means of this method and device.
The invention is particularly suitable for encoding video signals at very low bitrates and at low bitrates, up to approximately 1 Mbit/second. This range of bitrates notably corresponds to consumer applications, often termed multimedia applications.
Over the past ten to fifteen years, the compression of stationary or animated images has become a major industrial technology progressively covering numerous sectors: digital television, data storage, telemonitoring, videophone systems. However, other applications are currently emerging such as, for example, multimedia applications based on local data networks, the transmission of images to mobile systems, or videophone systems for switched telephone networks, each constituting a new challenge. While maintaining an equal image quality, these applications, which are based on media that cost less to use because of their reduced passband, require compression rates which are higher than those used within the framework of the major image encoding standards such as H.261, JPEG, MPEG-1 or MPEG-2. Moreover, the services proposed for these media enable users to interact with the contents of the video images, i.e. to have direct access to the different constituent objects in order to manipulate them. Several methods complying with these requirements of compression and interactivity will certainly be set against one another within the framework of the standardization procedures for image encoding currently carried out by the MPEG committee (Moving Picture Experts Group) of the ISO (International Organization for Standardization) for finalizing the future standard MPEG-4 by 1998.
Whatever the method which will then be used, the necessity of compressing animated images requires efficient methods of compensating the motion in these images and thus a prior estimation of this motion. On the other hand, the necessity of being able to interact with the image contents requires a so-called "object" representation of the motion of the different elements of each of these images.
In a sequence of images, a conventional method of estimating motion between two of these images (referred to as the previous and subsequent images) consists of subdividing each of them into a two-dimensional array of adjacent elementary blocks of equal dimensions and of applying the block-matching method, hereinafter referred to as BMA (for Block Matching Algorithm), which is described, for example, in the article "A VLSI architecture for hierarchical motion estimation" in the magazine "IEEE Transactions on Consumer Electronics", Vol. 41, No. 2, May 1995, pp. 248-257. This technique assumes that the blocks are sufficiently small (composed of, for example, 16×16 pixels, which is a non-limitative example) for the motion of each block to be considered a simple translation parallel to the plane of the image, all the pixels of a block being assumed to have the same motion. Thus, a block of the subsequent image may be compared with the block occupying the same position in the previous image and with a defined number of blocks at neighboring positions bounded by a search window, so as to select, from these previous blocks, the one which most resembles the reference block of the subsequent image. The relative position of the selected block and of the reference block defines a motion vector indicating the translation from one block to the other between the previous image and the subsequent image.
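By way of illustration only, the block-matching principle described above may be sketched as follows. This is a minimal exhaustive-search sketch and not the hierarchical architecture of the cited article; the similarity criterion (sum of absolute differences), the function names `sad` and `match_block`, the block size of 4 pixels, the search range of 2 pixels and the synthetic test images are all assumptions made for the example.

```python
def sad(prev, curr, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n reference block of
    `curr` at (bx, by) and the block of `prev` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(curr[by + y][bx + x] - prev[by + dy + y][bx + dx + x])
    return total


def match_block(prev, curr, bx, by, n=4, search=2):
    """Return the vector (dx, dy) leading from the reference block of the
    subsequent image to the most similar block of the previous image,
    searching +/- `search` pixels around the same position."""
    h, w = len(prev), len(prev[0])
    best, best_cost = (0, 0), sad(prev, curr, bx, by, 0, 0, n)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidate blocks that fall outside the previous image.
            if bx + dx < 0 or by + dy < 0 or bx + dx + n > w or by + dy + n > h:
                continue
            cost = sad(prev, curr, bx, by, dx, dy, n)
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best


# Synthetic previous image (assumption: any sufficiently textured pattern).
SIZE = 12
prev = [[(3 * x * x + 5 * y * y + 7 * x * y) % 251 for x in range(SIZE)]
        for y in range(SIZE)]
# Subsequent image: the previous image shifted one pixel right and one down.
curr = [[prev[y - 1][x - 1] if x > 0 and y > 0 else 0 for x in range(SIZE)]
        for y in range(SIZE)]

vector = match_block(prev, curr, bx=4, by=4)
print(vector)
```

With these test images the selected vector is (-1, -1): the content of the reference block is found one pixel up and to the left in the previous image. In this sketch the vector points from the subsequent image back to the previous image, which is one possible sign convention among others.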
These comparison operations, which are repeated for all the blocks of the subsequent image, associate a field of motion vectors with these blocks. When the information corresponding to the pixels of a block is to be subsequently encoded and then transmitted and/or stored, it is sufficient to encode and then transmit and/or store the corresponding motion vectors instead: based on the block selected in the previous image, these vectors provide information about the new position of the block after its displacement in the subsequent image under consideration.
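The use of such a field of vectors may be sketched as follows, purely as an illustration: given the previous image and one vector per block, a prediction of the subsequent image is rebuilt by copying the designated blocks. The function name `compensate`, the representation of the field as a dictionary keyed by the top-left corner of each block, and the convention that a vector points from the block in the subsequent image to its source in the previous image are assumptions made for this example.

```python
def compensate(prev, field, n):
    """Predict the subsequent image by copying, for each n x n block, the
    block of `prev` designated by that block's motion vector."""
    h, w = len(prev), len(prev[0])
    pred = [[0] * w for _ in range(h)]
    for (bx, by), (dx, dy) in field.items():
        for y in range(n):
            for x in range(n):
                pred[by + y][bx + x] = prev[by + dy + y][bx + dx + x]
    return pred


# Hypothetical 4 x 4 previous image whose pixel value encodes its position.
prev = [[10 * y + x for x in range(4)] for y in range(4)]
# One vector per 2 x 2 block, keyed by the block's top-left corner (x, y).
field = {
    (0, 0): (0, 0),    # stationary block
    (2, 0): (-1, 0),   # content comes from one pixel to the left
    (0, 2): (0, -1),   # content comes from one pixel above
    (2, 2): (-1, -1),  # content comes from the upper-left diagonal
}
pred = compensate(prev, field, n=2)
for row in pred:
    print(row)
```

Only the four vectors, rather than the sixteen pixel values, need to be known in order to rebuild this prediction from the previous image.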
However, although this technique is also suitable for the applications mentioned hereinbefore, it has the following drawback: the images are treated as two-dimensional signals without their actual contents being taken into account, and there is normally no reason for the contours of the elementary blocks to coincide with those of the objects which are really present in the scenes. The block-matching method thus leads to degradation of the images, for example, when the boundary between two objects of an image, each moving in a distinct manner, lies in the middle of a block. In this case, the motion estimation is no longer reliable and the restored images have a poor quality. When, in contrast, a region having a large surface area and a homogeneous motion comprises numerous blocks which are each associated with the same information, the resultant excessive redundancy of information is also detrimental to the effectiveness of the encoding operation.
It should be noted that the motions between one image and another may of course be very different in nature. The local motion of objects, which in numerous cases may be regarded as a translation, is often superimposed on motions of the pick-up camera, such as zoom (motion at a fixed or variable focus along an axis perpendicular, or transverse, to the plane of the images) and panning motions (rotations through a sufficiently small angle about an axis which is substantially parallel to the plane of the images). Even if the local analysis of the motion has yielded, for example, a field of motion vectors satisfactorily describing the translations of the blocks from one image to another, these global zoom and/or panning motions will disperse the field of vectors. A global analysis of the motions must thus be effected simultaneously, but the number of parameters describing all these motions then becomes increasingly large.
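The dispersion of the field of vectors may be illustrated by the following sketch. It assumes a simplified global model which is not taken from the present description: the zoom is modeled as a scaling by a factor `zoom` about a center point, and a small-angle panning motion is approximated by a uniform translation `pan`. The function names, the block centers and all numerical values are hypothetical.

```python
def global_displacement(x, y, zoom, pan, center):
    """Displacement of pixel (x, y) caused by the camera alone, under the
    assumed model: scaling by `zoom` about `center` plus a translation."""
    cx, cy = center
    tx, ty = pan
    return ((zoom - 1.0) * (x - cx) + tx, (zoom - 1.0) * (y - cy) + ty)


def apparent_vector(x, y, local, zoom, pan, center):
    """Vector observed at (x, y): local translation plus camera motion."""
    gx, gy = global_displacement(x, y, zoom, pan, center)
    return (local[0] + gx, local[1] + gy)


# Centers of four blocks belonging to one object translating by (1, 0).
centers = [(8, 8), (24, 8), (8, 24), (24, 24)]
local = (1.0, 0.0)

# Stationary camera: every block carries the same vector.
still = [apparent_vector(x, y, local, 1.0, (0.0, 0.0), (16, 16))
         for x, y in centers]
# Zooming and panning camera: the vectors differ from block to block.
moving = [apparent_vector(x, y, local, 1.25, (2.0, 0.0), (16, 16))
          for x, y in centers]

print(still)
print(moving)
```

In this example the four blocks share the single vector (1.0, 0.0) when the camera is stationary, whereas with the assumed zoom and pan each block carries a different vector, although the object itself still undergoes one translation. Describing the second case compactly requires the additional global parameters (here `zoom`, `pan` and `center`).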
In the envisaged applications, these limitations of the BMA block-matching method have led to the development of other techniques which are based on a specific analysis of the image and on a better comprehension of its structure. This analysis consists of considering an image as the projection of a three-dimensional scene comprising stationary and animated objects, of trying to identify these different objects in each image and then of estimating their representative parameters (which are related, for example, to their shape, color, texture, motion, etc.), i.e. more generally, of defining a segmentation of the images into regions R_i which are both individual and homogeneous with respect to a given criterion.
The document "Region-based motion analysis for video encoding at low bitrates" by H. Sanson, CCETT, pp. 1-8, published by ISO under the reference ISO/IEC-JTC1/SC29/WG11/MPEG94 in March 1994, describes a method with which both a segmentation of the images of a video sequence into regions that are homogeneous with respect to motion and a satisfactory estimation of the parameters describing the motions in these regions can be performed. However, this method seems to be appropriate only in situations where no information relating to the image contents is available.