The present invention relates to a segmentation and object tracking method applied to an image sequence, said method comprising in series the following steps:
(A) a segmentation step for defining the regions of a first coarse partition P(t-1) of an image I(t-1) and, from said coarse partition P(t-1) and on the basis of a spatial homogeneity criterion, a finer partition FP(t-1);
(B) a projection step for defining a projection PFP(t) of said fine partition FP(t-1) into the current image I(t);
(C) a re-labelling step of said projected fine partition PFP(t), for defining the final partition P(t);
said projection step comprising in series the following sub-steps:
(1) a marker projection sub-step, using motion and spatial information for yielding a set of markers for the current image by means of a motion compensation of each region of the previous image;
(2) a partition creation sub-step, using spatial information contained in the current image and in the previous original images for a growing process of said set of compensated markers in order to obtain said final partition;
and wherein:
(1) said marker projection sub-step comprises itself in series the following operations:
(2) said partition creation sub-step comprises itself in series the following operations:
The invention also relates to a corresponding system for carrying out said method.
One of the best-known compression techniques for the transmission of image data, cosine transform coding, does not allow compression ratios greater than about 16:1 to be obtained. At low and very low bitrate coding, compression ratios can be improved by incorporating knowledge about the image contents into the coding scheme, thanks to techniques that segment objects from the background of the images, detect the segmented objects and, after having coded these objects as texture regions surrounded by contours, transmit the data related to them. However, these contours and textures are not efficiently coded in a three-dimensional space (the discrete nature of the time dimension leads to great discontinuities) and, in order to reach very low bitrates, motion compensation has to be used.
This lack of connectivity between regions related to objects considered at discrete successive times may indeed be solved by including motion information in the segmentation, which is particularly necessary when large motion is present in a video sequence. This is done, for example, by segmenting a frame or picture F(t) (t being the time) on the basis of the segmentation already obtained for the previous frame or picture F(t-1): a backward motion estimation is computed between both frames, and a motion compensation of said previous frame F(t-1) and of its segmentation is carried out. Such a technique allows the regions corresponding to the objects selected as areas of interest to be tracked rather efficiently through the time domain.
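The compensation of a previous segmentation can be sketched as follows. This is an illustrative toy example, not the patent's estimator: a single uniform motion vector is assumed for the whole label map, whereas a real scheme estimates motion per region or per block.

```python
# Illustrative sketch: carrying the labels of a previous partition P(t-1)
# to their predicted positions in frame t, using one backward-estimated
# motion vector (dy, dx). A uniform vector is a simplifying assumption.

def compensate_partition(partition, motion, fill=0):
    """Shift a 2-D label map by (dy, dx); uncovered pixels get `fill`."""
    h, w = len(partition), len(partition[0])
    dy, dx = motion
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx          # backward mapping into t-1
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = partition[sy][sx]
    return out

# A 4x4 partition with region 1 in the top-left corner, moved right by 1.
p_prev = [[1, 1, 0, 0],
          [1, 1, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 0]]
p_proj = compensate_partition(p_prev, (0, 1))
```

Pixels left uncovered by the compensated regions (here filled with 0) are precisely those that a subsequent growing process must assign.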
However, in the field of video coding, new coding schemes with embedded content-based functionalities enabling the separate manipulation and definition of the various objects of a scene (whatever the object definition, which may rely on very different criteria) are a more and more active research field, especially in relation to the future MPEG-4 standard, which targets interactive multimedia applications and will probably be frozen before the end of 1997 or in 1998. Once objects, or groups of objects, have been defined, they have to be tracked through the sequence. It is this tracking capability which really opens the door to content-based functionalities, allowing the information about the objects in the previous frames to be related to the current and future frames, that is, allowing a temporal evolution of the objects to be defined (in addition, this tracking capability allows the user to mark the selected object only once).
Classical object tracking techniques, using motion as the main information, may fail in tracking an object composed of several parts presenting different motions (for example, a walking person whose arms and body move differently). In addition, motion-based tracking techniques cannot track parts of an object if the complete object follows a given motion (for example, they are not able to track only the face of a person separately from the hair). Finally, if the complete scene is static (the object does not move) or if there is a global motion of the camera (e.g. panning), motion-based tracking techniques cannot track or may have difficulties in tracking the selected object. A static scene (or a scene that becomes static) does not provide any motion information, and the detection of objects based on their motion is therefore difficult. Analogously, a global motion of the camera creates an apparent motion for all objects in a scene and, therefore, objects cannot easily be detected based on the separation into static and moving areas.
In order to track such types of objects, some techniques propose to cope with different object definition criteria. An object tracking method relying on the concept of partition projection, but extending the technique to the case of regions with any type of homogeneity, is described for example in "Tracking areas of interest for content-based functionalities in segmentation-based video coding", F. Marques et al., Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, May 7-10, 1996, Atlanta, Ga., USA. In this method, a previous image I(t-1) and its partition P(t-1) are motion compensated, I(t-1) leading to I(t) and P(t-1) to P(t), and the compensated regions are used, in the compensated image I(t), as markers that are extended in the current image by means of the well-known 3D watershed algorithm.
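The marker-extension idea can be sketched with a much simplified stand-in for the 3D watershed: the compensated regions serve as markers, and unlabelled pixels are flooded in order of grey-level similarity to their already-labelled neighbours. The function name, the 4-connectivity, and the additive cost are illustrative choices, not the exact algorithm of the cited paper.

```python
import heapq

# Simplified marker-driven growing (a stand-in for the 3D watershed):
# pixels are labelled in increasing order of accumulated grey-level
# difference from the nearest marker.

def grow_markers(image, markers):
    h, w = len(image), len(image[0])
    labels = [row[:] for row in markers]         # 0 = unlabelled
    heap = []

    def push_neighbours(y, x, lab, cost):
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not labels[ny][nx]:
                d = abs(image[ny][nx] - image[y][x])
                heapq.heappush(heap, (cost + d, ny, nx, lab))

    # Seed the flooding queue from every marker pixel.
    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                push_neighbours(y, x, labels[y][x], 0)

    # Label each pixel when it is first popped (cheapest path wins).
    while heap:
        cost, y, x, lab = heapq.heappop(heap)
        if not labels[y][x]:
            labels[y][x] = lab
            push_neighbours(y, x, lab, cost)
    return labels

# Two markers on a 1x6 "image" with a step edge between columns 2 and 3.
image   = [[10, 10, 10, 90, 90, 90]]
markers = [[1,  0,  0,  0,  0,  2]]
final = grow_markers(image, markers)   # -> [[1, 1, 1, 2, 2, 2]]
```

Labelling on pop rather than on push is what makes the boundary settle on the step edge: the pixel at column 3 is reached cheaply from marker 2 before its expensive path from marker 1 is considered.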
Said tracking technique relies on a double partition approach, i.e. it uses two levels of partition: a coarse level partition, which is related to the coding scheme, and a finest level partition, which contains a more detailed description of the current image and allows the tracking of the areas of interest. For each image, both segmentations are carried out in parallel and the coarse partition constrains the finest one: all contours in the coarse partition are also present in the finest one, the ultimate goal being to obtain a final partition containing the necessary regions to efficiently code the image, as well as the necessary regions to correctly track said areas of interest.
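The constraint that every coarse contour is also present in the fine partition can be stated equivalently as: every fine region lies entirely inside a single coarse region. A small check of this property, with illustrative label maps (the labels are hypothetical, not data from the cited paper):

```python
# Check that a fine partition refines a coarse one: no fine region may
# straddle a coarse contour, i.e. each fine label must map to exactly
# one coarse label.

def respects_coarse(fine, coarse):
    owner = {}
    for frow, crow in zip(fine, coarse):
        for f, c in zip(frow, crow):
            if owner.setdefault(f, c) != c:
                return False        # fine region f crosses a coarse contour
    return True

coarse   = [[1, 1, 2, 2],
            [1, 1, 2, 2]]
fine_ok  = [[10, 11, 20, 20],
            [10, 11, 21, 21]]       # refines `coarse`
fine_bad = [[10, 10, 10, 20],
            [10, 10, 21, 21]]       # region 10 crosses the coarse contour
```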
In the present case, the partition P(t-1) of the previous image (this first-level partition is formed by the objects that have been selected) is re-segmented, which yields a fine partition FP(t-1) guaranteeing the spatial homogeneity of each fine region. This fine partition FP(t-1) is then projected into the current image to obtain a fine partition at time t (PFP(t)), and the final partition P(t) is obtained by re-labelling said projected fine partition PFP(t).
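The re-labelling step above can be sketched as follows: since each fine region of FP(t-1) originated inside exactly one region of P(t-1), writing that parent label back over every fine label of PFP(t) yields the final partition P(t). All labels here are illustrative.

```python
# Sketch of the re-labelling step: fine labels of the projected partition
# PFP(t) are replaced by the coarse (object) labels they refine.

def build_parent_map(fine_prev, coarse_prev):
    """Associate each fine label of FP(t-1) with its coarse label in P(t-1)."""
    parent = {}
    for frow, crow in zip(fine_prev, coarse_prev):
        for f, c in zip(frow, crow):
            parent[f] = c
    return parent

def relabel(projected_fine, parent):
    return [[parent[f] for f in row] for row in projected_fine]

fp_prev = [[10, 11], [12, 12]]       # fine partition FP(t-1)
p_prev  = [[1, 1], [2, 2]]           # coarse partition P(t-1)
pfp_t   = [[11, 11], [12, 10]]       # projected fine partition PFP(t)
p_t = relabel(pfp_t, build_parent_map(fp_prev, p_prev))   # -> [[1, 1], [2, 1]]
```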
The corresponding complete procedure is illustrated in FIG. 1, where the evolution of a selected object (partition P(t-1) at time t-1, fine partition FP(t-1) at time t-1, projected fine partition PFP(t) at time t, partition P(t) at time t) is shown. In this example, the re-labelling procedure yields unconnected components with the same label (grey areas in the projected fine partition PFP(t)), which are here considered as being projection errors and are therefore removed.
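The clean-up described above can be sketched as follows: components that carry the same label but are not connected to the main one are treated as projection errors. In this sketch the largest component of each label is kept and the stray ones are reset to 0, a hypothetical "unassigned" value; 4-connectivity is assumed.

```python
# Remove stray same-label components from a label map, keeping only the
# largest connected component of each label (others are set to 0).

def remove_stray_components(partition):
    h, w = len(partition), len(partition[0])
    seen = [[False] * w for _ in range(h)]
    comps = {}                                   # label -> list of components
    for y in range(h):
        for x in range(w):
            if not seen[y][x]:
                lab, stack, comp = partition[y][x], [(y, x)], []
                seen[y][x] = True
                while stack:                     # flood-fill one component
                    cy, cx = stack.pop()
                    comp.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] \
                                and partition[ny][nx] == lab:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                comps.setdefault(lab, []).append(comp)
    out = [row[:] for row in partition]
    for lab, parts in comps.items():
        for comp in sorted(parts, key=len)[:-1]:  # all but the largest
            for y, x in comp:
                out[y][x] = 0
    return out

# Label 1 appears as a connected blob and a stray single pixel at (2, 2).
pfp = [[1, 1, 2],
       [1, 2, 2],
       [2, 2, 1]]
cleaned = remove_stray_components(pfp)   # stray pixel reset to 0
```

In a full scheme the zeroed pixels would then be reassigned by the growing process rather than left unlabelled.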