A first means for transmitting to remote parties a scene involving a speaker in front of a projected backdrop is to pick up the scene with a video camera and to transmit it to those parties over a telecommunications network.
Nevertheless, proceeding in that manner presents the drawback that the optical resolution of the video camera is generally much lower than the definition of the projected digital images. As a result, although the gestures of the speaker are received properly, the backdrop of the image as picked up and transmitted to the remote audience becomes practically illegible, which considerably limits the value of that type of remote transmission.
To remedy that drawback, one solution consists in sharing the digital images that form the projected backdrop between the speaker and the remote audience, in extracting the speaker's gestures from the image as picked up, in transmitting those gestures to the remote parties, and in overlaying them on the shared images. This results in a backdrop image that retains its definition, together with the gestures of the speaker.
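By way of illustration only, the overlaying step may be sketched as follows. The function name `overlay_gestures`, the representation of images as lists of rows, and the nearest-neighbour mapping between camera and shared-image coordinates are all assumptions made for this sketch; it further assumes that the camera image is already geometrically aligned with the shared image, which the above description does not state.

```python
def overlay_gestures(shared, camera, mask):
    """Overlay the camera pixels flagged in `mask` onto the shared image.

    shared: full-definition backdrop image (list of rows of pixel values)
    camera: lower-resolution image as picked up (list of rows)
    mask:   same size as `camera`; True where a pixel belongs to the speaker
    """
    out_h, out_w = len(shared), len(shared[0])
    cam_h, cam_w = len(camera), len(camera[0])
    # Copy the shared image so that the backdrop keeps its full definition.
    result = [row[:] for row in shared]
    for y in range(out_h):
        cy = y * cam_h // out_h  # nearest-neighbour row mapping (assumption)
        for x in range(out_w):
            cx = x * cam_w // out_w  # nearest-neighbour column mapping
            if mask[cy][cx]:
                result[y][x] = camera[cy][cx]
    return result
```

In such a sketch only the mask and the flagged camera pixels would need to be transmitted, since the remote parties already hold the shared image.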
Various methods have already been proposed for extracting from the image as picked up an object situated in the foreground of a backdrop, the object here being specifically the hands and the arms of the speaker. One such method in particular is described in international application WO 2005/036456.
That known method relies on analyzing local characteristics extracted from the backdrop image, in particular by the discrete cosine transform (DCT) method. The backdrop model is estimated, pixel block by pixel block, by training on a sequence of images of the backdrop, on the assumption that the local characteristics have independent Gaussian distributions. Those characteristics are then estimated in the current image, and any pixels or groups of pixels that do not comply with the trained model, in application of a given thresholding criterion, are considered as belonging to objects in the foreground. The backdrop model is updated progressively over time by means of a linear weighting, governed by a training parameter, between the local characteristics of the backdrop model and the characteristics coming from the current image.
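The general principle of such a block-based method may be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation of the cited application: the block size, the choice of three low-frequency DCT coefficients as local characteristics, the variance floor `min_var`, the threshold value, and the weighting parameter `alpha` are all hypothetical, and only the means of the model are updated.

```python
import math

BLOCK = 4  # hypothetical block size, in pixels per side


def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = []
    for u in range(n):
        row = []
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                          * math.cos(math.pi * (2 * y + 1) * v / (2 * n)))
            row.append(scale(u) * scale(v) * s)
        out.append(row)
    return out


def block_features(image, bx, by):
    """Local characteristics of one block: three low-frequency DCT terms."""
    block = [row[bx:bx + BLOCK] for row in image[by:by + BLOCK]]
    c = dct2(block)
    return [c[0][0], c[0][1], c[1][0]]


def iter_blocks(image):
    """Yield the top-left corner (bx, by) of each non-overlapping block."""
    for by in range(0, len(image) - BLOCK + 1, BLOCK):
        for bx in range(0, len(image[0]) - BLOCK + 1, BLOCK):
            yield bx, by


def train_model(frames, min_var=1.0):
    """Estimate, block by block, the mean and variance of each characteristic."""
    model = {}
    for bx, by in iter_blocks(frames[0]):
        columns = list(zip(*(block_features(f, bx, by) for f in frames)))
        means = [sum(col) / len(col) for col in columns]
        # The variance floor is an assumption that avoids division by zero.
        variances = [max(sum((v - m) ** 2 for v in col) / len(col), min_var)
                     for col, m in zip(columns, means)]
        model[(bx, by)] = (means, variances)
    return model


def segment(model, image, threshold=3.0):
    """Return the blocks whose characteristics do not comply with the model."""
    foreground = set()
    for key, (means, variances) in model.items():
        feats = block_features(image, *key)
        # Independent Gaussians: each characteristic is tested separately.
        if any(abs(f - m) / math.sqrt(v) > threshold
               for f, m, v in zip(feats, means, variances)):
            foreground.add(key)
    return foreground


def update_model(model, image, alpha=0.05):
    """Linearly weight the model means against the current image."""
    for key, (means, variances) in model.items():
        feats = block_features(image, *key)
        new_means = [(1 - alpha) * m + alpha * f for m, f in zip(means, feats)]
        model[key] = (new_means, variances)
```

For example, after `model = train_model(backdrop_frames)`, a call to `segment(model, current_image)` returns the top-left corners of the blocks treated as foreground, and `update_model(model, current_image)` then shifts the model towards the current image.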
Nevertheless, the segmentation of foreground objects obtained in this way is generally fairly imprecise, particularly when the backdrop is complex. In addition, any change in the backdrop or in the position of the camera is automatically identified as forming part of the foreground, which naturally leads to major errors in extracting the desired objects.