1. Field of the Invention
The present invention generally relates to the field of digital image processing, particularly to processing of digital video, and specifically to segmentation of digital video frames, particularly for background replacement.
2. Description of Related Art
In digital video processing, segmentation of a video stream into distinct component objects is a known and useful technique.
For example, an input video stream may be separated into two different streams, one containing the foreground subjects/objects (for the purposes of the present invention, the term “foreground subject” is used hereinafter to denote both foreground subjects and foreground objects), and the other containing the background of the video frames. In a videocommunication (e.g. videotelephony) sequence between two persons, the foreground is for example represented by a talking person, usually limited to the trunk, the head and the arms (a so-called “talking head”).
The possibility of segmenting a video sequence into foreground and background streams is useful, for example, for changing the video sequence background: the original background is removed and a substitute background of the users' choice is inserted, for instance to hide the talking head's surroundings for reasons of privacy, or to share video clips, movies, photographs or TV sequences while communicating with other persons, and for similar applications.
The aim of many segmentation algorithms is to analyze a digital video sequence and to generate a binary mask, the so-called “foreground mask”, wherein every pixel of every video frame of the video sequence is marked as either a background or a foreground pixel. In applications like videocommunication, the above operation is to be performed in real time, at a frame rate that, in a sufficiently fluid videocommunication sequence, is of the order of 25 to 30 frames per second (fps).
Several solutions for image segmentation have been proposed in the art.
In L. Lucchese and S. K. Mitra, “Color Image Segmentation: A State-of-the-Art Survey”, Proc. of the Indian National Science Academy (INSA-A), New Delhi, India, Vol. 67, A, No. 2, March 2001, pp. 207-221, a review of algorithms for segmentation of color images is provided.
In A. R. J. François and G. G. Medioni, “Adaptive Color Background Modeling for Real-time Segmentation of Video Streams,” Proceedings of the International Conference on Imaging Science, Systems, and Technology, pp. 227-232, Las Vegas, NV, June 1999, a system is presented to perform real-time background modeling and segmentation of video streams on a Personal Computer (PC), in the context of video surveillance and multimedia applications. The images, captured with a fixed camera, are modeled as a fixed or slowly changing background, which may become occluded by mobile agents. The system learns a statistical color model of the background, which is used for detecting changes produced by occluding elements. It is proposed to operate in the Hue-Saturation-Value (HSV) color space, instead of the traditional RGB (Red, Green, Blue) space, because it makes better use of the color information, and naturally incorporates gray-level only processing. At each instant, the system maintains an updated background model, and a list of occluding regions that can then be tracked.
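By way of illustration only, the following minimal sketch shows the RGB-to-HSV conversion underlying such a choice of color space, using the `colorsys` module of the Python standard library. The helper name `to_hsv` and the 8-bit input range are assumptions made for this example and are not taken from the cited paper. The sketch shows that a gray-level pixel (equal R, G and B) has zero saturation, so that only its Value component carries information.

```python
import colorsys

def to_hsv(r, g, b):
    """Convert an 8-bit RGB triple to HSV, each component in [0, 1].

    Hypothetical helper for illustration; the 8-bit range is an assumption.
    """
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)

# A gray pixel: hue is meaningless, saturation is zero, value is the gray level.
gray_h, gray_s, gray_v = to_hsv(128, 128, 128)

# A saturated red pixel: full saturation and full value.
red_h, red_s, red_v = to_hsv(255, 0, 0)
```

A background model operating on HSV pixels may therefore treat low-saturation pixels by their Value component alone, which is one possible reading of "gray-level only processing".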
In D. Butler, S. Sridharan and V. M. Bove, Jr., “Real-time adaptive background segmentation,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-2003), pp. 349-352, April 2003, an algorithm is proposed that represents each pixel in the frame by a group of clusters. The clusters are ordered according to the likelihood that they model the background and are adapted to deal with background and lighting variations. Incoming pixels are matched against the corresponding cluster group and are classified according to whether the matching cluster is considered part of the background. The algorithm has allegedly demonstrated equal or better segmentation than the other techniques and proved capable of processing 320×240 video at 28 fps, excluding post-processing.
U.S. Pat. No. 6,625,310 discloses a method for segmenting video data into foreground and background portions that utilizes statistical modeling of the pixels; a statistical model of the background is built for each pixel, and each pixel in an incoming video frame is compared with the background statistical model for that pixel. Pixels are determined to be foreground or background based on the comparisons. In particular, a pixel is determined to match the background statistical model if the value of the pixel matches the mode for that pixel: the absolute difference is taken between the pixel value and the value of the background statistical model for the pixel (i.e., the mode), and it is compared to a threshold. If the absolute difference is less than or equal to the threshold, the pixel value is considered to match the background statistical model, and the pixel is labeled as background; otherwise, the pixel is labeled as foreground.
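The per-pixel comparison described above may be sketched, for example, as follows. This is a simplified illustration under stated assumptions, not the implementation of the cited patent: frames are assumed to be single-channel (gray-level) images stored as lists of rows, the background statistical model is reduced to the per-pixel mode over a short frame history, and the names `background_mode` and `foreground_mask` are hypothetical.

```python
from collections import Counter

def background_mode(history):
    """Per-pixel mode over a list of gray-level frames (each a list of rows)."""
    rows, cols = len(history[0]), len(history[0][0])
    return [
        [Counter(frame[r][c] for frame in history).most_common(1)[0][0]
         for c in range(cols)]
        for r in range(rows)
    ]

def foreground_mask(frame, mode, threshold):
    """Label each pixel: 0 = background (|pixel - mode| <= threshold), 1 = foreground."""
    return [
        [0 if abs(pixel - m) <= threshold else 1
         for pixel, m in zip(frame_row, mode_row)]
        for frame_row, mode_row in zip(frame, mode)
    ]

# Three 2x2 background frames and one incoming frame (values are illustrative).
history = [
    [[10, 10], [10, 10]],
    [[10, 12], [10, 10]],
    [[11, 12], [10, 10]],
]
mode = background_mode(history)
mask = foreground_mask([[12, 50], [10, 14]], mode, threshold=3)
```

In this example the pixels whose absolute difference from the mode exceeds the threshold of 3 are labeled as foreground, and the remaining pixels as background.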
US 2004/0032906 discloses a method and system for segmenting foreground objects in digital video that facilitates segmentation in the presence of shadows and camera noise. A background registration component generates a background reference image from a sequence of digital video frames. A gradient segmentation component and a variance segmentation component process the intensity and chromatic components of the digital video to determine foreground objects and produce foreground object masks. The segmentation component data may be processed by a threshold-combine component to form a combined foreground object mask. A background reference image is identified for each video signal from the digital video; the background reference image is subtracted from each video signal component of the digital video to form a resulting frame; and the resulting frame associated with the intensity video signal component is processed with a gradient filter to segment foreground objects and generate a foreground object mask.
Morphological closing of the foreground mask is a known technique to reduce false background pixels in the foreground mask, as for example described in the paper by D. Butler, S. Sridharan and V. M. Bove, Jr. The morphological closing is in particular an operation adapted to correct at least some of the artifacts usually present in the foreground mask, particularly artifacts in the form of holes in the foreground subjects, caused for example by similarities between the color of the foreground subject and of the underlying background pixels.
In particular, the morphological closing operation involves two operations: a “mask dilation” operation, wherein the foreground subject areas in the foreground mask are expanded, or “dilated”; and a subsequent “mask erosion” operation, in which the foreground subject areas in the foreground mask are brought back to their original dimensions. After the mask dilation, small holes possibly present in the foreground subject areas are absorbed by the foreground, and disappear.
In the conventional morphological closing operation, the pixels in the foreground mask are processed; for each pixel, a fixed number of neighboring pixels are considered, for example all those pixels contained in a rectangle (the “dilation window” or “dilation mask”) of predetermined size, like 3×3 pixels or 9×9 pixels. The values of the neighboring pixels affect the resulting value of the pixel considered, after the dilation and the erosion operations.
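The conventional morphological closing described above may be sketched, for instance, as follows. This is a minimal illustration only: the mask is assumed to be a list of rows with 1 for foreground and 0 for background, the window is a square of side 2·half+1 pixels (half=1 gives the 3×3 window mentioned above), the window is simply clipped at the image borders, and all function names are hypothetical.

```python
def _window_filter(mask, half, combine):
    """Apply `combine` (max or min) over a square window around each pixel.

    The window is clipped at the image borders (an assumption of this sketch).
    """
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                mask[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            out[r][c] = combine(window)
    return out

def dilate(mask, half=1):
    """A pixel becomes foreground if any pixel in its window is foreground."""
    return _window_filter(mask, half, max)

def erode(mask, half=1):
    """A pixel stays foreground only if every pixel in its window is foreground."""
    return _window_filter(mask, half, min)

def close_mask(mask, half=1):
    """Morphological closing: dilation followed by erosion with the same window."""
    return erode(dilate(mask, half), half)

# 7x7 mask with a 3x3 foreground block containing a one-pixel hole at its center.
solid = [[1 if 2 <= r <= 4 and 2 <= c <= 4 else 0 for c in range(7)]
         for r in range(7)]
holed = [row[:] for row in solid]
holed[3][3] = 0

closed = close_mask(holed, half=1)
```

In this example the dilation fills the one-pixel hole and enlarges the block by one pixel on each side, and the erosion then restores the original 3×3 outline while leaving the hole filled.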