Many videos require processing to find objects, determine events, and quantify application-dependent visual assessments, as shown by the recent MPEG-4 and MPEG-7 standardization efforts, and to analyze characteristics of video sequences, see, e.g., R. Castagno, T. Ebrahimi, and M. Kunt, “Video segmentation based on multiple features for interactive multimedia applications,” IEEE Trans. on Circuits and Systems for Video Technology, Vol. 8, No. 5, pp. 562-571, September 1998. Content-based video representation requires the decomposition of an image or video sequence into specific objects, e.g., separating moving persons from static backgrounds.
Many television broadcasts contain scenes where a person is speaking in front of a relatively static background, e.g., news programs, panel shows, biographies, and soap operas. Also, video-conference applications extensively use head-and-shoulder scenes to achieve visual communication. The increasing availability of mobile video cameras will make peer-to-peer, bandwidth-constrained facial communication prevalent in the future. Thus, accurate object segmentation of head-and-shoulder type video sequences, also known as “talking head” sequences, is an important aspect of video processing.
However, automatic segmentation of head-and-shoulder type sequences is difficult. Parameter-based methods cannot accurately estimate the motion of an object in that type of sequence, because a talking head sitting at a desk usually exhibits minimal motion. Moreover, motion-based segmentation methods are computationally expensive and unreliable. Region-based methods have disadvantages such as over-segmentation, and can fail to determine a region-of-interest. Frame-difference-based methods suffer from inaccurate object shape determinations.
Another method for object segmentation utilizes volume growing to obtain the smallest color-consistent components of a video, see, e.g., F. Porikli and Y. Wang, “An unsupervised multi-resolution object extraction algorithm using video-cube,” Proceedings of Int. Conf. Image Process, Thessaloniki, 2001, see also, U.S. patent application Ser. No. 09/826,333 “Method for Segmenting Multi-Resolution Video Objects,” filed by Porikli et al. on Apr. 4, 2001. First, a fast median filter is applied to the video to remove local color irregularities, see, e.g., M. Kopp and W. Purgathofer, “Efficient 3×3 median filter computations,” Technical University, Vienna, 1994. Then, a spatio-temporal data structure is formed from the input video sequence by indexing the image frames and their features. The object information can be propagated forward as well as backward in time by treating consecutive video frames as the planes of a 3D data structure. After the video sequence is filtered, marker points are selected by color gradient. A volume around each marker is grown using color distance. The problem with video volumes is that moving objects are indistinguishable from static objects. For example, with volume growing, a blank wall of a distinct color will form a volume even though it is static.
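The volume-growing step described above can be illustrated with a minimal sketch. It is not the cited implementation: the function names `color_distance` and `grow_volume`, the 6-connected neighborhood, the sum-of-absolute-differences color distance, and the fixed `threshold` are all assumptions chosen for illustration. The video is assumed to be a nested list indexed as `video[t][y][x]`, with each pixel an `(R, G, B)` tuple.

```python
from collections import deque


def color_distance(a, b):
    # Assumed metric: sum of absolute per-channel differences.
    return sum(abs(x - y) for x, y in zip(a, b))


def grow_volume(video, seed, threshold=30):
    """Grow a spatio-temporal volume around a marker point.

    video: nested list indexed as video[t][y][x] -> (R, G, B)
    seed:  (t, y, x) marker point
    Returns the set of (t, y, x) points whose color lies within
    `threshold` of the seed color and that are connected to the seed.
    """
    n_frames, height, width = len(video), len(video[0]), len(video[0][0])
    t0, y0, x0 = seed
    seed_color = video[t0][y0][x0]
    volume = {seed}
    queue = deque([seed])
    # 6-connectivity: two temporal neighbors and four spatial neighbors.
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    while queue:
        t, y, x = queue.popleft()
        for dt, dy, dx in steps:
            nt, ny, nx = t + dt, y + dy, x + dx
            inside = 0 <= nt < n_frames and 0 <= ny < height and 0 <= nx < width
            if not inside or (nt, ny, nx) in volume:
                continue
            if color_distance(video[nt][ny][nx], seed_color) <= threshold:
                volume.add((nt, ny, nx))
                queue.append((nt, ny, nx))
    return volume
```

In this sketch, a uniformly colored region that never moves grows into a volume spanning every frame, exactly as a moving object of uniform color would, which is the limitation noted above.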
A change detection mask (CDM) is a map of pixels that change between the previous and current frames of a pair of frames in a video sequence. A CDM is defined as the color dissimilarity of two frames with respect to a given set of rules. Assuming a stationary camera, consistent objects, and constant lighting conditions, the pixel-wise color difference of a pair of adjacent frames is an indication of moving objects in the scene. However, not all color changes are caused by moving objects. Camera motion, intensity changes and shadows due to non-uniform lighting across video frames, and image noise also contribute to the frame difference. Its computational simplicity makes the CDM practical for real-time applications, see, e.g., C. S. Regazzoni, G. Fabri, and G. Vernazza, “Advanced video-based surveillance system”, Kluwer Academic Pub., 1999. However, using the CDM alone to determine moving objects yields poor segmentation performance.
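As a rough illustration of a CDM, the sketch below marks a pixel as changed when the color difference between two adjacent frames exceeds a threshold. The function name `change_detection_mask`, the sum-of-absolute-differences rule, and the default `threshold` are assumptions made for this example; the "given set of rules" in an actual system may differ.

```python
def change_detection_mask(prev_frame, curr_frame, threshold=30):
    """Return a binary mask with 1 where a pixel's color changed.

    prev_frame, curr_frame: nested lists indexed as frame[y][x] -> (R, G, B)
    threshold: assumed cutoff on the summed per-channel absolute difference
    """
    mask = []
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        mask_row = []
        for prev_pixel, curr_pixel in zip(prev_row, curr_row):
            # Pixel-wise color difference between the two frames.
            diff = sum(abs(a - b) for a, b in zip(prev_pixel, curr_pixel))
            mask_row.append(1 if diff > threshold else 0)
        mask.append(mask_row)
    return mask
```

Because the rule is a single pass of per-pixel arithmetic, it is cheap to compute, but noise, lighting changes, and camera motion can raise the difference above the threshold just as object motion does.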
Therefore, there is a need for an improved, fully automatic method for precisely identifying any number of moving objects in a video, particularly where an object has very little motion relative to the background, e.g., a talking head. The method should integrate both motion and color features in the video over time. The segmentation should be completed in a reasonable amount of time, and should not depend on an initial user segmentation or on homogeneous motion constraints.