Older video standards, such as ISO MPEG-1 and MPEG-2, are relatively low-level specifications primarily dealing with the temporal and spatial compression of entire videos.
Newer video coding standards, such as MPEG-4 and MPEG-7, see “Information Technology—Generic coding of audio/visual objects,” ISO/IEC FDIS 14496-2 (MPEG4 Visual), November 1998, allow arbitrary-shaped video objects to be encoded and decoded as separate video object planes (VOP's). These emerging standards are intended to enable multimedia applications, such as interactive video, where natural and synthetic materials are integrated, and where access is universal. For example, one might want to “cut-and-paste” a moving figure from one video to another. In order to identify the figure, the video must first be “segmented.” Video objects can be segmented either under user control, i.e., semi-automatically, or unsupervised, i.e., fully automatically.
In the semi-automatic case, a user can provide a segmentation for the first frame of the video. The problem then becomes one of video object tracking. In the fully automatic case, the problem is to first identify the video object, then to track the object through time and space. Obviously, a method that requires no user input is preferable.
With VOP's, each frame of a video is segmented into arbitrarily shaped image regions. Each VOP describes a video object in terms of, for example, shape, color, motion, and texture. The exact method of producing VOP's from the video is not defined by the above standards. It is assumed that “natural” objects are represented by shape information, in addition to the usual luminance and chrominance components. Shape data can be provided as a segmentation mask, or as a gray-scale alpha plane to represent multiple overlaid video objects. Because video objects vary extensively with respect to low-level features, such as optical flow, color, and intensity, VOP segmentation is a very difficult problem.
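The difference between the two shape representations can be illustrated with a minimal sketch. The function name `composite`, the 0 to 255 alpha range, and the toy 2x2 planes are assumptions made for illustration; the blend is the ordinary alpha-compositing formula and is not taken from the standards text.

```python
def composite(fg, bg, alpha):
    """Blend a foreground object plane over a background, pixel by pixel.

    fg, bg: 2D lists of luminance values (0..255).
    alpha:  2D list of values in 0..255 (assumed range for this sketch);
            0 = fully background, 255 = fully object.
    A binary segmentation mask is the special case where alpha is 0 or 255.
    """
    out = []
    for fg_row, bg_row, a_row in zip(fg, bg, alpha):
        out.append([
            (a * f + (255 - a) * b + 127) // 255  # rounded integer blend
            for f, b, a in zip(fg_row, bg_row, a_row)
        ])
    return out


# Hypothetical toy data: a bright object over a dark background.
fg = [[200, 200], [200, 200]]
bg = [[40, 40], [40, 40]]

binary_mask = [[255, 0], [255, 0]]   # hard object/background boundary
gray_alpha = [[255, 128], [64, 0]]   # soft transitions for overlaid objects

hard = composite(fg, bg, binary_mask)  # [[200, 40], [200, 40]]
soft = composite(fg, bg, gray_alpha)   # [[200, 120], [80, 40]]
```

With the binary mask every output pixel is either pure object or pure background, whereas the gray-scale plane yields intermediate values where objects overlap or edges are soft.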
A number of segmentation methods are known. Region-based segmentation methods include mesh-based, motion model-based, and split-and-merge. Because these methods rely on spatial features, such as luminance, they may produce false object boundaries, and in some cases, foreground video objects may be merged into the background. More recently, morphological spatio-temporal segmentation has been used. There, information from both the spatial (luminance) and temporal (motion) domains is tracked using vectors. This complex method can erroneously assign a spatial region to a temporal region, and the method is difficult to apply to a video including more than one object.
Generally, unsupervised object segmentation methods can be grouped into three broad classes: (1) region-based methods that use a homogeneous color criterion, see M. Kunt, A. Ikonomopoulos, and M. Kocher, “Second generation image coding,” Proc. IEEE, vol. 73, pp. 549-574, 1985, (2) object-based approaches that use a homogeneous motion criterion, and (3) object tracking.
Although color-based methods work well in some situations, for example, where the video is relatively simple, clean, and fits the model well, they lack generality and robustness. The main problem arises from the fact that a single video object can include multiple different colors.
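This limitation can be seen in a minimal sketch of a homogeneous-color criterion. The function `label_regions`, the tolerance `tol`, and the toy frame are hypothetical; the code is a plain 4-connected flood fill on a single-channel image, used here only as a stand-in for region-based methods, and is not the method of the cited reference.

```python
from collections import deque


def label_regions(image, tol):
    """Group 4-connected pixels whose values differ from a neighbor
    by at most `tol`. Returns (label map, number of regions)."""
    h, w = len(image), len(image[0])
    labels = [[-1] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if labels[y][x] != -1:
                continue  # pixel already belongs to a region
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and labels[ny][nx] == -1
                            and abs(image[ny][nx] - image[cy][cx]) <= tol):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
            count += 1
    return labels, count


# Hypothetical frame: background value 10, and ONE object made of
# two differently colored parts (values 200 and 90).
frame = [
    [10, 10, 10, 10, 10],
    [10, 200, 200, 10, 10],
    [10, 90, 90, 10, 10],
    [10, 10, 10, 10, 10],
]

labels, n = label_regions(frame, tol=20)       # n == 3: object split in two
_, n_loose = label_regions(frame, tol=110)     # n_loose == 1: object lost
```

With a tight tolerance the single object is split into two regions; loosening the tolerance enough to join its two parts also merges them into the background, so no single threshold recovers the object in this toy case.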
Motion-oriented segmentation methods start with an assumption that a semantic video object has homogeneous motion, see B. Duc, P. Schroeter, and J. Bigun, “Spatio-temporal robust motion estimation and segmentation,” Proc. 6th Int. Conf. Comput. Anal. Images and Patterns, pp. 238-245, 1995. These methods use either boundary placement schemes or region extraction schemes, see J. Wang and E. Adelson, “Representing moving images with layers,” IEEE Trans. Image Proc., vol. 3, 1994. Most of these methods are based on rough optical flow estimation, or unreliable spatio-temporal segmentation. As a result, these methods suffer from inaccurate object boundaries.
The last class of methods for object segmentation uses tracking, see J. K. Aggarwal, L. S. Davis, and W. N. Martin, “Correspondence processes in dynamic scene analysis,” Proc. IEEE, vol. 69, pp. 562-572, May 1981. However, tracking methods need user interaction, and their performance depends heavily on the initial segmentation. Most object extraction methods treat object segmentation as an inter- or intra-frame processing problem with some additional parametric motion model assumptions or smoothing constraints, and disregard the 3D aspect of the video data.
Therefore, there is a need for a fully automatic method for precisely segmenting any number of objects in a video into multiple levels of resolution. The method should use both motion and color features over time. The segmentation should complete in a reasonable amount of time, and should depend neither on an initial user segmentation nor on homogeneous motion constraints.