Older video standards, such as ISO MPEG-1 and MPEG-2, are relatively low-level specifications that deal primarily with the temporal and spatial compression of video signals. With these standards, one can achieve high compression ratios over a wide range of applications.
Newer video coding standards, such as MPEG-4 (see "Information Technology--Generic coding of audio/visual objects," ISO/IEC FDIS 14496-2 (MPEG4 Visual), November 1998), allow arbitrary-shaped objects to be encoded and decoded as separate video object planes (VOP's). These emerging standards are intended to enable multimedia applications, such as interactive video, where natural and synthetic materials are integrated, and where access is universal. For example, one might want to "cut-and-paste" a moving figure from one video to another. In order to identify the figure, the video must first be "segmented." Given the amount of video, both archived and newly acquired, it is desirable for the segmentation process to be either fully automatic or semi-automatic.
In the semi-automatic case, one may provide a segmentation for the first frame. The problem then becomes one of object tracking. In the automatic case, the problem is first to identify the object, and then to track it. In either case, the segmentation process should attempt to minimize the input needed from the user; obviously, no input is optimal.
With VOP's, each frame of a video sequence is segmented into arbitrarily shaped image regions. Each VOP describes a video object in terms of, for example, shape, motion, and texture. The exact method of producing VOP's from the source imagery is not defined by the standards. It is assumed that "natural" objects are represented by shape information, in addition to the usual luminance and chrominance components. Shape data can be provided as a segmentation mask, or as a gray-scale alpha plane to represent multiple overlaid objects. Because video objects vary extensively with respect to low-level features, such as optical flow, color, and intensity, VOP segmentation is a very difficult problem.
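The two forms of shape data mentioned above can be illustrated with a minimal sketch. It is not taken from the standard; the function names `composite` and `mask_to_alpha`, the 8-bit alpha range, and the representation of planes as nested lists of luminance values are all assumptions made for illustration. A binary segmentation mask is treated as the special case of an alpha plane whose values are only 0 or 255.

```python
def mask_to_alpha(mask):
    """Convert a binary segmentation mask (0/1) to an 8-bit alpha plane.

    Hypothetical helper: 1 becomes fully opaque (255), 0 fully transparent.
    """
    return [[255 if m else 0 for m in row] for row in mask]


def composite(background, obj, alpha):
    """Overlay an object plane on a background using an 8-bit alpha plane.

    All three arguments are equally sized lists of rows of integers.
    Each output pixel is the alpha-weighted average of the object and
    background pixels, rounded to the nearest integer.
    """
    out = []
    for bg_row, obj_row, a_row in zip(background, obj, alpha):
        out.append([
            (a * o + (255 - a) * b + 127) // 255
            for b, o, a in zip(bg_row, obj_row, a_row)
        ])
    return out


background = [[10, 10], [10, 10]]
figure = [[200, 200], [200, 200]]

# Binary mask: the object replaces the background where the mask is 1.
hard = composite(background, figure, mask_to_alpha([[1, 0], [0, 1]]))

# Gray-scale alpha plane: partial values blend the overlaid object.
soft = composite(background, figure, [[255, 128], [0, 64]])
```

With a binary mask, each output pixel comes entirely from either the object or the background. Intermediate alpha values blend the two, which is what allows several overlaid objects to be represented.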
A number of segmentation methods are known. Region-based segmentation methods include mesh-based, motion model-based, and split-and-merge. Because these methods rely on spatial features, such as luminance, they may produce false contours, and in some cases, foreground objects may be merged into the background. More recently, morphological spatio-temporal segmentation has been used. There, information from both the spatial (luminance) and temporal (motion) domains is tracked using vectors. This complex method might erroneously assign a spatial region to a temporal region, and the method is difficult to apply to a video sequence that includes more than one object.
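The failure modes of luminance-only segmentation can be seen in a toy sketch. The code below is not any of the methods named above; it is a simplified 4-connected region-growing pass, written under the assumption that neighboring pixels join a region when their luminance differs by no more than a tolerance. The function name `label_regions` and the parameter `tol` are hypothetical.

```python
from collections import deque


def label_regions(luma, tol):
    """Label 4-connected regions of similar luminance.

    Two neighboring pixels are placed in the same region when their
    luminance values differ by at most `tol`. Returns the label plane
    and the number of regions found.
    """
    h, w = len(luma), len(luma[0])
    labels = [[-1] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if labels[y][x] != -1:
                continue  # pixel already belongs to a region
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and labels[ny][nx] == -1
                            and abs(luma[ny][nx] - luma[cy][cx]) <= tol):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
            count += 1
    return labels, count


# A foreground pixel (100) close in luminance to the background (98)
# is absorbed into the background region.
_, merged = label_regions([[98, 98, 98], [98, 100, 98], [98, 98, 98]], tol=5)

# A smoothly shaded surface is split into several regions,
# producing contours that correspond to no real object boundary.
_, split = label_regions([[0, 10, 20, 30]], tol=5)
```

In the first frame, the object and the background end up in a single region. In the second, one shaded surface is broken into four regions. Both outcomes follow from the use of a purely spatial feature.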