The term image segmentation refers to the partition of an image into a set of non-overlapping regions that together cover it. An object is composed of one or more segments, and image segmentation is thus closely associated with “object extraction”, a term whose definition is well known. Image segmentation is probably one of the most important low-level techniques in vision, since virtually every computer vision algorithm incorporates some form of segmentation. In general, a segment is a group of pixels that share common properties. A good image segmentation has the following properties: regions should be uniform and homogeneous with respect to some characteristic such as gray tone or texture; region interiors should be simple and without many small holes; adjacent regions should have significantly different values with respect to the characteristic on which they are uniform; and the boundaries of each segment should be simple, not ragged, and spatially accurate.
The motivation for the use of image segmentation as a preliminary stage for image analysis lies in the desire to transform the given image into a more compact and coherent representation, which emphasizes similar properties (attributes) of the image. We will partition the image into a number of segments, and then classify each segment as static or moving (in a video sequence) according to its relevant properties.
Existing Segmentation Algorithms
Traditional video standards such as MPEG-1, MPEG-2, H.261 or H.263 are low-level techniques in the sense that no segmentation or analysis of the scene is required. They can achieve high compression ratios and are suitable for a wide range of applications. However, the increasing demands of multimedia applications and content-based interactivity dictate the need for new video coding schemes that are content-based.
The new video coding standard MPEG-4 (T. Sikora, IEEE Trans. on Circuits and Syst. for Video Technol., 7, 19–31, 1997) is a trigger and source for the development of many segmentation algorithms. MPEG-4 takes advantage of a prior decomposition of sequential video frames into video object planes (VOPs) so that each VOP represents one moving object. Each frame of the input sequence is segmented into arbitrarily shaped image regions (i.e. VOPs) such that each VOP describes one semantically meaningful object or video content of interest. A video object layer is assigned to each VOP, containing shape, motion and texture information. The following summarizes some of the most important motion segmentation and VOP generation techniques that have been proposed to date.
Decomposing a video sequence into VOPs is a very difficult task, and comparatively little research has been undertaken in this field. An intrinsic problem of VOP generation is that objects of interest are not homogeneous with respect to low-level features such as color, intensity, or optical flow. Thus, conventional segmentation algorithms will fail to obtain meaningful partitions. In addition to the many research papers and scientific activities reported below, many books have been written on the subject, for example: P. Kuhn, “Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation”, Kluwer Academic Publishers, 1999; I-Jong Lin and S. Y. Kung, “Video Object Extraction and Representation: Theory and application”, Kluwer Academic Publishers, 2000; K. N. Ngan, T. Meier and D. Chai, “Advanced Video Coding Principles and Techniques”, Elsevier 1999; A. Puri and T. Chen (Editors), “Multimedia Systems, Standards, and Networks”, Marcel Dekker, 2000; and G. Tziritas and C. Labit, “Motion Analysis for Image Sequence”, Elsevier, 1994, to name a few.
Motion as a Source for Segmentation
Moving objects are often characterized by a coherent motion that is distinct from that of the background. This makes motion a very useful feature for segmenting video sequences. It can complement other features such as color, intensity, or edges that are commonly used for segmentation of still images. Because motion is needed for classification, the term has to be defined. Let I(x,y;k) denote the intensity or luminance of pixel (x,y) in frame k. Following the definitions in (A. M. Tekalp, Ed., Digital Video Processing, Prentice-Hall, 1995), we have to distinguish between two-dimensional (2-D) motion and apparent motion. The projection of the three-dimensional (3-D) motion onto the image plane is referred to as 2-D motion. It is the true motion that we would like to detect automatically. Apparent motion, on the other hand, is what we perceive as motion, and it is induced by temporal changes in the image intensity I(x,y;k). Apparent motion can be characterized by a correspondence vector field or by an optical flow field. A correspondence vector describes the displacement of a pixel between two frames, whereas the optical flow (u,v) at pixel (x,y) in frame k refers to a velocity and is defined as
$$(u, v) = \left( \frac{\partial x}{\partial t}, \frac{\partial y}{\partial t} \right) \qquad (1)$$
The optical flow and correspondence vectors are related. From Eq. (1) it can also be seen that apparent motion is highly sensitive to noise because of the derivatives, which can cause largely incorrect results. Furthermore, moving objects or regions must contain sufficient texture to generate optical flow, because the luminance in the interior of moving regions with uniform intensity remains constant. Unfortunately, we can only observe apparent motion.
Motion Estimation
In addition to the difficulties mentioned above, motion estimation algorithms have to solve the so-called occlusion and aperture problems. The occlusion problem refers to the fact that no correspondence vectors exist for covered and uncovered background. To illustrate the aperture problem, we first introduce the optical flow constraint (OFC). The OFC assumes that the intensity remains constant along the motion trajectory (A. M. Tekalp, Ed., Prentice-Hall, 1995), i.e.,
$$\frac{d}{dt} I(x, y; k) = \frac{\partial I}{\partial x} \cdot \frac{\partial x}{\partial t} + \frac{\partial I}{\partial y} \cdot \frac{\partial y}{\partial t} + \frac{\partial I}{\partial t} = \langle \nabla I, (u, v) \rangle + \frac{\partial I}{\partial t} = 0 \qquad (2)$$
where ⟨·,·⟩ denotes the vector inner product. The aperture problem states that the number of unknowns is larger than the number of observations: Eq. (2) provides one scalar equation per pixel for the two unknowns u and v. From the optical flow constraint Eq. (2) it follows that only the flow component in the direction of the gradient ∇I, the so-called normal flow, can be estimated. The orthogonal component can take on any value without changing the inner product, and is therefore not defined. Thus, additional assumptions are necessary to obtain a unique solution. These usually impose some smoothness constraints on the optical flow field to achieve continuity.
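As a small illustration of Eq. (2), the following sketch computes the normal flow at a single pixel from its spatial and temporal derivatives. The function names `normal_flow` and `ofc_residual` and the `eps` guard for untextured pixels are assumptions made for this example and do not come from the cited work.

```python
def normal_flow(ix, iy, it, eps=1e-12):
    """Return the normal flow (u_n, v_n) implied by the OFC at one pixel.

    ix, iy: spatial derivatives of I; it: temporal derivative of I.
    The OFC ix*u + iy*v + it = 0 fixes only the flow component along
    the gradient, which is -it * (ix, iy) / (ix**2 + iy**2).
    """
    g2 = ix * ix + iy * iy
    if g2 < eps:
        # Uniform intensity: no gradient, so the OFC carries no information.
        return None
    scale = -it / g2
    return (scale * ix, scale * iy)


def ofc_residual(ix, iy, it, u, v):
    """Left-hand side of Eq. (2) for a candidate flow (u, v)."""
    return ix * u + iy * v + it
```

Adding any component orthogonal to the gradient to the returned flow leaves `ofc_residual` at zero, which is the aperture problem in miniature.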
There are two ways of describing motion fields:
1. Nonparametric representation, in which a dense field is estimated where each pixel is assigned a correspondence or flow vector. Block matching is typically applied, in which the current frame is subdivided into blocks of equal size, and for each block the best match in the next (or previous) frame is computed. All pixels of a block are assumed to undergo the same translation and are assigned the same correspondence vector. The selection of the block size is crucial. Block matching is unable to cope with rotations and deformations. Nevertheless, its simplicity and relative robustness make it a popular technique. Nonparametric representations are not suitable for segmentation, because an object moving in the 3-D space generates a spatially varying 2-D motion field even within the same region, except for the simple case of pure translation. This is the reason why parametric models are commonly used in segmentation algorithms. However, dense field estimation is often the first step in calculating the model parameters.
2. Parametric models require a segmentation of the scene, which is our ultimate goal, and describe the motion of each region by a set of a few parameters. The motion vectors can then be synthesized from these model parameters. A parametric representation is more compact than a dense field description, and less sensitive to noise, because many pixels are treated jointly to estimate a few parameters.
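The block matching step mentioned in item 1 can be sketched as follows. This is a minimal exhaustive-search version under stated assumptions: the sum of absolute differences (SAD) as the matching cost, and the hypothetical parameters `block` and `search` for the block size and search range. Practical implementations usually use faster search strategies.

```python
import numpy as np


def block_match(cur, ref, top, left, block=8, search=4):
    """Find the translation (dy, dx) of one block by exhaustive search.

    The block of `cur` with top-left corner (top, left) is compared with
    every candidate block in `ref` displaced by at most `search` pixels,
    using the sum of absolute differences (SAD) as the matching cost.
    """
    target = cur[top:top + block, left:left + block].astype(np.float64)
    best_cost, best_vec = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue
            cand = ref[y:y + block, x:x + block].astype(np.float64)
            cost = np.abs(target - cand).sum()
            if cost < best_cost:
                best_cost, best_vec = cost, (dy, dx)
    return best_vec, best_cost
```

All pixels of the block would then be assigned the returned vector, which is why the choice of `block` matters: large blocks straddle object boundaries, while small blocks match ambiguously.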
In order to derive a model or transformation that describes the motion of pixels between successive frames, assumptions on the scene and objects have to be made. Let (X,Y,Z) and (X′,Y′,Z′) denote the 3-D coordinates of an object point in frame k and k+1, respectively. The corresponding image plane coordinates are (x,y) and (x′,y′). If a 3-D object undergoes translation, rotation and linear deformation, the 3-D displacement of a point on the object is given in (G. Wolberg, “Digital Image Warping”. IEEE, 1984)
$$\begin{pmatrix} X' \\ Y' \\ Z' \end{pmatrix} = \begin{pmatrix} s_{11} & s_{12} & s_{13} \\ s_{21} & s_{22} & s_{23} \\ s_{31} & s_{32} & s_{33} \end{pmatrix} \cdot \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} + \begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix} \qquad (3)$$
It is very common to model 3-D objects by (piecewise) planar patches whose points satisfy
$$aX + bY + cZ = 1. \qquad (4)$$
If such a planar object is moving according to Eq. (3), the affine motion model is obtained under orthographic (parallel) projection, and the eight-parameter model under perspective (central) projection.
The 3-D coordinates are related to the image plane coordinates under the orthographic projection by
$$(x, y) = (X, Y) \quad \text{and} \quad (x', y') = (X', Y') \qquad (5)$$
This projection is computationally efficient and provides a good approximation if the distance between the objects and the camera is large compared to the depth of the objects. From Eqs. (3)–(5), it follows that
$$x' = a_1 x + a_2 y + a_3, \qquad y' = a_4 x + a_5 y + a_6 \qquad (6)$$
which is known as the affine model. In the case of the more realistic perspective projection, we get
$$(x, y) = \left( f \frac{X}{Z}, f \frac{Y}{Z} \right) \quad \text{and} \quad (x', y') = \left( f \frac{X'}{Z'}, f \frac{Y'}{Z'} \right). \qquad (7)$$
Together with Eqs. (3) and (4), this results in the eight-parameter model
$$x' = \frac{a_1 x + a_2 y + a_3}{a_7 x + a_8 y + 1}, \qquad y' = \frac{a_4 x + a_5 y + a_6}{a_7 x + a_8 y + 1} \qquad (8)$$
Both the affine and the eight-parameter model are very popular; however, many other transformations exist, depending on the assumptions made.
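To make the relation between the two models concrete, the sketch below evaluates Eq. (6) and Eq. (8) for one pixel. The function names are hypothetical and chosen only for this illustration.

```python
def affine_model(x, y, a):
    """Map (x, y) to (x', y') with the affine model of Eq. (6).

    a: sequence (a1, ..., a6).
    """
    a1, a2, a3, a4, a5, a6 = a
    return (a1 * x + a2 * y + a3, a4 * x + a5 * y + a6)


def eight_parameter_model(x, y, a):
    """Map (x, y) to (x', y') with the eight-parameter model of Eq. (8).

    a: sequence (a1, ..., a8); both coordinates share one denominator.
    """
    a1, a2, a3, a4, a5, a6, a7, a8 = a
    denom = a7 * x + a8 * y + 1.0
    return ((a1 * x + a2 * y + a3) / denom, (a4 * x + a5 * y + a6) / denom)
```

With a7 = a8 = 0 the denominator equals 1 and the eight-parameter model reduces to the affine model.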
Parametric models describe each region by one set of parameters that is either estimated by fitting a model in the least squares sense to a dense motion field obtained by a nonparametric method, or directly from the luminance signal I(x,y;k) as in M. Hotter and R. Thoma, Signal Processing, vol. 15, no. 3, pp. 315–334, 1988, and H. G. Musmann, M. Hotter, and J. Ostermann, Signal Processing: Image Commun., vol. 1, pp. 117–138, 1989. Although parametric representations are less noise sensitive, they still suffer from the intrinsic problems of motion estimation. One has to be careful when interpreting an estimated flow field. Most likely, it is necessary to include additional information such as color or intensity to accurately and reliably detect boundaries of moving objects.
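The first option, fitting a model to a dense field, can be sketched for the affine model of Eq. (6). The example assumes that the dense field consists of correspondence vectors (u, v), so that x′ = x + u and y′ = y + v; the function name `fit_affine_to_flow` is hypothetical.

```python
import numpy as np


def fit_affine_to_flow(xs, ys, us, vs):
    """Least-squares fit of the affine parameters a1..a6 of Eq. (6).

    xs, ys: pixel coordinates of one region; us, vs: correspondence
    vectors at those pixels, so that x' = x + u and y' = y + v.
    """
    xs, ys, us, vs = (np.asarray(arr, dtype=np.float64) for arr in (xs, ys, us, vs))
    design = np.column_stack([xs, ys, np.ones_like(xs)])
    targets = np.column_stack([xs + us, ys + vs])
    # Columns of `params` hold (a1, a2, a3) and (a4, a5, a6).
    params, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return params.T.ravel()
```

Because all pixels of the region contribute to six unknowns, isolated errors in the flow field have limited influence on the estimate, which is the noise advantage noted above.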
Motion Segmentation
A classical approach to motion segmentation is to estimate a motion field, followed by a segmentation of the scene based only on this motion information (see G. Adiv, IEEE Trans. PAMI, PAMI-7, 384–401, 1985; M. Hotter and R. Thoma, Signal Processing, vol. 15, no. 3, pp. 315–334, 1988; and M. M. Chang, A. M. Tekalp and M. I. Sezan, IEEE Int. Conf. Acoust. Speech, Signal Processing, ICASSP93, Minneapolis, Minn., V, 33–36, 1993). Adiv proposes a hierarchically structured two-stage algorithm. The flow field is first segmented into connected components using the Hough transform, such that the motion of each component can be modeled by an affine transformation (Eq. 6). Adjacent components are then merged into segments if they obey the same 8-parameter quadratic motion model. In the second stage, neighboring segments that are consistent with the same 3-D motion (Eq. 3) are combined, resulting in the final segmentation.
The Bayesian framework is popular among the methods that achieve motion segmentation. A number of references detail it, including: P. Bouthemy and E. Francois, Int. J. Comput. Vision, 10:2, pp. 157–182, 1993; Chang et al. 1993 (see above); M. M. Chang, M. I. Sezan and A. M. Tekalp, ICASSP94, pp. 221–234, 1994; D. W. Murray and B. F. Buxton, IEEE PAMI, PAMI-9, pp. 220–228, 1987; and C. Stiller, ICASSP93, pp. 193–196, 1993, and in IEEE Trans. Image Processing, 6, pp. 234–250, 1997. The key idea is to find the maximum a posteriori (MAP) estimate of the segmentation X for some given observation O, i.e. to maximize P(X|O) ∝ P(O|X)P(X). Murray and Buxton used an estimated flow field as the observation O. The segmentation or prior model X is assumed to be a sample of a Markov random field (MRF) to enforce continuity of the segmentation labels, and thus P(X) is a Gibbs distribution. The energy function of the MRF consists of a spatial smoothness term, a temporal continuity term, and a line field as in D. Geman and D. Geman, IEEE PAMI, PAMI-6, 721–741, 1984, to allow for motion discontinuities. To define the observation model P(O|X), the parameters of a quadratic flow model (G. Adiv, IEEE Trans. PAMI, PAMI-7, 384–401, 1985) are calculated for each region by linear regression. The resulting probability function P(O|X)P(X) is maximized by simulated annealing (Geman and Geman, above). The major drawbacks of this proposal are the computational complexity and the need to specify the number of objects likely to be found. A similar approach was taken by Bouthemy and Francois, above. The energy function of their MRF consists only of a spatial smoothness term. The observation contains the temporal and spatial gradients of the intensity function, which is essentially the same information as the optical flow due to the OFC (Eq. 2). For each region, the affine motion parameters (Eq. 6) are computed in the least-squares sense, and P(O|X) models the deviation of this synthesized flow from the optical flow constraint (Eq. 2) by zero-mean white Gaussian noise. The optimization is performed by iterated conditional modes (ICM) (J. Besag, J. Royal Statist. Soc. B, vol. 48, no. 3, pp. 259–279, 1986), which is faster than simulated annealing but likely to get trapped in a local minimum. To achieve temporal continuity, the segmentation result of the previous frame is used as an initial estimate for the current frame. The algorithm then alternates between updating the segmentation labels X, estimating the affine motion parameters, and updating the number of regions in the scene.
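The MAP formulation and the ICM optimization can be illustrated with a deliberately simplified sketch. It is not the energy function of any of the cited papers: the observation is assumed to be a single scalar per pixel, the observation model is a Gaussian around one fixed value per label, and the prior is a plain Potts smoothness term over 4-neighbors. The names `icm_segment`, `beta` and `sigma` are assumptions of this example.

```python
import numpy as np


def icm_segment(obs, means, beta=1.0, sigma=1.0, n_iter=5):
    """Approximate MAP labeling by iterated conditional modes (ICM).

    obs:   2-D array of scalar observations (one per pixel).
    means: 1-D sequence with one model value per label.
    Energy = sum of Gaussian data terms + beta * (number of 4-neighbor
    pairs with different labels), i.e. a Potts smoothness prior.
    """
    obs = np.asarray(obs, dtype=np.float64)
    means = np.asarray(means, dtype=np.float64)
    label_ids = np.arange(len(means))
    data = (obs[..., None] - means) ** 2 / (2.0 * sigma ** 2)
    labels = data.argmin(axis=-1)  # maximum-likelihood initialization
    h, w = obs.shape
    for _ in range(n_iter):
        changed = False
        for y in range(h):
            for x in range(w):
                cost = data[y, x].copy()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w:
                        # Penalize every label that differs from this neighbor.
                        cost += beta * (label_ids != labels[ny, nx])
                best = int(cost.argmin())
                if best != labels[y, x]:
                    labels[y, x] = best
                    changed = True
        if not changed:
            break
    return labels
```

Each pixel update can only lower the energy, so the procedure converges quickly, but only to a local minimum that depends on the initial labeling. This is the trade-off against simulated annealing described above.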
The techniques of Adiv, Bouthemy and Francois, and Murray and Buxton use only optical flow data in the segmentation decision, and hence their performance is limited by the accuracy of the estimated flow field. In contrast, Chang et al., ICASSP93, 1993, incorporated intensity information into the observation O. The energy function of the MRF includes a spatial continuity term and a motion-compensated temporal term to enforce temporal continuity. Two methods to generate a synthesized flow field for each region were proposed: the eight-parameter quadratic model of Adiv, and the mean flow vector of the region calculated from the given field in O. For the conditional probability P(O|X) it is assumed that both the deviation of the observed flow from the synthesized flow, and the difference between the gray level of a pixel and the mean gray level of the region it belongs to, obey zero-mean Gaussian distributions. By controlling the variances of these two distributions, more weight is put on the flow data where it is reliable, i.e., for small values of the displaced frame difference (DFD), and more weight on the intensity in areas with unreliable flow data. The optimization is then performed by ICM as done by Bouthemy and Francois. The results are unsatisfactory because the method tends to over-segment the scene, and it is computationally expensive.
It is possible to treat motion estimation and segmentation jointly in the Bayesian framework (see for example Chang et al., ICASSP94, 1994; Stiller, ICASSP93, 1993 and Stiller, IEEE Trans. Image Processing, 6, 234–250, 1997). In this case, the observation O consists only of the gray-level intensity, and both the segmentation and the motion field have to be estimated. Chang et al., ICASSP94, 1994, used both a parametric and a dense correspondence field representation of the motion, with the parameters of the eight-parameter model (Eq. 8) being obtained in the least squares sense from the dense field.
In the technique proposed by C. Stiller, IEEE Int. Conf. Acoust. Speech, Signal Processing, ICASSP93, Minneapolis, Minn., V, 193–196, 1993, the objective function consists of two terms. The DFD generated by the dense motion field is modeled by a zero-mean generalized Gaussian distribution, and an MRF ensures segment-wise smoothness of the motion field and spatial continuity of the segmentation. In C. Stiller, IEEE Trans. Image Processing, 6, 234–250, 1997, the DFD is also assumed to obey a zero-mean generalized Gaussian distribution; however, occluded regions are detected, and no correspondence is required for them.
Techniques that make use of Bayesian inference and model images by Markov random fields are more plausible than some rather ad-hoc methods. They can also easily incorporate mechanisms to achieve spatial and temporal continuity. On the other hand, these approaches suffer from high computational complexity, and many algorithms need the number of objects or regions in the scene as an input parameter.
Hierarchically structured segmentation algorithms were proposed (M. Hotter and R. Thoma, Signal Processing, vol. 15, no. 3, pp. 315–334, 1988; H. G. Musmann, M. Hotter, and J. Ostermann, Signal Processing: Image Commun., vol. 1, pp. 117–138, 1989; N. Diehl, Signal Processing: Image Commun., vol. 3, pp. 23–56, 1991). A change detector divides the current frame into changed and unchanged regions, and each connected changed region is assumed to correspond to one object. Starting from the largest changed region, the motion parameters for this object are estimated directly from the spatio-temporal image intensity and gradient. If the prediction error after motion compensation is too large, this object is further subdivided and analyzed in subsequent levels of hierarchy. The algorithm sequentially refines the segmentation and motion estimation until all changed regions are accurately compensated. Because these techniques alternate between analyzing the image and synthesizing, they have been described as object-oriented analysis-synthesis algorithms. In Hotter and Thoma, and in Musmann, Hotter and Ostermann, the eight-parameter motion model (Eq. 8) is used, and the parameters are obtained by a direct method. The luminance function is approximated by a Taylor series expansion, so that the frame difference can be expressed in terms of spatial intensity gradients and the unknown parameters. Both frame differences and gradients are easy to compute, and the model parameters are obtained by linear regression. A 12-parameter quadratic motion model that describes a parabolic surface undergoing the 3-D motion (Eq. 3) under parallel projection is proposed by Diehl. An iterative technique similar to the Newton-Raphson algorithm estimates the parameters by minimizing the MSE between the motion-compensated and the current frame. Edge information is incorporated into the segmentation algorithm to improve the accuracy of boundaries.
Morphological tools such as the watershed algorithm and simplification filters are becoming increasingly popular for segmentation and coding (J. G. Choi, S. W. Lee, and S. D. Kim, IEEE Trans. Circuits Syst. Video Technol., vol. 7, pp. 279–286, 1997; F. Marques and C. Molina, in SPIE Visual Commun. Image Processing, VCIP'97, San Jose, Calif., vol. 3024, pp. 190–199, 1997; F. Meyer and S. Beucher, J. Visual Commun. Image Representation, vol. 1, pp. 21–46, September 1990; P. Salembier and M. Pardas, IEEE Trans. Image Processing, vol. 3, pp. 639–651, 1994; P. Salembier, P. Brigger, J. R. Casas, and M. Pardas, IEEE Trans. Image Processing, vol. 5, pp. 881–898, 1996). An introduction, a discussion of potential problems, and several applications to segmentation are presented by Meyer and Beucher. Salembier and Pardas describe a segmentation algorithm that has a typical structure for morphological approaches. In a first step, the image is simplified by the morphological filter “open-close by reconstruction” to remove small dark and bright patches. The size of these patches depends on the structuring element used. The color or intensity of the resulting simplified images is relatively homogeneous. An attractive property of these filters is that, unlike low-pass or median filters, they do not blur or change contours. The following marker extraction step detects the presence of homogeneous areas, for example, by identifying large regions of constant color or luminance. This step often contains most of the know-how of the algorithm. Each extracted marker is then the seed for a region in the final segmentation. Undecided pixels are assigned a label in the decision step, the so-called watershed algorithm, which is a technique similar to region growing. The watershed algorithm is well defined and can be efficiently implemented by hierarchical FIFO queues. A quality estimation is performed in Salembier and Pardas as a last step to determine which regions require resegmentation.
The proposed segmentation by Salembier et al., 1996, above, is very similar, but an additional projection step is incorporated that warps the previous partition onto the current frame. This projection, which is also computed by the watershed algorithm, ensures temporal continuity and linking of the segmentation. The result is an over-segmentation.
The segmentation algorithms in Meyer and Beucher, Salembier and Pardas, 1994, and Salembier, Brigger, Casas, and Pardas, 1996, are not true video segmentation techniques. They consider video sequences to be 3-D signals and extend conventional 2-D methods, although the time axis does not play the same role as the two spatial axes. A morphological video segmentation algorithm was proposed by J. G. Choi, S. W. Lee, and S. D. Kim, IEEE Trans. Circuits Syst. Video Technol., vol. 7, pp. 279–286, 1997. Their marker extraction step detects areas that are homogeneous not only in luminance but also in motion, so-called joint markers. For that, intensity markers are extracted as in Salembier and Pardas, 1994, and affine motion parameters (Eq. 6) are calculated for each marker by linear regression from a dense flow field. Intensity markers for which the affine model is not accurate enough are split into smaller markers that are homogeneous. As a result, multiple joint markers might be obtained from a single intensity marker. The watershed algorithm also uses a joint similarity measure that incorporates luminance and motion. In a last stage, the segmentation is simplified by merging regions with similar affine motions. A drawback of this technique is the lack of temporal correspondence to enforce continuity in time.
Morphological segmentation techniques are computationally efficient, and there is no need to specify the number of objects as with some Bayesian approaches, because this is determined automatically by the marker or feature extraction step. However, due to its nature, the watershed algorithm suffers from the problems associated with region-growth techniques.
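The marker-driven decision step of the morphological approaches can be sketched as a flooding procedure. This is a simplified variant written for illustration, not the algorithm of the cited papers: it uses a single priority queue with an insertion counter to imitate a hierarchy of FIFO queues, considers 4-neighbors only, and lets an undecided pixel take the label of whichever region reaches it first. The function name `marker_flood` is hypothetical.

```python
import heapq


def marker_flood(image, markers):
    """Grow labeled markers over a 2-D image in order of pixel value.

    image:   list of rows of numbers (e.g. gradient magnitude).
    markers: list of rows of ints; 0 means undecided, >0 is a seed label.
    Pixels are processed from a priority queue keyed on their value;
    a running counter keeps first-in, first-out order within one level.
    """
    h, w = len(image), len(image[0])
    labels = [row[:] for row in markers]
    heap, order = [], 0
    for y in range(h):
        for x in range(w):
            if labels[y][x] > 0:
                heapq.heappush(heap, (image[y][x], order, y, x))
                order += 1
    while heap:
        _, _, y, x = heapq.heappop(heap)
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] == 0:
                # An undecided neighbor inherits the label of the pixel
                # that reaches it first.
                labels[ny][nx] = labels[y][x]
                heapq.heappush(heap, (image[ny][nx], order, ny, nx))
                order += 1
    return labels
```

The number of output regions equals the number of distinct marker labels, which is why no object count has to be supplied. The result also inherits the usual region-growing weakness that a missing or misplaced marker cannot be corrected later.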
The algorithms described so far are mainly focused on coding. They segment video sequences into regions that are homogeneous with respect to motion and possibly color or luminance. For content-based functionalities as in MPEG-4, we would like to partition the frames into objects that are semantically meaningful to the human observer. Thus, the above techniques will fail in many practical situations where objects do not correspond to partitions based on simple features like motion or color. Segmentation algorithms that specifically address VOP generation have been proposed, many of them only recently, with the development of the new video coding standard MPEG-4 (F. Marques and C. Molina, 1997; R. Mech and M. Wollborn, in IEEE Int. Conf. Acoust., Speech, Signal Processing, ICASSP'97, Munich, Germany, vol. 4, pp. 2657–2660, 1997; T. Meier and K. N. Ngan in ISO/IEC JTC1/SC29/WG11 MPEG97/m2238, Stockholm, Sweden, 1997; A. Neri, S. Colonnese, G. Russo, and P. Talone, Signal Processing, vol. 66, no. 2, pp. 219–232, 1998; and J. Y. A. Wang and E. H. Adelson, IEEE Trans. Image Processing, vol. 3, pp. 625–638, 1994).
Wang and Adelson, 1994, proposed a layered representation of image sequences that corresponds to the VOP technique used by MPEG-4. The current frame is segmented based on motion with each object or layer being modeled by an affine transformation (6). The algorithm starts by estimating the optical flow field, and then subdivides the frame into square blocks. The affine motion parameters are computed for each block by linear regression to get an initial set of motion hypotheses. The pixels are then grouped by an iterative adaptive K-means clustering algorithm. Pixel (x,y) is assigned to hypothesis or layer i if the difference between the optical flow at (x,y) and the flow vector synthesized from the affine parameters of layer i is smaller than for any other hypothesis. To construct the layers, the information of a longer sequence is necessary. The frames are warped according to the affine motion of the layers such that coherently moving objects are aligned. A temporal median filter is then applied to obtain a single representative image for each object. This proposal has several disadvantages. If in a sequence different views of the same object are shown, it is not possible to represent that object by a single image that is warped from frame to frame. Further, the affine transformation (6) might not be able to describe the motion of a complete layer in the presence of strongly non-rigid motion such as a person walking. The algorithm also depends completely on the accuracy of the optical flow estimates since no color or intensity information is used. Finally, the layer construction process makes real-time execution impossible, because a longer sequence of frames is required.
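The pixel assignment rule of the clustering step can be sketched as follows. This is a minimal illustration rather than the published algorithm: it shows a single assignment pass, not the full iterative adaptive K-means loop, and the parameter ordering (a0..a5) with u = a0 + a1*x + a2*y, v = a3 + a4*x + a5*y is an assumed form of the affine transformation (6). The function names are hypothetical.

```python
def affine_flow(params, x, y):
    """Flow vector synthesized at pixel (x, y) from six affine parameters.

    Assumed model: u = a0 + a1*x + a2*y and v = a3 + a4*x + a5*y.
    """
    a0, a1, a2, a3, a4, a5 = params
    return (a0 + a1 * x + a2 * y, a3 + a4 * x + a5 * y)


def assign_layers(flow, hypotheses):
    """Label every pixel with the index of its best-fitting motion hypothesis.

    flow[y][x] holds the estimated optical-flow vector (u, v) at pixel (x, y);
    hypotheses is a list of affine parameter tuples, one per layer.
    """
    labels = []
    for y, row in enumerate(flow):
        label_row = []
        for x, (u, v) in enumerate(row):
            def residual(i):
                # Squared difference between measured and synthesized flow.
                su, sv = affine_flow(hypotheses[i], x, y)
                return (u - su) ** 2 + (v - sv) ** 2

            label_row.append(min(range(len(hypotheses)), key=residual))
        labels.append(label_row)
    return labels
```

Because the rule compares flow vectors only, the labels are exactly as reliable as the optical flow estimates, which is the dependence noted in the paragraph above.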
A double-partition approach based on morphology was suggested by Marques and Molina, 1997. Initially, objects of interest have to be selected interactively, leading to a partition at object level that corresponds to a decomposition into video object planes. These objects are normally not homogeneous in color or motion and are resegmented to obtain a fine partition that is spatially homogeneous. After estimating a dense motion field by block matching, the fine partition is projected onto the next frame using motion compensation. These projected regions are used to extract the markers for the next frame, which is then segmented by the watershed algorithm based on luminance. To improve the temporal stability, the segmentation process is guided by a change detection mask that prevents markers of static areas from overgrowing moving areas and vice versa. Finally, the new object level partition is computed from the projected and segmented fine partition, whereby the algorithm must keep track of the labels of each region to know the correspondence between fine regions and objects. The method is not fully automatic, since some manual selection must be done at the beginning.
Automatic segmentation is formulated by Neri et al., 1998, as the problem of separating moving objects from a static background. In a preliminary stage, potential foreground regions are detected by applying a higher order statistics (HOS) test to a group of interframe differences. The nonzero values in the difference frames are either due to noise or moving objects, with the noise being assumed to be Gaussian in contrast to the moving objects, which are highly structured. In the case of moving background, the frames must first be aligned by motion compensation. For all difference frames, the zero-lag fourth-order moments are calculated because of their capability to suppress Gaussian noise. These moments are then thresholded, resulting in a preliminary segmentation map containing moving objects and uncovered background. To identify uncovered background, the motion analysis stage calculates the displacement of pixels that are marked as changed. The displacement is estimated at different lags from the fourth-order moment maps by block matching. If the displacement of a pixel is zero for all lags, it is classified as background; otherwise it is classified as foreground. Finally, the regularization phase applies morphological opening and closing operators to achieve spatial continuity and to remove small holes inside moving objects of the segmentation map. The resulting segmented foreground objects are slightly too large, because the boundary location is not directly determined from the gray level or edge image. A version of this technique is currently under investigation in the ISO MPEG-4 N2 Core Experiment on Automatic Segmentation Techniques (S. Colonnese, U. Mascia, G. Russo, and P. Talone, in ISO/IEC JTC1/SC29/WG11 MPEG97/m2365, Stockholm, Sweden, July 1997). It has a postprocessor incorporated to improve the boundary location by adjusting the boundaries to spatial edges. The technique is not accurate, and the segments it produces are too large.
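The thresholding of fourth-order moments can be illustrated with the following sketch. It is a simplified stand-in, not the published HOS test: the moment is taken about the local mean of a small square window, and the window radius and threshold are hypothetical parameters chosen for illustration.

```python
def fourth_order_moment_map(diff, radius=1):
    """Local zero-lag fourth-order moment of a difference frame.

    diff is a 2-D list of interframe differences. For each pixel the moment
    is the mean of (d - local_mean)**4 over a (2*radius + 1) square window,
    clipped at the frame border.
    """
    height, width = len(diff), len(diff[0])
    moments = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                diff[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            mean = sum(window) / len(window)
            moments[y][x] = sum((d - mean) ** 4 for d in window) / len(window)
    return moments


def threshold_map(moments, threshold):
    """Preliminary segmentation map: True where the moment exceeds threshold."""
    return [[m > threshold for m in row] for row in moments]
```

Raising differences to the fourth power strongly favors large, structured deviations over small noise values, which is why a single threshold on the moment map can separate the two in this sketch.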
Mech and Wollborn, 1997, generate the video object plane or object mask from an estimated change detection mask (CDM). Initially, a change detection mask is generated by taking the difference between two successive frames using a global threshold. This CDM is then refined in an iterative relaxation that uses a locally adaptive threshold to enforce spatial continuity. Temporal stability is increased by incorporating a memory such that each pixel is labeled as changed if it belonged to an object at least once in the last change detection masks. The simplification step includes a morphological closing and removes small regions to obtain the final CDM. The object mask is calculated from the CDM by eliminating uncovered background and adapting to gray-level edges to improve the location of boundaries. A version of this algorithm is also part of the ISO MPEG-4 N2 Core Experiment (R. Mech and P. Gerken, in ISO/IEC JTC1/SC29/WG11 MPEG97/m1949, Bristol, U.K., 1997). That version contains an additional scene change or cut detector and a global motion estimation and compensation step based on the eight-parameter model (8), and its memory length has been made adaptive.
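The initial thresholding and the memory can be sketched as follows. This is a minimal illustration under simplifying assumptions: the relaxation, the morphological simplification, and the edge adaptation are omitted, the memory operates directly on the masks passed in, and the names change_mask and apply_memory as well as the length parameter are hypothetical.

```python
def change_mask(previous, current, threshold):
    """Initial CDM: True where two successive frames differ by more than
    a global threshold. Frames are 2-D lists of gray levels."""
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(previous, current)
    ]


def apply_memory(masks, length):
    """Mark a pixel as changed if it was changed at least once in the
    last `length` masks (length >= 1); masks is ordered oldest first."""
    recent = masks[-length:]
    height, width = len(recent[0]), len(recent[0][0])
    return [
        [any(mask[y][x] for mask in recent) for x in range(width)]
        for y in range(height)
    ]
```

With a memory of length 1 an object disappears from the mask as soon as it stops moving, while a longer memory keeps it for that many frames, which is the trade-off the memory length controls.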
While the two proposals (S. Colonnese, U. Mascia, G. Russo, and P. Talone, in ISO/IEC JTC1/SC29/WG11 MPEG97/m2365, 1997, and Mech and Gerken /m1949) to the ISO MPEG-4 N2 Core Experiment perform segmentation mainly based on temporal information, J. G. Choi, M. Kim, M. H. Lee, and C. Ahn, in ISO/IEC JTC1/SC29/WG11 MPEG97/m2091, Bristol, U.K., April 1997, presented a spatial morphological segmentation technique. It starts with a global motion estimation and compensation step. The global affine motion parameters (6) are calculated from the correspondence field, which is obtained by a block-matching algorithm. After that, the presence of a scene cut is examined. Then, the actual segmentation commences by simplifying the frame with a morphological open-close by reconstruction filter. The thresholded morphological gradient image, calculated from the luminance and chrominance components of the frame, serves as input for the watershed algorithm that detects the location of the object boundaries. To avoid over-segmentation, regions smaller than a threshold are merged with their neighbors. Finally, a foreground/background decision is made to create the video object planes. Every region for which more than half of its pixels are marked as changed in a change detection mask is assigned to the foreground. To enforce temporal continuity, the segmentation is aligned with that of the previous frame, and those regions for which a majority of pixels belonged to the foreground before are added to the foreground too. This allows tracking an object even when it stops moving for an arbitrary time. In contrast, the techniques of Neri et al., Signal Processing, vol. 66, no. 2, pp. 219–232, 1998, and Mech and Wollborn, ICASSP'97, will lose track after a certain number of frames, depending on the size of the group of frames and memory length, respectively.
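The majority rule used for the foreground/background decision can be sketched as follows. The data layout (a label map and a boolean change detection mask of the same size) and the function name foreground_regions are assumptions made for illustration.

```python
from collections import Counter


def foreground_regions(labels, changed):
    """Return the set of region labels assigned to the foreground.

    labels[y][x] is the region label of a pixel after spatial segmentation;
    changed[y][x] is True where the change detection mask marks the pixel.
    A region is foreground if more than half of its pixels are changed.
    """
    total = Counter()
    hits = Counter()
    for label_row, changed_row in zip(labels, changed):
        for label, is_changed in zip(label_row, changed_row):
            total[label] += 1
            if is_changed:
                hits[label] += 1
    return {label for label in total if 2 * hits[label] > total[label]}
```

The same function could serve the temporal continuity step by passing the previous frame's aligned foreground mask in place of the change detection mask and taking the union of the two results.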
A combination of the two temporal segmentation techniques (Colonnese et al. /m2365, Mech and Gerken /m1949) with the spatial segmentation method (Choi et al., 1997) to form one algorithm is currently under investigation (P. Gerken, R. Mech, G. Russo, S. Colonnese, C. Ahn, and M. H. Lee, in ISO/IEC JTC1/SC29/WG11 MPEG97/m1948, Bristol, U.K., April 1997, and J. G. Choi, M. Kim, M. H. Lee, C. Ahn, S. Colonnese, U. Mascia, G. Russo, P. Talone, R. Mech, and M. Wollborn, in ISO/IEC JTC1/SC29/WG11 MPEG97/m2383, Stockholm, Sweden, July 1997).
A new video object plane segmentation algorithm based on Hausdorff object tracking is an extension of the technique by Meier and Ngan submitted to the ISO MPEG-4 N2 Core Experiment (Meier and Ngan, 1997). The core of the algorithm in T. Meier, K. N. Ngan, IEEE Trans. on Circuits and Syst. for Video Technol., 8:5, 525–538, 1998, is an object tracker that matches a 2-D binary model of the object against subsequent frames using the Hausdorff distance. The best match found indicates the translation the object has undergone, and the model is updated every frame to accommodate rotation and changes in shape. The initial model is derived automatically, and a new model update method based on the concept of moving connected components allows for comparatively large changes in shape. Optical flow or motion fields could be used, but they are extremely noise sensitive, and their accuracy is limited due to the aperture and occlusion problems.
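Matching a binary model by Hausdorff distance can be sketched as follows. The sketch uses the classical Hausdorff distance and an exhaustive search over a small window of integer translations; the cited tracker may use a different variant of the distance and a different search strategy, and the names and the search parameter here are hypothetical.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point of a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Classical (symmetric) Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def best_translation(model, edges, search=2):
    """Integer translation (dx, dy) within +/- search that best places the
    model points onto the edge points of the next frame.

    The directed distance from model to frame is used so that edge points
    belonging to clutter elsewhere in the frame do not penalize a match.
    """
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            shifted = [(x + dx, y + dy) for x, y in model]
            distance = directed_hausdorff(shifted, edges)
            if best is None or distance < best[0]:
                best = (distance, (dx, dy))
    return best[1]
```

Both point sets are lists of (x, y) tuples, for example the coordinates of edge pixels in the binary model and in the current frame.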
Video Standards. MPEG-4: Incapable of Automatic Extraction of Objects
Object-based coding is one of the distinctive features of the MPEG-4 standard and sets it apart from previous standards such as MPEG-1 and MPEG-2. Recently, there has been growing interest in segmentation for content-based video coding. This is mainly due to the development of MPEG-4 (ISO/IEC 14496-2, “Information technology—Coding of audio-visual objects, Part 2: Visual, Amendment 1: Visual extensions”, Doc. ISO/IEC JTC1/SC29/WG11 N3056, December 1999; ISO/IEC 14496-2, “MPEG-4 Video verification model version 15.0”, ISO/IEC JTC1/SC29/WG11 N3093, December 1999; MPEG AOE Sub Group, “MPEG-4 proposal package description (PPD)—revision 3”, ISO/IEC JTC1/SC29/WG11 MPEG95/N0998, July 1995), which is set to become the new video coding standard for multimedia communication. The MPEG-4 proposal package description identified key functionalities that were not supported, or not well supported, by existing video coding standards and should be supported by MPEG-4. These include content-based interactivity, hybrid natural and synthetic data coding, and content-based scalability, to name a few.
To provide these content-based functionalities, MPEG-4 relies on a content-based representation of audio-visual objects. It treats a scene as a composition of several objects that are separately encoded and decoded. This requires a prior decomposition of video sequences into VOPs. Such VOPs will normally be of arbitrary shape. However, a VOP can also consist of the whole frame if no content-based functionalities are required or to guarantee backward compatibility with MPEG-1 and MPEG-2.
Decomposing video sequences into VOPs is in many cases very difficult. If there is only one VOP consisting of the whole rectangular frame, as in current video coding standards, then no explicit segmentation is necessary. The same applies to computer-generated synthetic objects. In most other cases, however, the VOP definition must be performed by some sort of preprocessing. This can be done by automatic or semiautomatic segmentation, manually, or using blue screen (chroma key) technology. The latter method has some shortcomings. It is mainly limited to studio scenes and excludes blue objects. Manual segmentation, on the other hand, is often too time consuming.
Partitioning a video sequence into VOPs by means of automatic or semiautomatic segmentation is a very challenging task. An intrinsic problem of VOP generation is that objects of interest are not homogeneous with respect to low-level features, such as color, intensity, or optical flow. Instead, VOP segmentation involves higher level semantic concepts. Hence, conventional low-level segmentation algorithms will fail to obtain meaningful partitions. At the moment, we are not aware of any algorithm that can automatically perform VOP segmentation accurately and reliably for generic video sequences. The main difficulty is to formulate semantic concepts in a form suitable for a segmentation algorithm. Semiautomatic techniques that get some input from humans, for example, by tuning a few parameters, can significantly improve the segmentation result (J. G. Choi, M. Kim, J. Kwak, M. H. Lee, and C. Ahn, ISO/IEC JTC1/SC29/WG11 MPEG98/m3349, 1998; S. Colonnese, and G. Russo, ISO/IEC JTC1/SC29/WG11 MPEG98/m3320, 1998; C. Gu and M. C. Lee, in IEEE Int. Conf. Image Processing, ICIP'97, Santa Barbara, Calif., vol. II, pp. 514–517, 1997). Currently, this appears to be the most promising approach unless a very constrained situation is present. The most important cue exploited by a majority of techniques is motion. Physical objects are often characterized by a coherent motion that is different from that of the background.
So-called change detection masks (CDMs) and estimated flow fields are the most common forms of motion information incorporated into the segmentation process. CDMs have some major drawbacks for VOP segmentation. Normally, only the occlusion zones associated with moving objects are marked as changed, but not the interior of such objects. The estimated flow field, on the other hand, demonstrates how difficult it can be to group pixels into objects based on the similarity of their flow vectors. In either case, it seems inevitable that additional information such as color or intensity must be included to accurately detect boundaries of moving objects.
Classical motion segmentation algorithms attempt to partition frames into regions of similar intensity, color, and/or motion characteristics (Adiv, 1985, Hotter and Thoma, 1988, Murray and Buxton, 1987). Many of these were inspired by the so-called second-generation coding techniques, with different objectives from those of VOP segmentation. Segmentation algorithms that specifically address VOP generation have also been proposed (Choi et al., /m2091, 1997, Colonnese and Russo, /m3320, 1998, Choi et al., /m3349, 1998, C. Gu and M. C. Lee, Semantic video, in IEEE Int. Conf. Image Processing, ICIP'97, Santa Barbara, Calif., vol. II, pp. 514–517, October 1997, R. Mech and M. Wollborn, A noise robust method for segmentation of moving objects in video sequences, in IEEE Int. Conf. Acoust., Speech, Signal Processing, ICASSP'97, Munich, Germany, vol. 4, pp. 2657–2660, April 1997, A. Neri, S. Colonnese, G. Russo, and P. Talone, Automatic moving object and background separation, Signal Processing, vol. 66, no. 2, pp. 219–232, 1998), many of them in the framework of the ISO MPEG-4 N2 Core Experiment on Automatic Segmentation Techniques.
The proposals in (Mech and Wollborn, 1997, Neri et al., Signal Processing, 1998) employ change detection masks and create one object for each area in the frame that is moving differently from the background. A spatial morphological segmentation technique is presented in Choi et al., /m2091, 1997. The foreground/background decision is also made based on a CDM. To this end, regions for which a majority of pixels are classified as changed are assigned to the foreground. In Choi et al., /m3349, 1998, and Gu and Lee, ICIP'97, 1997, a user initially has to select objects in the scene by manual segmentation. These VOPs are then tracked and updated in successive frames. The usefulness of user interaction to incorporate high-level information has also been reported in Colonnese and Russo, /m3320, 1998. The performance of the segmentation algorithm is improved by letting a user tune a few crucial parameters on a frame-by-frame basis. In addition, the user is able to select an area containing the object of interest. This allows the algorithm to estimate critical parameters only on the region with the object instead of the whole image that might consist of several regions with different characteristics.
Video Compression
Video communication (television, teleconferencing, and so forth) typically transmits a stream of video frames (images) along with audio over a transmission channel for real-time viewing and listening by a receiver. However, transmission channels frequently add corrupting noise and have limited bandwidth (such as cellular phone and wireless networking channels). Consequently, digital video transmission with compression enjoys widespread use. In particular, various standards for compression of digital video have emerged and include H.26X (H.261, H.263, H.263+, H.26L), MPEG-1, MPEG-2, and MPEG-4, with more to follow, including MPEG-7, which is in development. There are similar audio compression methods such as CELP and MELP. These standards are described in Tekalp, Academic Press 1995.
H.261 compression uses interframe prediction to reduce temporal redundancy and discrete cosine transform (DCT) on a block level together with high spatial frequency cutoff to reduce spatial redundancy. H.261 is recommended for use with transmission rates in multiples of 64 Kbps (kilobits per second) to 2 Mbps (megabits per second).
H.263 is analogous to H.261 but targets bitrates of about 22 Kbps (twisted pair telephone wire compatible). It uses motion estimation at half-pixel accuracy (which eliminates the need for the loop filtering available in H.261), overlapped motion compensation to obtain a denser motion field (set of motion vectors) at the expense of more computation, and adaptive switching between motion compensation with 16 by 16 macroblocks and 8 by 8 blocks.
MPEG-1 and MPEG-2 also use temporal prediction followed by two-dimensional DCT transformation on a block level, as in H.261, but they make further use of various combinations of motion-compensated prediction, interpolation, and intraframe coding. MPEG-1 aims at video CDs and works well at rates of about 1–1.5 Mbps for frames of about 360 pixels by 240 lines and 24–30 frames per second. MPEG-1 defines I, P, and B frames, with I frames coded intraframe, P frames coded using motion-compensated prediction from previous I or P frames, and B frames coded using motion-compensated bidirectional prediction/interpolation from adjacent I and P frames.
MPEG-2 aims at digital television (720 pixels by 480 lines) and uses bitrates up to about 10 Mbps with MPEG-1 type motion compensation with I, P, and B frames, and it adds scalability (a lower bitrate may be extracted to transmit a lower resolution image).
However, the foregoing MPEG compression methods result in a number of unacceptable artifacts such as blockiness and unnatural object motion when operated at very-low-bit-rates. Because these techniques use only the statistical dependencies in the signal at a block level and do not consider the semantic content of the video stream, artifacts are introduced at the block boundaries under very-low-bit-rates (high quantization factors). Usually these block boundaries do not correspond to physical boundaries of the moving objects and hence visually annoying artifacts result. Unnatural motion arises when the limited bandwidth forces the frame rate to fall below that required for smooth motion.
MPEG-4 is to apply to transmission bit rates of 10 Kbps to 1 Mbps and is to use a content-based coding approach with functionalities such as scalability, content-based manipulations, robustness in error prone environments, multimedia data access tools, improved coding efficiency, ability to encode both graphics and video, and improved random access. A video coding scheme is considered content scalable if the number and/or quality of simultaneous objects coded can be varied. Object scalability refers to controlling the number of simultaneous objects coded and quality scalability refers to controlling the spatial and/or temporal resolutions of the coded objects. Scalability is an important feature for video coding methods operating across transmission channels of limited bandwidth and also channels where the bandwidth is dynamic. For example, a content-scalable video coder has the ability to optimize the performance in the face of limited bandwidth by encoding and transmitting only the important objects in the scene at a high quality. It can then choose to either drop the remaining objects or code them at a much lower quality. When the bandwidth of the channel increases, the coder can then transmit additional bits to improve the quality of the poorly coded objects or restore the missing objects.
In order to achieve efficient transmission of video, a system must utilize compression schemes that are bandwidth efficient. The compressed video data is then transmitted over communication channels, which are prone to errors. For video coding schemes that exploit temporal correlation in the video data, channel errors result in the decoder losing synchronization with the encoder. Unless suitably dealt with, this can result in noticeable degradation of the picture quality. To maintain satisfactory video quality or quality of service, it is desirable to use schemes to protect the data from these channel errors. However, error protection schemes come with the price of an increased bit rate. Moreover, it is not possible to correct all possible errors using a given error-control code. Hence, it becomes necessary to resort to some other techniques in addition to error control to effectively remove annoying and visually disturbing artifacts introduced by these channel induced errors.
In fact, a typical channel, such as a wireless channel, over which compressed video is transmitted is characterized by high random bit error rates (BER) and multiple burst errors. The random bit errors occur with a probability of around 0.001 and the burst errors have a duration that usually lasts up to 24 milliseconds (msec).
Error correcting codes such as the Reed-Solomon (RS) codes correct random errors up to a designed number per block of code symbols. Problems arise when codes are used over channels prone to burst errors because the errors tend to be clustered in a small number of received symbols. The commercial digital music compact disc (CD) uses interleaved codewords so that channel bursts may be spread out over multiple codewords upon decoding. In particular, the CD error control encoder uses two shortened RS codes with 8-bit symbols from the code alphabet GF(256). Thus 16-bit sound samples each take two information symbols. First, the samples are encoded twelve at a time (thus 24 symbols) by a (28,24) RS code, then the 28-symbol code-words pass a 28-branch interleaver with delay increments of 28 symbols between branches. Thus 28 successive 28-symbol code-words are interleaved symbol by symbol. After the interleaving, the 28-symbol blocks are encoded with a (32,28) RS coder to output 32-symbol code-words for transmission. The decoder is a mirror image: a (32,28) RS decoder, 28-branch de-interleaver with delay increment 4 symbols, and a (28,24) RS decoder. The (32,28) RS decoder can correct 1 error in an input 32-symbol codeword and can output 28 erased symbols for two or more errors in the 32-symbol input codeword. The de-interleaver then spreads these erased symbols over 28 code-words. The (28,24) RS decoder is set to detect up to and including 4 symbol errors which are then replaced with erased symbols in the 24-symbol output words; for 5 or more errors, all 24 symbols are erased. This corresponds to erased music samples. The decoder may interpolate the erased music samples with adjacent samples.
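The way interleaving spreads a channel burst over many codewords can be illustrated with the following sketch of a delay-line (convolutional) interleaver. It is a generic illustration with small parameters, not the CD encoder itself: the Reed-Solomon coding stages are omitted, and the function names, the delay parameter, and the fill value are hypothetical.

```python
def interleave(codewords, delay, fill=None):
    """Delay-line interleaver: branch i delays symbol i of every codeword
    by i * delay codeword periods. Returns the transmitted frames."""
    n = len(codewords[0])
    total = len(codewords) + (n - 1) * delay
    frames = []
    for t in range(total):
        frame = []
        for i in range(n):
            source = t - i * delay
            in_range = 0 <= source < len(codewords)
            frame.append(codewords[source][i] if in_range else fill)
        frames.append(frame)
    return frames


def deinterleave(frames, delay, count):
    """Mirror image of interleave: recover `count` codewords."""
    n = len(frames[0])
    return [[frames[t + i * delay][i] for i in range(n)] for t in range(count)]


# A burst that wipes out one whole transmitted frame...
codewords = [[(t, i) for i in range(4)] for t in range(6)]
frames = interleave(codewords, delay=1)
frames[3] = ["X"] * 4
# ...is spread so that several codewords each see a single damaged symbol.
damaged = deinterleave(frames, delay=1, count=6)
```

After de-interleaving, each affected codeword contains few enough damaged symbols that a code correcting a small number of errors per codeword has a chance of recovering it, which is the purpose of placing the interleaver between the two coding stages.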
A number of patents have been issued on subjects related to the present invention. U.S. Pat. No. 5,048,095 discloses an adaptive image segmentation system that incorporates a closed-loop feedback mechanism in the segmentation/learning cycle. The system can adapt to changes appearing in the images being segmented. It uses a genetic algorithm to optimize the parameters of a pixel-histogram-based segmentation. U.S. Pat. No. 6,026,182 discloses feature segmentation, i.e., it teaches video compression based on segmenting objects and determining motion vectors for the segmented objects. The method is not fully automatic and requires user interaction. U.S. Pat. No. 5,764,792 discloses identification of rare biological cells in an image from their color, using color histograms to generate masks. U.S. Pat. No. 5,859,891 teaches an interactive method of object extraction, similar to that in U.S. Pat. No. 6,026,182, in which a user draws a polygon in a region of interest, and a computer expands the polygon to include all pixels whose gray scale levels resemble the gray scale levels of pixels already within the polygon. U.S. Pat. No. 5,949,905 identifies an adhesive in an image of a printed circuit board, using gray value histograms and a priori information. None of these prior art patents provides robust and stable automatic object extraction and segmentation.
There is thus a widely recognized need for, and it would be highly advantageous to have, a method of robust and stable automatic object extraction for segmentation of video frames that is independent of the nature of the image, and does not depend on any specific input video sequences. There is also a widely recognized need for, and it would be highly advantageous to have, a method of robust and stable object extraction for segmentation of video frames that does not require any prior knowledge of the content of the input video sequences. There is also a need for, and it would be advantageous to have, a method based on algorithms that are fast, do not consume a lot of computer resources, do not depend on statistical methods, do not produce over-segmentation, and thus enable and provide: adaptive bit allocation for video compression, interactive TV, efficient image representation, quality of service (QoS) and differentiated services (DiffServ) over diverse communication networks (narrowband and broadband), video streaming, surveillance, gaming and web caching.