1. Technical Field
The invention is related to a system and method for segmenting image and video data. More particularly, this invention is related to a system and method for segmenting image and video data using an anisotropic kernel mean shift technique.
2. Related Art
Image segmentation refers to identifying homogeneous regions in an image, while video segmentation refers to the joint spatial and temporal analysis of video sequences to extract regions in dynamic scenes. Both of these tasks are deceptively difficult and have been extensively studied for several decades. Generally, spatio-temporal video segmentation can be viewed as an extension of image segmentation from a 2D to a 3D lattice. Recently, mean shift based image and video segmentation has gained considerable attention due to its promising performance.
Many other data clustering methods have been described in the literature, ranging from top down methods such as K-D trees, to bottom up methods such as K-means and more general statistical methods such as mixtures of Gaussians. In general, these methods have not performed satisfactorily for segmenting image data due to their reliance on an a priori parametric structure of the data segment, and/or estimates of the number of segments expected. The appeal of image and video segmentation using mean shift is derived from both its performance and its relative freedom from specifying an expected number of segments. This freedom has come at the cost of having to specify the size (bandwidth) and shape of an influence kernel for each pixel in advance.
The difficulty in selecting the kernel for mean shift segmentation was recognized in [3,4] and was addressed by automatically determining a bandwidth for spherical kernels.
Rather than begin from an initial guess at the segmentation, such as seeding points as in K-means, mean shift begins at each data point (or pixel in an image or video) and first estimates the local density of similar pixels (i.e., the density of nearby pixels with similar color). Carefully defining “nearby” and “similar” can have an important impact on the results. This is the role the kernel plays. More specifically, mean shift algorithms estimate the local density gradient of similar pixels. These gradient estimates are used within an iterative procedure to find the peaks in the local density. All pixels that are drawn upwards to the same peak are then considered to be members of the same segment.
As a general nonparametric density estimator, mean shift is an old pattern recognition procedure proposed by Fukunaga and Hostetler [7], and its efficacy on low-level vision tasks such as segmentation and tracking has been extensively exploited recently. In [1,5], it was applied for continuity preserving filtering and image segmentation. Its properties were reviewed and its convergence on lattices was proven. In [2], it was used for non-rigid object tracking and a sufficient convergence condition was given. Applying mean shift on a 3D lattice to obtain a spatio-temporal segmentation of video was achieved in [6], in which a hierarchical strategy was employed to cluster the pixels of a 3D space-time video stack, which were mapped to 7D feature points (position(2), time(1), color(3), and motion(1)).
The application of mean shift to an image or video consists of two stages. The first stage is to define a kernel of influence for each pixel xi. This kernel defines a measure of intuitive distance between pixels, where distance encompasses both spatial (and temporal in the case of video) as well as color distance. All the approaches described above used a simple static radially symmetric kernel for the mean shift procedure.
The second stage first assigns to each pixel a mean shift point, M(xi), initialized to coincide with the pixel. These mean shift points are then iteratively moved upwards along the gradient of the density function defined by the sum of all the kernels until they reach a stationary point (a mode or hilltop on the virtual terrain defined by the kernels). The pixels associated with the set of mean shift points that migrate to the (approximately) same stationary point are considered to be members of a single segment. Neighboring segments may then be combined in a post process.
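The two stages described above can be illustrated with a minimal sketch. It is not the implementation of any cited reference: it assumes a flat (uniform) window of radius h as the radially symmetric kernel, treats each pixel as a plain feature tuple, and uses hypothetical helper names (mean_shift_modes, label_segments) and an illustrative merge tolerance.

```python
import math

def mean_shift_modes(points, h, tol=1e-6, max_iter=200):
    """Move one mean shift point per data point uphill until it stops moving."""
    modes = []
    for start in points:
        m = list(start)  # M(x_i) is initialized to coincide with the pixel
        for _ in range(max_iter):
            # Data points inside the flat window of radius h centered at m.
            near = [p for p in points if math.dist(p, m) <= h]
            if not near:  # defensive guard; not expected for this window
                break
            new_m = [sum(coord) / len(near) for coord in zip(*near)]
            shift = math.dist(new_m, m)
            m = new_m
            if shift < tol:  # stationary point reached
                break
        modes.append(m)
    return modes

def label_segments(modes, merge_tol):
    """Give one label to all mean shift points that stopped at about the same place."""
    centers, labels = [], []
    for m in modes:
        for k, c in enumerate(centers):
            if math.dist(m, c) <= merge_tol:
                labels.append(k)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels

# Six 1D "pixels" forming two groups (illustrative data only).
pixels = [(1.0,), (1.2,), (0.9,), (5.0,), (5.1,), (4.8,)]
labels = label_segments(mean_shift_modes(pixels, h=1.0), merge_tol=0.1)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In this sketch the window radius h plays the role of the bandwidth, so changing it changes which pixels end up in the same segment.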
Mathematically, the general multivariate kernel density estimate at the point, x, is defined by
$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i) \qquad (1)$$

where the n data points $x_i$ represent a sample from some unknown density f, or in the case of images or video, the pixels themselves, and

$$K_H(x) = |H|^{-1/2} K(H^{-1/2} x) \qquad (2)$$

where K(z) is the d-variate kernel function with compact support satisfying the regularity constraints as described in [13], and H is a symmetric positive definite d×d bandwidth matrix. For the radially symmetric kernel, one has

$$K(z) = c\,k(\|z\|^2) \qquad (3)$$

where c is the normalization constant. A common practice when applying a mean shift procedure on an image or video lattice is to assume a global spherical bandwidth, $H = h^2 I$. In this way the kernel density estimator becomes

$$\hat{f}(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) \qquad (4)$$
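As a rough illustration of Eq. (4), the following sketch evaluates the estimator for d = 1. The Epanechnikov kernel, with profile k(u) = 1 - u on [0, 1] and c = 3/4, is an assumed choice of compactly supported radially symmetric kernel, and the function names are hypothetical.

```python
def epanechnikov(z):
    """1D kernel K(z) = c * k(z^2), with profile k(u) = 1 - u on [0, 1] and c = 3/4."""
    u = z * z
    return 0.75 * (1.0 - u) if u <= 1.0 else 0.0

def kde(x, samples, h):
    """Eq. (4) with d = 1: f_hat(x) = 1 / (n * h) * sum_i K((x - x_i) / h)."""
    n = len(samples)
    return sum(epanechnikov((x - xi) / h) for xi in samples) / (n * h)

# Illustrative samples: the estimate is positive near the data and zero
# wherever no sample lies within one bandwidth of the query point.
samples = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
print(kde(1.0, samples, h=0.5) > 0.0)   # True
print(kde(3.0, samples, h=0.5) == 0.0)  # True
```

A larger h spreads each sample's influence over a wider interval and smooths the estimate, while a smaller h keeps more local detail.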
For image and video segmentation, the feature space is composed of two independent domains: the spatial/lattice domain and the range/color domain. One maps a pixel to a multi-dimensional feature point which includes the p-dimensional spatial lattice (p=2 for image and p=3 for video) and the q-dimensional color (q=3 for the L*u*v* color space). Due to the different natures of the domains, the kernel is usually broken into the product of two different radially symmetric kernels (superscript s will refer to the spatial domain, and r to the color range):
$$K_{h^s,h^r}(x) = \frac{c}{(h^s)^p (h^r)^q}\, k^s\!\left(\left\|\frac{x^s}{h^s}\right\|^2\right) k^r\!\left(\left\|\frac{x^r}{h^r}\right\|^2\right) \qquad (5)$$

where $x^s$ and $x^r$ are respectively the spatial and range parts of a feature vector; $k^s$ and $k^r$ are the profiles used in the two domains; $h^s$ and $h^r$ are the bandwidths employed in the two domains; and c is the normalization constant. With the kernel from (5), the kernel density estimator is
$$\hat{f}(x) = \frac{c}{n (h^s)^p (h^r)^q} \sum_{i=1}^{n} k^s\!\left(\left\|\frac{x^s - x_i^s}{h^s}\right\|^2\right) k^r\!\left(\left\|\frac{x^r - x_i^r}{h^r}\right\|^2\right) \qquad (6)$$

As apparent in Eqns. (5) and (6), there are two main parameters that have to be defined by the user for the static radially symmetric kernel based approach: the spatial bandwidth $h^s$ and the range bandwidth $h^r$. Although manual bandwidth selection can produce satisfactory results on general image segmentation, it has a significant limitation: the algorithm is sensitive to the initial bandwidths. When local characteristics of the feature space differ significantly across the data, it is difficult to select globally optimal bandwidths. As a result, in the segmented image some objects may appear too coarse while others are too fine. Two efforts still using radially symmetric kernels have been reported to address this problem. Singh and Ahuja [12] first determine local bandwidths using Parzen windows to estimate local density. Another variable bandwidth mean shift procedure was proposed in [3], in which the estimator (6) is changed to
$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{c}{(h_i^s)^p (h_i^r)^q}\, k^s\!\left(\left\|\frac{x^s - x_i^s}{h_i^s}\right\|^2\right) k^r\!\left(\left\|\frac{x^r - x_i^r}{h_i^r}\right\|^2\right) \qquad (7)$$

There are two important differences between (6) and (7). First, potentially different bandwidths $h_i^s$ and $h_i^r$ are assigned to each pixel, $x_i$, as indicated by the subscript i. References [3] and [4] offer a data driven way to select a different set of bandwidth parameters to obtain an optimal tradeoff between bias and variance when estimating $\hat{f}$. Second, the different bandwidths associated with each point appear within the summation. This is the so-called sample point estimator [3], as opposed to the balloon estimator defined in Equation (6). The sample point estimator ensures that all pixels respond to the same global density estimation during the segmentation procedure. Note that the sample point and balloon estimators are the same in the case of a single globally applied bandwidth.
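The difference between Equations (6) and (7) can be made concrete with a short sketch. The assumptions here are illustrative only: both profiles are taken to be k(u) = 1 - u on [0, 1], the constant c defaults to 1 (so the values are unnormalized), each pixel is a pair of tuples (spatial part, range part), and all function names are hypothetical.

```python
def profile(u):
    """Assumed profile k(u) = 1 - u on [0, 1], zero elsewhere."""
    return 1.0 - u if 0.0 <= u <= 1.0 else 0.0

def sq_norm(a, b, h):
    """|| (a - b) / h ||^2 for two equal-length tuples."""
    return sum(((ai - bi) / h) ** 2 for ai, bi in zip(a, b))

def balloon_estimate(x, pixels, hs, hr, c=1.0):
    """Eq. (6): one global (hs, hr) pair, factored out of the summation."""
    xs, xr = x
    p, q = len(xs), len(xr)
    total = sum(profile(sq_norm(xs, ps, hs)) * profile(sq_norm(xr, pr, hr))
                for ps, pr in pixels)
    return c * total / (len(pixels) * hs ** p * hr ** q)

def sample_point_estimate(x, pixels, bandwidths, c=1.0):
    """Eq. (7): pixel i carries its own (hs_i, hr_i) inside the summation."""
    xs, xr = x
    p, q = len(xs), len(xr)
    total = 0.0
    for (ps, pr), (hs_i, hr_i) in zip(pixels, bandwidths):
        weight = c / (hs_i ** p * hr_i ** q)
        total += weight * profile(sq_norm(xs, ps, hs_i)) * profile(sq_norm(xr, pr, hr_i))
    return total / len(pixels)

# Three pixels with 2D position and 3D color (made-up values).
pixels = [((0.0, 0.0), (10.0, 0.0, 0.0)),
          ((1.0, 0.0), (11.0, 0.0, 0.0)),
          ((5.0, 5.0), (80.0, 0.0, 0.0))]
x = ((0.5, 0.0), (10.5, 0.0, 0.0))

same = sample_point_estimate(x, pixels, [(2.0, 4.0)] * 3)
print(abs(same - balloon_estimate(x, pixels, 2.0, 4.0)) < 1e-12)  # True
```

With a single global bandwidth pair the two estimators coincide, as noted above. Once the per-pixel bandwidths differ, each pixel contributes a kernel of its own size and weight, and the two estimates generally differ.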
Advantages of the variable bandwidth over the fixed bandwidth mean shift were demonstrated on synthetic 1D mixtures of Gaussians in which some Gaussians were more heavily sampled than others [3]. In particular, larger bandwidths are selected in sparse regions to overcome the effects of noise. The differences on general images were discussed briefly and video applications were left for future work.
During the iterative stage of the mean shift procedure, the mean shift points associated with each pixel climb to the hilltops of the density function. At each iteration, each mean shift point is attracted in varying amounts by the sample point kernels centered at nearby pixels. More intuitively, a kernel represents a measure of the likelihood that other points are part of the same segment as the point under the kernel's center. With no a priori knowledge of the image or video, actual distance (in space, time, and color) seems an obvious (inverse) correlate for this likelihood; the closer two pixels are to one another the more likely they are to be in the same segment.
Although the previous mean shift segmentation techniques are effective, those that employ a radially symmetric kernel have some disadvantages. Radially symmetric kernels do not adapt well to non-compact (i.e., long skinny) local features. Such features are even more prevalent in video data from stationary or from slowly or linearly moving cameras. When considering video data, a spatio-temporal slice (parallel to the temporal axis) is as representative of the underlying data as any single frame (orthogonal to the temporal axis). Such a slice of video data exhibits stripes with a slope relative to the speed at which objects move across the visual field. In particular, a still background will show as vertical stripes in a spatio-temporal slice. The problems in the use of radially symmetric kernels are particularly apparent in these spatio-temporal slice segmentations. The irregular boundaries between and across the stripe-like features cause a lack of temporal coherence in the video segmentation.
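The stripe structure of a spatio-temporal slice can be seen in a small synthetic example. This is only a sketch under stated assumptions: the video is a nested list indexed as video[t][y][x], the scene is a dark background with a single bright column moving at a constant integer speed, and the helper names are hypothetical.

```python
def make_video(num_frames, height, width, speed):
    """Synthetic grayscale stack video[t][y][x]: dark background plus one bright
    column that starts at x = 0 and moves right by `speed` pixels per frame."""
    video = []
    for t in range(num_frames):
        frame = [[0] * width for _ in range(height)]
        x_obj = t * speed
        if x_obj < width:
            for row in frame:
                row[x_obj] = 255
        video.append(frame)
    return video

def xt_slice(video, y):
    """Slice parallel to the temporal axis: row y taken from every frame."""
    return [frame[y] for frame in video]

# One row of the slice per frame; record where the bright pixel sits.
moving = xt_slice(make_video(4, 3, 8, speed=2), y=1)
still = xt_slice(make_video(4, 3, 8, speed=0), y=1)
print([row.index(255) for row in moving])  # [0, 2, 4, 6]
print([row.index(255) for row in still])   # [0, 0, 0, 0]
```

With time on the vertical axis, the stationary column traces a vertical stripe and the moving one traces a stripe whose slope depends on its speed. These are long, thin features of the kind a radially symmetric kernel covers poorly.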
Therefore, what is needed is a system and method for segmenting image and video data that accurately segments non-compact objects.
It is noted that in the remainder of this specification, as well as in the paragraphs above, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, “reference [1]” or simply “[1]”. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.