Reliable data-driven bandwidth (or “scale”) selection for kernel-based nonparametric analysis of multivariate data is complex and largely unanswered by the current techniques. Depending on the prior knowledge on input data, two classes of problems can be distinguished. If the data statistics are homogeneous, then one global bandwidth suffices for the analysis. If, however, the data statistics are changing across the feature space, local bandwidths should be computed. Unfortunately, most of the tasks encountered in autonomous vision reduce to the latter class of problems, i.e., the input is represented by multidimensional features, whose properties are variable in space (and might change in time). Examples of such tasks are background modeling, tracking, or segmentation.
Statistical methods compute the global bandwidth as the bandwidth that achieves the best balance between the bias and variance of the density estimate obtained with that bandwidth, over the entire space. For the univariate, case, a reliable method for computing a global bandwidth is the known “plug-in rule” (see, e.g., S. J. Sheather, et al., “A Reliable Data-based Bandwidth Selection Method for Kernel Density Estimation”, J. Royal Statist. Soc. B. 53(3):683 690. 1991)), which has been shown to be superior to least squares cross validation and biased cross-validation estimation methods. The only assumption with the plug-in rule is the smoothness of the underlying density. Although the plug-in rule may be used to efficiently compute the global bandwidth, the global bandwidth is not effective when data exhibits multi-scale patterns. In addition, for the multivariate case, the optimal bandwidth formula is of little practical use, since it depends on the Laplacian of the unknown density being estimated.
Another global bandwidth selection approach relates to the stability of the decomposition. The bandwidth is taken as the center of the largest operating range over which the same number of partitions are obtained for the given data. This strategy is also implemented within the framework of scale-space theory and relies on the space homogeneity assumption. All the partitions should have roughly the same scale, which is not always true.
A commonly used method for computing local bandwidths follows Abramson's rule which takes the bandwidth proportional to the inverse of the square root of a first approximation of the local density (see, e.g., Abramson, “On Bandwidth Variation in Kernel Estimates—A Square Root Law”, The Annals of Statistics, 10(4):1217–1223, 1982). The proportionality constant is an important choice of the method.
In a different class of techniques, the optimal bandwidth maximizes an objective function, which expresses the quality of the decomposition and is called index of cluster validity. The objective function compares inter-versus intra-cluster variability, or evaluates the isolation and connectivity of the delineated clusters. Nevertheless, the choice of the objective function is most often empirical and lacks statistical significance.
The following discusses the fixed bandwidth kernel density estimation method (see, e.g., D. W. Scott, “Multivariate Density Estimation”, New York, Wiley, 1992). Given a set
      {          x      i        }        i    =          1      ⁢                          ⁢      …      ⁢                          ⁢      n      of n points in a d-dimensional space Rd, the multivariate fixed bandwidth kernel density estimate with kernel K(x) and window radius (bandwidth) h, computed in the point x is defined by
                                          f            ^                    ⁡                      (            x            )                          =                              1                          n              ⁢                                                          ⁢                              h                d                                              ⁢                                    ∑                              i                =                1                            n                        ⁢                                                  ⁢                          K              ⁡                              (                                                      x                    -                                          x                      i                                                        h                                )                                                                        (        1        )            where the d-dimensional vectors
      {          x      i        }        i    =          1      ⁢                          ⁢      …      ⁢                          ⁢      n      represent a random sample from some unknown density f and the kernel, K, is taken to be a radially symmetric, non-negative function centered at zero and integrating to one. The terminology fixed bandwidth is due to the fact that h is held constant across x∈Rd. As a result, the fixed bandwidth procedure (1) estimates the density at each point x by taking the average of identically scaled kernels centered at each of the data points.
For pointwise estimation, the classical measure of the closeness of the estimator {circumflex over (f)} to its target value f is the mean squared error (MSE), equal to the sum of the variance and squared bias:
                              M          ⁢                                          ⁢          S          ⁢                                          ⁢                      E            ⁡                          (              x              )                                      =                                            E              ⁡                              [                                                                            f                      ^                                        ⁡                                          (                      x                      )                                                        -                                      f                    ⁡                                          (                      x                      )                                                                      ]                                      2                    =                                    V              ⁢                                                          ⁢              a              ⁢                                                          ⁢              r              ⁢                                                          ⁢                              (                                                      f                    ^                                    ⁡                                      (                    x                    )                                                  )                                      +                                          [                                  B                  ⁢                                                                          ⁢                  i                  ⁢                                                                          ⁢                  a                  ⁢                                                                          ⁢                                      s                    ⁡                                          (                                                                        f                          ^                                                ⁡                                                  (                          x                          )                                                                    )                                                                      ]                            2                                                          (        2        )            
Using the multivariate form of the Taylor theorem, the bias and the variance are approximated by:
                                          B            ⁢                                                  ⁢            i            ⁢                                                  ⁢            a            ⁢                                                  ⁢                          s              ⁡                              (                x                )                                              ≈                                    1              2                        ⁢                          h              2                        ⁢                                          μ                2                            ⁡                              (                K                )                                      ⁢            Δ            ⁢                                                  ⁢                          f              ⁡                              (                x                )                                                    ⁢                                  ⁢                                  ⁢        and                            (        3        )                                          Var          ⁡                      (            x            )                          ≈                              n                          -              1                                ⁢                      h                          -              d                                ⁢                      R            ⁡                          (              K              )                                ⁢                      f            ⁡                          (              x              )                                                          (        4        )            where
            μ      2        ⁡          (      K      )        =      ∫                  z        1        2            ⁢              K        ⁡                  (          z          )                    ⁢              ⅆ        z            and R(K)=∫K(z)dz are kernel dependent constants, z1 is the first component of the vector z, and  is the Laplace operator.
The tradeoff of bias versus variance can be observed in equations (3) and (4). The bias is proportional to h2, which means that smaller bandwidths give a less biased estimator. However, decreasing h implies an increase in the variance which is proportional to n−1h−d. Thus, for a fixed bandwidth estimator, h should be chosen so that an optimal compromise is achieved between the bias and variance over all x∈Rd, i.e., minimizes the minimum integrated squared error (MISE):
                              MISE          ⁡                      (            x            )                          =                  E          ⁢                      ∫                                                            (                                                                                    f                        ^                                            ⁡                                              (                        x                        )                                                              -                                          f                      ⁡                                              (                        x                        )                                                                              )                                2                            ⁢                                                ⅆ                  x                                .                                                                        (        5        )            
Nevertheless, the resulting bandwidth formula is of little practical use, since it depends on the Laplacian of the unknown density being estimated.
As noted above, an efficient data-driven methods for bandwidth selection is the “plug-in” rule, which has been proven to be superior to least squares cross validation and biased cross-validation. A practical one-dimensional algorithm based on the plug-in rule method is described below in Section A. For a discussion on the multivariate case, see M. P. Wand, et al., “Kernel Smoothing”, page 108, London: Chapman & Hall, 1995.
Note that these data-driven bandwidth selectors work well for multimodal data, their only assumption being a certain smoothness in the underlying density. However, the fixed bandwidth affects the estimation performance, by undersmoothing the tails and over-smoothing the peaks of the density. The performance also decreases when the data exhibits local scale variations.
There are known methods for estimating variable bandwidths (e.g., Balloon and Sample Point Estimators). In particular, according to expression (1), the bandwidth h can be varied in two ways. First, by selecting a different bandwidth h=h(x) for each estimation point x, one can define the balloon density estimator
                                                        f              1                        ^                    ⁡                      (            x            )                          =                              1                          n              ⁢                                                          ⁢                                                h                  ⁡                                      (                    x                    )                                                  d                                              ⁢                                    ∑                              i                =                1                            n                        ⁢                                                  ⁢                          K              ⁡                              (                                                      x                    -                                          x                      i                                                                            h                    ⁡                                          (                      x                      )                                                                      )                                                                        (        6        )            In this case, the estimate of f at x is the average of identically scaled kernels centered at each data point.
Second, by selecting a different bandwidth h=h(xi) for each data point xi, we obtain the sample point density estimator:
                                                        f              2                        ^                    ⁡                      (            x            )                          =                              1            n                    ⁢                                    ∑                              i                =                1                            n                        ⁢                                          1                                                                                          ⁢                                                            h                      ⁡                                              (                                                  x                          i                                                )                                                              d                                                              ⁢                                                          ⁢                              K                ⁡                                  (                                                            x                      -                                              x                        i                                                                                    h                      ⁡                                              (                                                  x                          i                                                )                                                                              )                                                                                        (        7        )            for which the estimate of f at x is the average of differently scaled kernels centered at each data point.
While the balloon estimator has more intuitive appeal, its performance improvement over the fixed bandwidth estimator is insignificant. When the bandwidth h(x) is chosen as a function of the k-th nearest neighbor, the bias and variance are still proportional to h2 and n−1h−d, respectively. In addition, the balloon estimators usually fail to integrate to one.
The sample point estimators, on the other band, are themselves densities, being non-negative and integrating to one. Their most attractive property is that particular choice of h(xi) considerably reduces the bias. Indeed, when h(xi) is taken to be reciprocal to the square root of f(xi)
                              h          ⁡                      (                          x              i                        )                          =                                            h              0                        ⁡                          [                              λ                                  f                  ⁡                                      (                                          x                      i                                        )                                                              ]                                            1            /            2                                              (        8        )            the bias becomes proportional to h4, while the variance remains unchanged, proportional to n−1h−d. In equation (8), h0 represents a fixed bandwidth and 8 is a proportionality constant.
Since f(xi) is unknown, it has to be estimated from the data. The practical approach is to use one of the methods described in above to find h0 and an initial estimate of the density (called the pilot estimate) of f denoted by {tilde over (f)}. Note that by using {tilde over (f)} instead of f in equation (8), the nice properties of the sample point estimators in equation (7) remain unchanged. It is known that the method is insensitive to the fine detail of the pilot estimate. The only provision that should be taken is to bound the pilot density away from zero.
The final estimate (equation (7)), however, is influenced by the choice of the proportionality constant 8, which divides the range of density values into low and high densities. When the local density is low, i.e., {tilde over (f)}(xi)<λ, h(xi) increases relative to h0 implying more smoothing for the point xi. For data points that verify {tilde over (f)}(xi)>λ, the bandwidth becomes narrower.
A good initial choice is to take 8 as the geometric mean of
            {                        f          ~                ⁡                  (                      x            i                    )                    }              i      =              1        ⁢                                  ⁢        …        ⁢                                  ⁢        n              .Experiments have shown that for superior results, a certain degree of tuning is required for 8. Nevertheless, the sample point estimator has been proven to be better than the fixed bandwidth estimator.
One fixed bandwidth method that has been proposed for the detection of modes is the “Mean Shift” method, the efficacy of which has been demonstrated in computer vision problems such as tracking and segmentation (see, e.g., Comaniciu, et al., “Mean Shift: A Robust Approach Toward Feature Space Analysis,” IEEE Transactions Pattern Analysis and Machine Intelligence, vol. 24, no. 5. pp. 603–619, 2002; and K. Fukunaga, et. al., “The Estimation of the Gradient of a Density Function,” IEEE Trans. Info. Theory, Vol. IT-21, 32–40, 1975).
A limitation of the mean shift procedure is that it involves the specification of a scale parameter. While results obtained appear satisfactory, when the local characteristics of the feature space differs significantly across data, it is difficult to find an optimal global bandwidth for the mean shift procedure.
Based on the above, there is a continuing need for fast and reliable methods for data-driven automatic bandwidth selection for purposes of feature space partitioning and analysis in, e.g., image processing applications.