Data clustering is important in a variety of fields including data mining, statistical data analysis, data compression, and vector quantization. Clustering has been formulated in various ways in the machine learning, pattern recognition, optimization, and statistics literature. The general agreement is that the problem is about finding groups (clusters) in data that consist of data items which are similar to each other. The most general definition of the clustering problem is to view it as a density estimation problem. The value of the hidden cluster variable (the cluster ID) specifies the model from which the data items that belong to that cluster are drawn. Hence the data is assumed to arrive from a mixture model and the mixing labels (cluster identifiers) are hidden.
In general, a mixture model M having K clusters $C_i$, $i = 1, \ldots, K$, assigns a probability to a data point x as follows: $\Pr(x \mid M) = \sum_{i=1}^{K} W_i \cdot \Pr(x \mid C_i, M)$, where the $W_i$ are called the mixture weights. The problem of clustering is identifying the properties of the clusters $C_i$. Usually it is assumed that the number of clusters K is known and the problem is to find the best parameterization of each cluster model. A popular technique for estimating the parameters is the EM algorithm.
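As an illustration of how a mixture model assigns a probability to a data point, the following minimal sketch evaluates a weighted sum of per-cluster densities. It assumes one-dimensional Gaussian cluster models; the function names `gaussian_pdf` and `mixture_density` and the parameters `means` and `stds` are hypothetical and do not come from the text.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian cluster model at x (assumed model form)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_density(x, weights, means, stds):
    """Weighted sum over clusters: sum_i W_i * Pr(x | C_i)."""
    return sum(w * gaussian_pdf(x, m, s)
               for w, m, s in zip(weights, means, stds))


# Two equally weighted clusters centered at 0 and 5.
density = mixture_density(1.0, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
```

The mixture weights are expected to be non-negative and to sum to one, so that the result is itself a valid density.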
There are various approaches to solving the optimization problem of finding a good set of parameters. The most effective class of methods is known as the iterative refinement approach. The basic algorithm goes as follows:
1. Initialize the model parameters, producing a current model.
2. Decide memberships of the data items to clusters, assuming that the current model is correct.
3. Re-estimate the parameters of the current model assuming that the data memberships obtained in 2 are correct, producing a new model.
4. If the current model and new model are sufficiently close to each other, terminate, else go to 2.
As an example, a so-called K-means clustering evaluation starts with a random choice of cluster centroids, or means, for the clusters. In a one-dimensional problem the centroid is a single number, the average of the data points in a given cluster, but in an n-dimensional problem the mean is a vector of n components. Data items are gathered and each is assigned to a cluster based on its distance to the cluster centroid. Once the data points have been assigned, the centroids are recalculated and the data points are reassigned. Since the centroid locations (in n dimensions) change when the centroids are recalculated (recall that they were randomly assigned in the first iteration and it is unlikely that they are correct), some data points will switch centroids. The centroids are then calculated again. This process terminates when the assignments, and hence the centroids, cease to change. The output from the K-means clustering process is K centroids and the number of data points assigned to each centroid.
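The K-means procedure just described can be sketched as follows. This is a minimal illustration and not a definitive implementation: the names `nearest` and `kmeans`, the `max_iter` cap, and the choice to leave an empty cluster's centroid unchanged are assumptions made for the example.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def kmeans(data, centroids, max_iter=100):
    """Alternate assignment and re-estimation until assignments stop changing."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        new_assignment = [nearest(p, centroids) for p in data]
        # Step 4: stop once no point switches cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(data, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    counts = [assignment.count(i) for i in range(len(centroids))]
    return centroids, counts


data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, counts = kmeans(data, [(0.0, 0.0), (10.0, 10.0)])
```

The function returns the K centroids together with the number of points assigned to each, matching the output described above. The starting centroids are passed in explicitly because the result depends on them.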
The present invention is concerned with step 1: the initialization step of choosing the starting centroids.
The most widely used clustering procedures in the pattern recognition and statistics literature are members of the above family of iterative refinement approaches: the K-means algorithm, and the EM algorithm. The difference between the EM and K-means is essentially in the membership decision in step 2. In K-means, a data item is assumed to belong to a single cluster, while in the EM procedure each data item is assumed to belong to every cluster but with a different probability. This of course affects the update step (3) of the algorithm. In K-means each cluster is updated based strictly on its membership. In EM each cluster is updated by the entire data set according to fractional memberships determined by the relative probability of membership.
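The contrast between the two membership rules can be illustrated with a small sketch. It assumes one-dimensional data and, for the EM-style rule, Gaussian clusters sharing a common standard deviation; the function names and the `std` and `weights` parameters are hypothetical.

```python
import math


def hard_membership(x, means):
    """K-means style: the point belongs entirely to the nearest mean."""
    best = min(range(len(means)), key=lambda i: abs(x - means[i]))
    return [1.0 if i == best else 0.0 for i in range(len(means))]


def soft_membership(x, means, std=1.0, weights=None):
    """EM style: relative probability of each cluster given x.

    Assumes 1-D Gaussian clusters with a shared std, so the Gaussian
    normalizing constant cancels when the scores are normalized.
    """
    k = len(means)
    weights = weights or [1.0 / k] * k
    logs = [math.log(w) - 0.5 * ((x - m) / std) ** 2
            for w, m in zip(weights, means)]
    top = max(logs)  # subtract the maximum to avoid underflow
    scores = [math.exp(v - top) for v in logs]
    total = sum(scores)
    return [s / total for s in scores]


def update_mean(xs, memberships):
    """Step 3 for one cluster: membership-weighted mean of the data."""
    return sum(r * x for r, x in zip(memberships, xs)) / sum(memberships)
```

With hard memberships, `update_mean` reduces to the plain average of the cluster's own members; with soft memberships, every data item contributes in proportion to its fractional membership.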
Note that given the initial conditions of step 1, the algorithm is deterministic and the solution is determined by the choice of an initial or starting point. In both K-means and EM, there is a guarantee that the procedure will converge after a finite number of iterations. Convergence is to a local optimum of the objective function (for example, the likelihood of the data given the model), and the particular local optimum is determined by the initial starting point (step 1). It is well known that clustering algorithms are extremely sensitive to initial conditions. Most methods for guessing an initial solution simply pick a random guess. Other methods place the initial means or centroids on uniform intervals in each dimension. Some methods take the mean of the global data set and perturb it K times to get the K initial means, or simply pick K random points from the data set. In most situations, initialization is done by randomly picking a set of starting points from the range of the data.
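Three of the conventional initialization strategies mentioned above might be sketched as follows. The function names, the `rng` argument, and the `scale` noise level are hypothetical, and Gaussian noise is only one possible way to perturb the global mean.

```python
import random


def init_random_points(data, k, rng):
    """Pick K distinct data items as the starting means."""
    return rng.sample(data, k)


def init_from_range(data, k, rng):
    """Draw K points uniformly from the per-dimension range of the data."""
    lows = [min(col) for col in zip(*data)]
    highs = [max(col) for col in zip(*data)]
    return [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
            for _ in range(k)]


def init_perturbed_mean(data, k, rng, scale=0.1):
    """Perturb the global mean K times (scale is an assumed noise level)."""
    mean = [sum(col) / len(data) for col in zip(*data)]
    return [tuple(m + rng.gauss(0.0, scale) for m in mean)
            for _ in range(k)]


rng = random.Random(0)  # fixed seed so a run can be repeated
data = [(0.0, 0.0), (1.0, 2.0), (4.0, 6.0), (5.0, 8.0)]
starts = init_random_points(data, 2, rng)
```

Because the refinement is deterministic once the starting points are fixed, changing the seed is enough to reach a different local solution on the same data.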
In the above clustering framework, a solution of the clustering problem is a parameterization of each cluster model. This parameterization can be obtained by determining the modes (maxima) of the joint probability density of the data and placing a cluster at each mode. Hence one approach to clustering is to estimate the density and then proceed to find the "bumps" in the estimated density. Density estimation using a technique such as kernel density estimation is difficult, especially in high-dimensional spaces. Bump hunting is also difficult to perform.
The K-means clustering process is a standard technique for clustering and is used in a wide array of applications in pattern recognition, signal processing, and even as a way to initialize the more expensive EM clustering algorithm. The K-means procedure uses three inputs: the number of clusters K, a set of K initial starting points, and the data set to be clustered. Each cluster is represented by its mean (centroid). Each data item is assigned membership in the cluster having the nearest mean to it (step 2). Distance is measured by the Euclidean distance (or L2 norm).
For a data point (d-dimensional vector) x and mean $\mu$, the distance is given by: $\lVert x - \mu \rVert_2 = \sqrt{\sum_{j=1}^{d} (x_j - \mu_j)^2}$. A cluster model is updated by computing the mean of its members (step 3).
To specify the algorithm in the framework introduced so far, we need to describe the model used. The model at each cluster is assumed to be a Gaussian. For each cluster, the Gaussian is centered at the mean of the cluster. It is assumed to have a diagonal covariance with constant entries on the diagonal, the same for all clusters. Note that a harsh cluster membership decision for a data point leads to assigning the point to the nearest cluster (since in this case the probability assigned to the point by a cluster decreases monotonically with its Euclidean distance from the cluster mean). Finally, the mixture weights ($W_i$) in K-means are all assumed equal.
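The link between the Gaussian model and the nearest-mean rule can be made explicit with a short derivation. The symbol $\sigma^2$ for the common diagonal covariance entry is an assumption introduced here for illustration; d is the dimension of x and $\mu_i$ is the mean of cluster $C_i$.

```latex
\begin{align*}
\Pr(x \mid C_i) &= (2\pi\sigma^2)^{-d/2}
  \exp\!\left(-\frac{\lVert x - \mu_i \rVert_2^2}{2\sigma^2}\right), \\
\arg\max_i \; W_i \Pr(x \mid C_i)
  &= \arg\max_i \; \exp\!\left(-\frac{\lVert x - \mu_i \rVert_2^2}{2\sigma^2}\right)
  \qquad \text{(since } W_i = 1/K \text{ for all } i\text{)} \\
  &= \arg\min_i \; \lVert x - \mu_i \rVert_2 .
\end{align*}
```

Under these assumptions, choosing the most probable cluster for a point is the same as choosing the cluster whose mean is nearest in Euclidean distance.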
Note that by this definition, K-means is only defined over numeric (continuous-valued) data since the ability to compute the mean is a requirement. A discrete version of K-means exists and is sometimes referred to as harsh EM.