The wealth of information embedded in huge databases belonging to corporations (e.g., retail, financial, telecom) has spurred a tremendous interest in the areas of knowledge discovery and data mining. Clustering, in data mining, is a useful technique for discovering interesting data distributions and patterns in the underlying data. The problem of clustering can be defined as follows: given n data points in a d-dimensional metric space, partition the data points into k clusters such that the data points within a cluster are more similar to each other than to data points in different clusters.
Existing clustering methods can be broadly classified into partitional and hierarchical methods. Partitional clustering attempts to determine k partitions that optimize a certain criterion function. The square-error criterion is the most commonly used.
The square-error criterion is a good measure of the within-cluster variation across all the partitions. The objective is to find k partitions that minimize the square-error. Thus, square-error clustering tries to make the k clusters as compact and separated as possible, and works well when clusters are compact clouds that are rather well separated from one another. However, when there are large differences in the sizes or geometries of different clusters, as illustrated in FIG. 1a, the square-error method could split large clusters to minimize the square-error (see FIG. 1b). The shading illustrated in FIGS. 1a and 1b indicates the clustering of the data contained within the spheres. Accordingly, in FIG. 1a, each sphere contains only one shading (the appropriate clustering) while in FIG. 1b, the larger sphere has three separate shadings representing three separate clusters for the data points within the sphere.
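The splitting behavior described above can be illustrated with a small sketch. It assumes the usual definition of square-error as the sum of squared distances from each point to the mean of its cluster. The one-dimensional data set and the names `square_error`, `natural`, and `split` are hypothetical and chosen only for illustration.

```python
def centroid(points):
    """Mean of a non-empty list of equal-length coordinate tuples."""
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dims))


def square_error(clusters):
    """Sum, over all clusters, of squared distances from each point to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        mean = centroid(cluster)
        for p in cluster:
            total += sum((x - m) ** 2 for x, m in zip(p, mean))
    return total


# Hypothetical data: one large, spread-out cluster and two small, tight ones.
large = [(float(x),) for x in range(100)]
small_a = [(110.0,)] * 5
small_b = [(115.0,)] * 5

# The intended partition (k = 3): each group is its own cluster.
natural = [large, small_a, small_b]

# An alternative partition (k = 3): the large cluster is split in half
# and the two small clusters are merged.
split = [large[:50], large[50:], small_a + small_b]

print("natural:", square_error(natural))
print("split:  ", square_error(split))
```

In this sketch the partition that splits the large cluster has the lower square-error, so a method that only minimizes this criterion would prefer it over the intended clustering.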
Hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence. An agglomerative method for hierarchical clustering starts with the disjoint set of clusters, which places each input data point in an individual cluster. Pairs of clusters are then successively merged until the number of clusters reduces to k. At each step, the pair of clusters merged are the ones between which the distance is the minimum. There are several measures used to determine distances between clusters.
For example, pairs of clusters whose centroids or means are the closest are merged in a method using the mean as the distance measure (d.sub.mean). This method is referred to as the centroid approach. In a method utilizing the minimum distance as the distance measure, the pair of clusters that are merged are the ones containing the closest pair of points (d.sub.min). This method is referred to as the all-points approach.
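A minimal sketch of the agglomerative procedure with the two distance measures is given below. It is a naive illustration under stated assumptions, not an optimized or authoritative implementation: points are coordinate tuples, distances are Euclidean, and the function names `d_mean`, `d_min`, and `agglomerate` are chosen here for illustration.

```python
import math
from itertools import combinations


def centroid(cluster):
    """Mean of a non-empty list of equal-length coordinate tuples."""
    dims = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dims))


def d_mean(a, b):
    """Centroid approach: distance between the means of two clusters."""
    return math.dist(centroid(a), centroid(b))


def d_min(a, b):
    """All-points approach: distance between the closest pair of points."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, k, cluster_distance):
    """Start with one cluster per point and merge the closest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the minimum distance between them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected by the deletion
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(agglomerate(points, 2, d_mean))
print(agglomerate(points, 2, d_min))
```

On this compact, well-separated example both measures produce the same two clusters. Each merge step rescans all pairs of clusters, which keeps the sketch short at the cost of efficiency.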
The above distance measures have a minimum variance and usually yield the same results if the clusters are compact and well-separated. However, if the clusters are close to one another or close to the same set of outliers (an outlier is a noise impulse which is locally inconsistent with the rest of the data), or if their shapes and sizes are not hyperspherical and uniform, the results of clustering can vary quite dramatically. For example, with the data set shown in FIG. 1a, using the aforementioned centroid approach (d.sub.mean), the distance measure results in clusters that are similar to those obtained by the square-error method shown in FIG. 1b. As stated earlier, the shading illustrated in FIGS. 1a and 1b indicates the clustering of the data contained within the spheres. Accordingly, in FIG. 1a, each sphere contains only one shading (the appropriate clustering) while in FIG. 1b, the larger sphere has three separate shadings representing three separate clusters for the data points within the sphere.
As another example, consider the desired elongated clusters illustrated in FIG. 2a. The shading illustrated in FIGS. 2a to 2c indicates the clustering of the data contained within the elongated ovals. FIG. 2a illustrates each oval as containing only one shading (the appropriate clustering), representing six separate oval-shaped clusters. The small dark region connecting the first two ovals represents outliers that do not belong to any of the six separate oval-shaped clusters.
Using d.sub.mean as the distance measure, however, causes the elongated clusters to be split and portions belonging to neighboring elongated clusters to be merged. The resulting clusters are as shown in FIG. 2b. In FIG. 2b, as indicated by the shading of the ovals, the data within each oval has been split into at least two clusters. The region of outliers has also been associated with a portion of the first and second ovals.
On the other hand, with d.sub.min as the distance measure, the resulting clusters are as shown in FIG. 2c. As indicated by the shading, the points of the two elongated ovals connected by the narrow string of outlier points were merged into a single cluster. This cluster includes the outlier points as well. This "chaining effect" is a drawback of using d.sub.min: a few points located such that they form a bridge between two clusters cause points across the clusters to be grouped into a single elongated cluster.
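The chaining effect can be reproduced with a small sketch. It relies on a known equivalent formulation of d.sub.min merging: processing point pairs in order of increasing distance and joining their clusters, here with a union-find structure, until k clusters remain. The data set (groups `A`, `B`, `C` and a `bridge` of outliers) and the function name `single_link` are hypothetical.

```python
import math
from itertools import combinations


def single_link(points, k):
    """Repeatedly merge the clusters holding the closest pair of points until k remain.

    Returns one label per point; points with equal labels share a cluster.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All point pairs, shortest distance first.
    edges = sorted(
        combinations(range(len(points)), 2),
        key=lambda e: math.dist(points[e[0]], points[e[1]]),
    )
    remaining = len(points)
    for i, j in edges:
        if remaining == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:  # the two points are in different clusters: merge them
            parent[rj] = ri
            remaining -= 1
    return [find(i) for i in range(len(points))]


# Hypothetical data: A and B are far apart but linked by a thin bridge of outliers;
# C is a separate group that lies much closer to A than B does.
A = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
bridge = [(2.5, 0.5), (4.0, 0.5), (5.5, 0.5), (7.0, 0.5), (8.5, 0.5)]
B = [(10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0)]
C = [(0.0, 4.0), (0.0, 5.0), (1.0, 4.0), (1.0, 5.0)]
points = A + bridge + B + C

labels = single_link(points, 2)
print("A and B merged:", labels[0] == labels[9])
print("C kept apart:  ", labels[13] != labels[0])
```

In this sketch the bridge causes A and B to be grouped together, outliers included, even though C is nearer to A than B is.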
From the above discussion, it follows that neither the centroid-based approach (that uses d.sub.mean) nor the all-points approach (based on d.sub.min) works well for non-spherical or arbitrarily shaped clusters. A shortcoming of the centroid-based approach is that it considers only one point as representative of a cluster, that is, the cluster centroid. For a large or arbitrarily shaped cluster, the centroids of its subclusters can be reasonably far apart, thus causing the cluster to be split. The all-points approach, on the other hand, considers all the points within a cluster as representative of the cluster. This other extreme has its own drawbacks, since it makes the clustering method extremely sensitive to outliers and to slight changes in the position of data points.
In addition, when the number N of input data points is large, hierarchical clustering methods break down due to their non-linear time complexity (typically, O(N.sup.2)) and huge I/O costs. Time complexity is normally expressed as an order of magnitude O(), and therefore, a time complexity of O(N.sup.2) indicates that if the size of the input N doubles then the method will take four times as many steps to complete. In order to remedy this problem, recent clustering methods initially perform a preclustering phase in which dense regions of points are represented by compact summaries, and then a centroid-based hierarchical approach is used to cluster the set of summaries (which is much smaller than the original data set).
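As a rough illustration of the quadratic growth, the sketch below counts only the pairwise cluster distances that a naive agglomerative method would have to consider before its first merge, when every point is its own cluster. The function name `initial_pair_count` is hypothetical, and the count is a lower bound on the work rather than a full cost model.

```python
def initial_pair_count(n):
    """Number of distinct cluster pairs when each of n points is its own cluster."""
    return n * (n - 1) // 2


for n in (1000, 2000, 4000):
    print(n, initial_pair_count(n))
```

Each doubling of N multiplies the count by approximately four, consistent with O(N.sup.2) behavior.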
For example, Zhang et al., in An Efficient Data Clustering Method for Very Large Databases, Proceedings of the ACM SIGMOD Conference on Management of Data, pages 103-114, Montreal, Canada, June 1996, describe a preclustering method called BIRCH. In BIRCH, the preclustering approach to reduce input size is incremental and approximate. During preclustering, the entire database is scanned, and cluster summaries are stored in memory in a data structure called the CF-tree. For each successive data point, the CF-tree is traversed to find the closest cluster to it in the tree, and if the point is within a threshold distance of the closest cluster, it is absorbed into it. Otherwise, it starts its own cluster in the CF-tree.
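The absorb-or-start-new rule can be sketched as follows. This is a deliberately simplified illustration and not the BIRCH implementation: it keeps the summaries in a flat list that is scanned linearly instead of a CF-tree, each summary holds only a point count and a coordinate-wise sum, and the threshold test is assumed to be on the distance to the summary's centroid. The name `precluster` is hypothetical.

```python
import math


def precluster(points, threshold):
    """One incremental pass that absorbs each point into the closest summary
    if it lies within `threshold` of that summary's centroid, and otherwise
    starts a new summary. Each summary is [count, coordinate-wise sum]."""
    summaries = []
    for p in points:
        best, best_dist = None, math.inf
        for s in summaries:  # assumption: linear scan stands in for the tree traversal
            center = [total / s[0] for total in s[1]]
            d = math.dist(p, center)
            if d < best_dist:
                best, best_dist = s, d
        if best is not None and best_dist <= threshold:
            best[0] += 1
            best[1] = [a + b for a, b in zip(best[1], p)]
        else:
            summaries.append([1, list(p)])
    return summaries


points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.0, 10.5), (0.2, 0.1)]
for count, linear_sum in precluster(points, threshold=1.0):
    print(count, linear_sum)
```

Because the pass is incremental, the resulting summaries can depend on the order in which the points are scanned, which is one sense in which such preclustering is approximate.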
Once the clusters are generated, a final labeling phase is carried out in which, using the centroids of clusters as seeds, each data point is assigned to the cluster with the closest seed. Using only the centroid of a cluster when redistributing the data in the final phase has problems when clusters do not have uniform sizes and shapes (see FIGS. 3a-3b). The shading illustrated in FIGS. 3a and 3b indicates the cluster labeling of the data contained within the spheres. In FIG. 3a, each sphere contains only one shading (the appropriate cluster labeling) while in FIG. 3b, the larger sphere has two separate shadings representing two separate cluster labels for the data points within the sphere. One shading of the larger sphere matches the shading of the smaller sphere, indicating that data from the larger sphere were labeled as belonging to the cluster of data points within the smaller sphere. This labeling occurred in the final labeling phase because some of the points in the larger cluster were closer to the centroid of the smaller cluster than they were to the centroid of the larger cluster (see FIG. 3b).
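The mislabeling described above follows directly from the nearest-seed rule, as the sketch below suggests. The coordinates are hypothetical: a large cluster of radius roughly 10 centered at the origin and a small cluster of radius roughly 1 centered at (12, 0). The function name `label_by_nearest_seed` is likewise an assumption for illustration.

```python
import math


def label_by_nearest_seed(points, seeds):
    """Assign each point the index of the closest seed (cluster centroid)."""
    return [
        min(range(len(seeds)), key=lambda i: math.dist(p, seeds[i]))
        for p in points
    ]


# Hypothetical centroids: index 0 is the large cluster, index 1 the small one.
seeds = [(0.0, 0.0), (12.0, 0.0)]

# (9, 0) lies inside the large cluster but near its edge;
# (1, 0) lies near the large centroid; (12.5, 0) lies in the small cluster.
points = [(9.0, 0.0), (1.0, 0.0), (12.5, 0.0)]
print(label_by_nearest_seed(points, seeds))
```

The point (9, 0) is 9 units from the large centroid but only 3 units from the small one, so it receives the small cluster's label even though it lies within the large cluster.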
Another approach utilizes a partitional clustering method for large databases which is based on a randomized search. For example, Raymond T. Ng and Jiawei Han, in Efficient and Effective Clustering Methods for Spatial Data Mining, Proceedings of the VLDB Conference, Santiago, Chile, September 1994, describe a clustering method called CLARANS. In CLARANS, each cluster is represented by its medoid, the most centrally located point in the cluster, and the objective is to find the k best medoids that optimize the criterion function. The approach reduces this problem to that of graph search by representing each set of k medoids as a node in the graph, two nodes being adjacent if they have k-1 medoids in common. Initially, an arbitrary node is set to be the current node and a fixed number of iterations is performed. In each iteration, a random neighbor of the current node is set to be the current node if it results in better clustering. The computation of the criterion function for the random neighbor requires the entire database to be examined.
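The randomized search described above can be sketched as follows. The sketch follows only the description given here and is not the published CLARANS procedure, which has further details not covered in this text. It assumes distinct input points and takes the criterion function to be the sum of distances from each point to its nearest medoid. The names `cost` and `randomized_medoid_search` are hypothetical.

```python
import math
import random


def cost(points, medoids):
    """Assumed criterion: total distance from each point to its nearest medoid.
    Note that every evaluation examines all of the points."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)


def randomized_medoid_search(points, k, iterations, seed=0):
    """Move from one set of k medoids to a random neighboring set, which differs
    in exactly one medoid, whenever the neighbor has a lower cost."""
    rng = random.Random(seed)
    current = rng.sample(points, k)  # arbitrary starting node
    current_cost = cost(points, current)
    for _ in range(iterations):
        candidates = [p for p in points if p not in current]
        if not candidates:
            break
        neighbor = list(current)
        # Swap one medoid for a non-medoid: k-1 medoids stay in common.
        neighbor[rng.randrange(k)] = rng.choice(candidates)
        neighbor_cost = cost(points, neighbor)
        if neighbor_cost < current_cost:
            current, current_cost = neighbor, neighbor_cost
    return current, current_cost


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
medoids, total = randomized_medoid_search(points, k=2, iterations=50)
print(medoids, total)
```

Because the search is randomized and runs for a fixed number of iterations, the result is a local improvement over the starting node and is not guaranteed to be the best possible set of medoids.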
Another approach uses an R*-tree to improve the I/O efficiency of randomized searches on large databases by drawing samples from leaf pages to reduce the number of data points (since data points are packed in leaf nodes based on spatial locality, a sample point in the leaf page can be a good representative point), and focusing on relevant points when evaluating the "goodness" of a neighbor.
These methods work well for convex or spherical clusters of uniform size. However, they are unsuitable when clusters have different sizes (see FIGS. 1a-1b), or when clusters are non-spherical (see FIGS. 2a-2c).
Density-based methods have been used in an attempt to cluster arbitrarily shaped collections of points (e.g., ellipsoid, spiral, cylindrical). The density-based method requires the user to specify two parameters that are used to define minimum density for clustering: the radius Eps of the neighborhood of a point and the minimum number of points MinPts in the neighborhood. Clusters are then found by starting from an arbitrary point and, if its neighborhood satisfies the minimum density, including the points in its neighborhood in the cluster. The process is then repeated for the newly added points.
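The expansion process described above can be sketched as follows. The sketch is a simplified illustration, not a reference implementation of any particular published method. It assumes Euclidean distance, counts a point as a member of its own neighborhood, and finds neighbors by scanning all points. The function name `density_cluster` is hypothetical, and the parameters `eps` and `min_pts` correspond to Eps and MinPts.

```python
import math


def density_cluster(points, eps, min_pts):
    """Return one label per point: a cluster number, or None for points
    that were never absorbed into a cluster."""
    labels = [None] * len(points)

    def neighborhood(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighborhood(i)
        if len(seeds) < min_pts:
            continue  # not dense enough to start a cluster here
        labels[i] = cluster_id
        pending = list(seeds)
        while pending:
            j = pending.pop()
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reachable = neighborhood(j)
            if len(reachable) >= min_pts:
                # Repeat the process for the newly added point.
                pending.extend(reachable)
        cluster_id += 1
    return labels


points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),          # dense group
    (5.0, 5.0),                                               # isolated point
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),  # dense group
]
print(density_cluster(points, eps=1.5, min_pts=3))
```

With these hypothetical parameters the two dense groups form separate clusters and the isolated point is left unlabeled. Changing `eps` or `min_pts` can change the outcome substantially.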
While a density-based method can find clusters with arbitrary shapes, it suffers from a number of problems. For example, a density-based method is very sensitive to the parameters Eps and MinPts, which, in turn, are difficult to determine. Furthermore, the method also suffers from the robustness problems that plague the all-points hierarchical clustering method; that is, if there is a dense string of points connecting two clusters, the density-based method could end up merging the two clusters. Also, the method does not perform any preclustering and executes directly on the entire database. As a result, for large databases, the density-based approach could incur substantial I/O costs. Finally, with density-based methods, using random sampling to reduce the input size may not be feasible since there could be substantial variations in the density of points within each cluster in the random sample.
Accordingly, a clustering method that can identify clusters having non-spherical shapes and a wide variety of sizes, while efficiently handling outliers, is still needed.