This application claims the foreign priority benefits under 35 U.S.C. §119 of European application No. 02006029.9 filed on Mar. 16, 2002, which is incorporated herein by reference.
1. Field of the Invention
The present invention relates to the field of data clustering and in particular to clustering algorithms and quality determination.
2. Background and Prior Art
Clustering of data is a data processing task in which clusters are identified in a structured set of raw data. Typically, the raw data consists of a large set of records, each record having the same or a similar format. Each field in a record can take any of a number of logical, categorical, or numerical values. Data clustering aims to group such records into clusters such that records belonging to the same cluster have a high degree of similarity.
A variety of algorithms are known for data clustering. The K-means algorithm relies on the minimal sum of Euclidean distances to the cluster centers, taking into consideration the number of clusters. The Kohonen algorithm is based on a neural net and also uses Euclidean distances. IBM's demographic algorithm relies on the sum of internal similarities minus the sum of external similarities as a clustering criterion. Those and other clustering criteria are utilized in an iterative process of finding clusters.
A common disadvantage of such prior art clustering algorithms is that different clustering algorithms applied to the same set of data may deliver largely different results. Even if the same algorithm is applied to the same set of data using a different set of parameters as a starting condition, a different result is likely to occur. In the prior art, no objective criterion exists to compare the results of such clustering operations.
One field of application of data clustering is data mining. From U.S. Pat. No. 6,112,194 a method for data mining including a feedback mechanism for monitoring performance of mining tasks is known. A user selected mining technique type is received for the data mining operation. A quality measure type is identified for the user selected mining technique type. The user selected mining technique type for the data mining operation is processed and a quality indicator is measured using the quality measure type. The measured quality indication is displayed while processing the user selected mining technique type for the data mining operations.
From U.S. Pat. No. 6,115,708 a method for refining the initial conditions for clustering with applications to small and large database clustering is known. It is disclosed how this method is applied to the popular K-means clustering algorithm and how refined initial starting points indeed lead to improved solutions. The technique can be used as an initializer for other clustering solutions. The method is based on an efficient technique for estimating the modes of a distribution and runs in time guaranteed to be less than overall clustering time for large data sets. The method is also scalable and hence can be efficiently used on huge databases to refine starting points for scalable clustering algorithms in data mining applications.
From U.S. Pat. No. 6,100,901 a method for visualizing a multi-dimensional data set in which the multi-dimensional data set is clustered into k clusters, with each cluster having a centroid, is known. Either two distinct current centroids or three distinct non-collinear current centroids are selected. A current 2-dimensional cluster projection is generated based on the selected current centroids. In the case when two distinct current centroids are selected, two distinct target centroids are selected, with at least one of the two target centroids being different from the two current centroids.
From U.S. Pat. No. 5,857,179 a computer method for clustering documents and automatic generation of cluster keywords is known. An initial document by term matrix is formed, each document being represented by a respective M dimensional vector, where M represents the number of terms or words in a predetermined domain of documents. The dimensionality of the initial matrix is reduced to form resultant vectors of the documents. The resultant vectors are then clustered such that correlated documents are grouped into respective clusters. For each cluster, the terms having greatest impact on the documents in that cluster are identified. The identified terms represent key words of each document in that cluster. Further, the identified terms form a cluster summary indicative of the documents in that cluster.
A principal object of the present invention is to provide a method, data processing system and computer program product for data clustering and quality determination such that the qualities of clustering results can be compared on an objective basis. The quality index for a clustering result obtained in accordance with the invention is independent of the clustering algorithm used.
Rather than relying on the clustering algorithm itself for quality determination, the invention relies on a statistical analysis of the clustering result to determine the quality of the clustering.
It is a particular advantage of the present invention that the quality measure is objective, i.e. independent of the method employed to perform the clustering, and that it is normalized. This is why the present invention can be employed for any clustering method. Further, the results provided by different clustering methods can be compared in an objective way in order to identify clustering results having a high quality.
In accordance with a preferred embodiment of the present invention a quality measure is determined for an individual cluster of the data clustering result by means of a set of observed values. The set of observed values is determined by mapping the cluster identifier of the cluster for which the quality measure is to be determined to a predefined numerical value such as "1".
The cluster identifiers of the other clusters of the clustering result are mapped to another predefined numerical value such as "0". One way of creating the set of observed values for the purposes of determining the quality measure for one of the clusters is to organize the data records which have been clustered into a table comprising the attribute values for each of the records, the cluster identifier which has been assigned to the records, and an additional column for the mapped cluster identifiers.
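The mapping described above can be illustrated with a minimal Python sketch. The function name `map_cluster_ids`, the sample records, and the cluster identifiers "A", "B" and "C" are hypothetical and serve only as an illustration; they do not appear in the specification.

```python
from typing import Hashable, List, Sequence


def map_cluster_ids(cluster_ids: Sequence[Hashable], target: Hashable) -> List[int]:
    """Map the identifier of the cluster under evaluation to 1 and all others to 0."""
    return [1 if cid == target else 0 for cid in cluster_ids]


# Hypothetical clustered records: (attribute values, assigned cluster identifier).
records = [
    ((25, 1200.0), "A"),
    ((31, 1500.0), "A"),
    ((58, 300.0), "B"),
    ((62, 250.0), "C"),
]

cluster_ids = [cid for _, cid in records]
mapped = map_cluster_ids(cluster_ids, target="A")

# Table rows: attribute values, cluster identifier, mapped cluster identifier.
table = [(*attrs, cid, m) for (attrs, cid), m in zip(records, mapped)]
for row in table:
    print(row)
```

To evaluate another cluster, the same call is repeated with a different `target`, which yields a new mapped column over the same records.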
By means of this set of observed values, comprising the attribute values and the mapped cluster identifiers for each of the records that have been clustered, a normalized statistical coefficient is calculated.
In accordance with a preferred embodiment of the invention the normalized statistical coefficient is the R squared coefficient. The R squared coefficient is also called the "coefficient of determination". The R squared coefficient is the square of Pearson's correlation coefficient.
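As an illustration only, the R squared coefficient between one numerical attribute and the mapped cluster identifiers might be computed as sketched below. The text does not state how several attributes are combined into one coefficient, so the sketch is limited to a single attribute; that restriction, the function names and the sample values are assumptions.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """R squared as the square of Pearson's correlation coefficient."""
    return pearson_r(x, y) ** 2


# Hypothetical attribute column and mapped cluster identifiers (1 = evaluated cluster).
attribute = [1.0, 2.0, 8.0, 9.0]
mapped = [1, 1, 0, 0]

quality = r_squared(attribute, mapped)
print(round(quality, 4))  # approximately 0.98 for these sample values
```

Because the coefficient is a squared correlation, it always lies between 0 and 1, which is what makes it usable as a normalized quality measure.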
Pearson's correlation coefficient is as such known from statistics. It is used in the prior art for comparisons between different data sets. Alternatively, Spearman's correlation coefficient can be used instead of Pearson's correlation coefficient.
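A hedged sketch of the Spearman alternative follows. It uses the standard definition of Spearman's coefficient as Pearson's coefficient applied to ranks, with tied values receiving their average rank. The function names and sample values are hypothetical, and the handling of ties is an assumption, since the text does not address it.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient (both series must be non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sx * sy)


def spearman_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's coefficient: Pearson's coefficient of the rank-transformed series."""
    return pearson_r(average_ranks(x), average_ranks(y))


attribute = [1.0, 2.0, 8.0, 9.0]  # hypothetical attribute column
mapped = [1, 1, 0, 0]             # hypothetical mapped cluster identifiers
quality = spearman_r(attribute, mapped) ** 2
```

Since the mapped cluster identifiers take only two values, they contain many ties, so the tie rule affects the result noticeably.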
In accordance with a further preferred embodiment of the invention an overall quality measure is calculated for the result of the data clustering integrating the individual quality measures obtained separately for the individual clusters. This is done by calculating a weighted average of the quality measures of the clusters. The number of records within a given cluster serves as the weighting factor (i.e. weighting coefficient).
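The weighted average can be sketched as follows, with the number of records in each cluster as the weight. The function name `overall_quality` and the per-cluster values are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Mapping


def overall_quality(
    cluster_quality: Mapping[Hashable, float],
    cluster_sizes: Mapping[Hashable, int],
) -> float:
    """Average of the per-cluster quality measures, weighted by records per cluster."""
    total = sum(cluster_sizes[c] for c in cluster_quality)
    if total == 0:
        raise ValueError("clusters contain no records")
    weighted = sum(cluster_quality[c] * cluster_sizes[c] for c in cluster_quality)
    return weighted / total


# Hypothetical per-cluster quality measures and record counts.
quality = {"A": 0.9, "B": 0.5, "C": 0.2}
sizes = {"A": 50, "B": 30, "C": 20}

overall = overall_quality(quality, sizes)
print(round(overall, 4))  # (0.9*50 + 0.5*30 + 0.2*20) / 100, approximately 0.64
```

Weighting by record count means that a small cluster of poor quality lowers the overall measure less than a large one does.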
Further, the present invention also makes it possible to improve a given data clustering method. This is done by integrating the quality determination of the data clustering result provided by a given data clustering method within the data clustering procedure. For example, after the data clustering has been performed in a first iteration, the quality is determined for each of the clusters.
Those clusters that have a low quality measure are selected to improve the quality of the clustering. This can be done by hierarchical clustering. To perform the hierarchical clustering, the cluster results of the first iteration are subjected to a successive further clustering operation.
To perform the successive clustering operation, either the same data clustering method as in the first iteration or a different one can be selected. After the further clustering has been done for the selected clusters, the quality is determined again to check whether the quality has improved. If necessary, further iterations are performed until a sufficient quality measure has been reached.
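The iterative procedure can be outlined with the following sketch. It is an assumption-laden illustration rather than the claimed method: `refine`, `cluster_fn`, `quality_fn`, `threshold` and `max_iterations` are hypothetical names, and the toy clustering and quality functions merely stand in for a real clustering method and for the normalized statistical coefficient.

```python
from typing import Callable, Hashable, List, Sequence


def refine(
    records: Sequence[float],
    labels: Sequence[Hashable],
    cluster_fn: Callable[[List[float]], List[Hashable]],
    quality_fn: Callable[[Sequence[float], Sequence[Hashable], Hashable], float],
    threshold: float = 0.5,
    max_iterations: int = 3,
) -> List[Hashable]:
    """Re-cluster low-quality clusters until all reach the threshold or iterations run out."""
    labels = list(labels)
    for _ in range(max_iterations):
        cluster_ids = list(dict.fromkeys(labels))  # distinct identifiers, order kept
        low = [c for c in cluster_ids if quality_fn(records, labels, c) < threshold]
        if not low:
            break  # every cluster has a sufficient quality measure
        for c in low:
            idx = [i for i, lab in enumerate(labels) if lab == c]
            sub_labels = cluster_fn([records[i] for i in idx])
            for i, s in zip(idx, sub_labels):
                labels[i] = (c, s)  # hierarchical identifier: (parent, sub-cluster)
    return labels


def toy_cluster(values: List[float]) -> List[Hashable]:
    """Stand-in clustering method: split one-dimensional values at their mean."""
    mean = sum(values) / len(values)
    return [0 if v < mean else 1 for v in values]


def toy_quality(records, labels, cluster_id) -> float:
    """Stand-in quality measure: 1.0 for a compact cluster, otherwise 0.0."""
    members = [r for r, lab in zip(records, labels) if lab == cluster_id]
    return 1.0 if max(members) - min(members) <= 2 else 0.0


records = [1, 2, 8, 9, 20, 21]
first_iteration = ["A", "A", "B", "B", "B", "B"]
refined = refine(records, first_iteration, toy_cluster, toy_quality)
print(refined)
```

In this toy run, cluster "A" already meets the threshold and is left alone, while cluster "B" is split once into two sub-clusters that each meet it, so the loop stops after the second quality check.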