Clustering of data is a data processing task in which clusters are identified in a structured set of raw data. Typically, the raw data comprises a large set of records with each record having the same or a similar format. Each field in a record can take any of a number of categorical or numerical values. Data clustering aims to group these records into clusters such that records belonging to the same cluster have a high degree of similarity.
A variety of algorithms are known for data clustering. The K-means algorithm minimizes the sum of squared Euclidean distances between the records and the centers of their clusters for a given number of clusters. The Kohonen algorithm is based on a neural net and also uses Euclidean distances. IBM's demographic algorithm relies on the sum of internal similarities minus the sum of external similarities as a clustering criterion. These and other clustering criteria are utilized in an iterative process of finding clusters.
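The K-means criterion mentioned above can be illustrated with a minimal sketch. It assumes the conventional formulation, in which the squared Euclidean distance to the nearest center is minimized, and it assumes the initial centers are supplied by the caller. The function names (`k_means`, `within_cluster_cost`) and the sample data are hypothetical and chosen only for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def k_means(points, initial_centers, max_iterations=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iterations):
        # Assignment step: group each point with its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest_center(p, centers)].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                mean = tuple(sum(m[d] for m in members) / len(members)
                             for d in range(len(center)))
                new_centers.append(mean)
            else:
                new_centers.append(center)  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels


def within_cluster_cost(points, centers, labels):
    """Clustering criterion: sum of squared distances to the assigned centers."""
    return sum(squared_distance(p, centers[l]) for p, l in zip(points, labels))


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = k_means(points, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
cost = within_cluster_cost(points, centers, labels)
```

The number of clusters is fixed by the number of initial centers, which reflects the dependence of the result on input parameters chosen before clustering starts.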
One field of application of data clustering is data mining. U.S. Pat. No. 6,112,194 describes a method for data mining that includes a feedback mechanism for monitoring the performance of mining tasks. A user-selected mining technique type is received for the data mining operation. A quality measure type is identified for the user-selected mining technique type. The user-selected mining technique type for the data mining operation is processed, and a quality indicator is measured using the quality measure type. The measured quality indicator is displayed while the user-selected mining technique type is being processed for the data mining operation.
U.S. Pat. No. 6,115,708 describes a method for refining the initial conditions for clustering, with applications to small and large database clustering. The patent discloses how this method is applied to the popular K-means clustering algorithm and how refined initial starting points lead to improved solutions. The technique can be used as an initializer for other clustering solutions. The method is based on an efficient technique for estimating the modes of a distribution and runs in time guaranteed to be less than the overall clustering time for large data sets. The method is also scalable and hence can be used efficiently on huge databases to refine starting points for scalable clustering algorithms in data mining applications.
U.S. Pat. No. 6,100,901 describes a method for visualizing a multi-dimensional data set in which the data set is clustered into k clusters, with each cluster having a centroid. Either two distinct current centroids or three distinct non-collinear current centroids are selected. A current 2-dimensional cluster projection is generated based on the selected current centroids. In the case when two distinct current centroids are selected, two distinct target centroids are selected, with at least one of the two target centroids being different from the two current centroids.
U.S. Pat. No. 5,857,179 describes a computer method for clustering documents and automatically generating cluster keywords. An initial document-by-term matrix is formed, each document being represented by a respective M-dimensional vector, where M represents the number of terms or words in a predetermined domain of documents. The dimensionality of the initial matrix is reduced to form resultant vectors of the documents. The resultant vectors are then clustered such that correlated documents are grouped into respective clusters. For each cluster, the terms having the greatest impact on the documents in that cluster are identified. The identified terms represent key words of each document in that cluster. Further, the identified terms form a cluster summary indicative of the documents in that cluster.
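Two of the steps described above, forming the document-by-term matrix and identifying the highest-impact terms per cluster, can be sketched as follows. This is an illustration under stated assumptions and not the patented method: the dimensionality reduction and clustering steps are omitted and the cluster labels are assumed to be given, "impact" is approximated by the summed term count within a cluster, and the function names and sample documents are hypothetical.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return the sorted vocabulary and one M-dimensional count vector per document."""
    vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
    index = {term: i for i, term in enumerate(vocabulary)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for word, count in Counter(doc.lower().split()).items():
            row[index[word]] = count
        matrix.append(row)
    return vocabulary, matrix


def cluster_keywords(vocabulary, matrix, labels, top_n=2):
    """Pick the top_n terms per cluster, ranked by summed count (an assumed impact measure)."""
    totals = {}
    for row, label in zip(matrix, labels):
        accumulated = totals.setdefault(label, [0] * len(vocabulary))
        for i, count in enumerate(row):
            accumulated[i] += count
    keywords = {}
    for label, accumulated in totals.items():
        # Rank by descending count, breaking ties alphabetically.
        ranked = sorted(range(len(vocabulary)),
                        key=lambda i: (-accumulated[i], vocabulary[i]))
        keywords[label] = [vocabulary[i] for i in ranked[:top_n] if accumulated[i] > 0]
    return keywords


documents = ["cat dog cat", "dog cat", "stock market stock", "market stock"]
labels = [0, 0, 1, 1]  # assumed output of a separate clustering step
vocabulary, matrix = document_term_matrix(documents)
keywords = cluster_keywords(vocabulary, matrix, labels)
```

Here M equals the vocabulary size, and the keyword list for each cluster serves as its summary.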
Further, a variety of supervised learning techniques are known from the prior art of neural networks. Supervised learning requires pairs of inputs and resulting outputs to be presented to the network during the training process. Back propagation, for example, uses supervised learning and makes adjustments during training so that the value computed by the neural network approaches the actual value as the network learns from the data presented. Supervised learning is used in techniques for predicting classifications as well as for predicting numerical values.
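The training principle described above can be sketched with a single linear unit trained by gradient descent on input and output pairs. This is the simplest case of the error-driven adjustment that back propagation extends to multi-layer networks; it is not a full back propagation implementation. The function name, learning rate, epoch count, and training pairs are assumptions made for illustration.

```python
def train_linear_unit(pairs, learning_rate=0.05, epochs=2000):
    """Fit predicted = weight * x + bias to (input, target) pairs."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, target in pairs:
            predicted = weight * x + bias
            error = predicted - target
            # Adjust the parameters against the gradient of the squared error
            # so that the computed value approaches the actual value.
            weight -= learning_rate * error * x
            bias -= learning_rate * error
    return weight, bias


# Training pairs generated from the relation target = 2 * x + 1.
pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
weight, bias = train_linear_unit(pairs)
prediction = weight * 4.0 + bias  # numerical prediction for an unseen input
```

Because every training pair carries its target output, the unit can measure its own error at each step, which is what distinguishes this setting from clustering, where no target outputs are available.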
Cohn, D. et al., “Semi-Supervised Clustering With User Feedback,” AAAI 2000, describes a clustering approach in which the user can iteratively provide feedback to the clustering algorithm after each clustering step. One disadvantage of this approach is that the clustering must be performed iteratively, which requires a disproportionate amount of processing power and time. Another disadvantage is that the user must select suitable pairs of data records from a typically very large set of records.
What is therefore needed is a system and associated method for determining input parameters for a clustering algorithm that will minimize the number of processing iterations and the processing time while making efficient use of processing resources. The need for such a system has heretofore remained unsatisfied.