Clustering is an important known technique in exploratory data analysis, where a priori knowledge of the distribution of the observed data is not available. Known prior art Partitional clustering methods, that divide the data according to natural classes present in it, have been used in a large variety of scientific disciplines and engineering applications that include pattern recognition, learning theory, astrophysics, medical image and data processing, image compression, satellite data analysis, automatic target recognition, as well as, speech and text recognition, and other types of data analysis.
The goal in collecting and processing data is to find a partition of a given data set into several groups. Each group indicates the presence of a distinct category in the data. The problem of partitional clustering is usually formally stated as follows. Determine the partition of N given patterns {v.sub.i }.sub.i=1.sup.N into groups, called clusters, such that the patterns of a cluster are more similar to each other than to patterns in different clusters. It is assumed that either d.sub.ij, the measure of dissimilarity between patterns v.sub.i and v.sub.j is provided, or that each pattern v.sub.i is represented by a point x.sub.i in a D-dimensional metric space, in which case d.sub.ij =.vertline.x.sub.i -x.sub.j .vertline..
The two main approaches to partitional clustering are called parametric and non-parametric. In parametric approaches some knowledge of the clusters' structure is assumed, and, in most cases, patterns can be represented by points in a D-dimensional metric space. For instance, each cluster can be parameterized by a center around which the points that belong to it are spread with a locally Gaussian distribution. In many cases the assumptions are incorporated in a global criterion whose minimization yields the "optimal" partition of the data. The goal is to assign the data points so that the criterion is minimized. Classical prior art approaches or techniques are variance minimization, maximal likelihood and fitting Gaussian mixtures. A nice example of variance minimization is the method based on principles of statistical physics, which ensures an optimal solution under certain conditions. This method has given rise to other mean-field methods for clustering data. Classical examples of fitting Gaussian mixtures are the Isodata algorithm or its sequential relative, the K-means algorithm in statistics, and soft competition in neural networks.
In many cases of interest, however, there is no a priori knowledge about the data structure. Then, one usually adopts non-parametric approaches, which make less assumptions about the model, and, therefore, are suitable to handle a wider variety of clustering problems. Usually these methods employ a local criterion, against which some attribute of the local structure of the data is tested, to construct the clusters. Typical examples are hierarchical techniques such as agglomerative and divisive methods. These algorithms suffer, however, from at least one of the following limitations: (a) high sensitivity to initialization; (b) poor performance when the data contains overlapping clusters; (c) inability to handle variabilities in cluster shapes, cluster densities and cluster sizes. The most serious problem is the lack of cluster validity criteria; in particular, none of these methods provides an index that could be used to determine the most significant partitions among those obtained in the entire hierarchy. All of the algorithms used in these techniques tend to create clusters even when no natural clusters exist in the data.