Data clustering, as a problem in pattern recognition and statistics, belongs to the class of unsupervised learning methods. It essentially involves searching the data for observations that are similar enough to be grouped together. There is a large body of literature on this topic. Algorithms from graph theory, matrix factorization, deterministic annealing, scale space theory, and mixture models have all been used to delineate relevant structures within the input data.
However, the clustering task is inherently subjective. There is no accepted definition of the term “cluster,” and any clustering algorithm will produce some partition. Therefore, the ability to statistically characterize the decomposition and to assess the significance of the resulting number of clusters is an important aspect of the problem.
Approaches for estimating the number of clusters can be divided into global and local methods. The former evaluate some measure over the entire data set and optimize it as a function of the number of clusters. The latter consider individual pairs of clusters and test whether they should be joined together. General descriptions of methods used to estimate the number of clusters are provided in the literature, while one study conducts a Monte Carlo evaluation of 30 indices for cluster validation. These indices are typically functions of the “within” and “between” cluster distances and belong to the class of “internal” measures, in the sense that they are computed from the same observations used to create the partition. Consequently, their distribution is intractable and they are not suitable for hypothesis testing.
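To make the notion of an internal within/between index concrete, the following is a minimal sketch of one such measure in the Calinski–Harabasz style (the function name `ch_index` and the toy data are our own illustration, not taken from the studies discussed here): the index rewards partitions whose clusters are internally tight and mutually well separated.

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz-style internal index: ratio of between-cluster
    to within-cluster dispersion (higher = tighter, better-separated
    clusters), normalized by degrees of freedom."""
    n, k = len(X), len(np.unique(labels))
    overall_mean = X.mean(axis=0)
    between = within = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        centroid = Xc.mean(axis=0)
        between += len(Xc) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((Xc - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))

# Two well-separated toy clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
good = np.array([0] * 20 + [1] * 20)  # matches the true structure
bad = np.array([0, 1] * 20)           # arbitrary alternating split

print(ch_index(X, good) > ch_index(X, bad))  # → True
```

Note that the index is computed from the same observations used to form the partition, which is exactly why its sampling distribution is intractable and comparisons across candidate partitions remain heuristic rather than formal tests.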
Thus, the majority of existing methods for assessing the validity of the decomposition do not attempt to perform a formal statistical procedure, but rather look for a clustering structure under which a statistic of interest is optimal, e.g., one that maximizes or minimizes an objective function. Validation methods that do not suffer from this limitation were recently proposed, but they are computationally expensive since they require simulating multiple datasets from the null distribution.
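The cost of such null-simulation procedures can be illustrated with a generic Monte Carlo sketch (the helper names and the choice of a uniform "no structure" null over the bounding box are our own assumptions, not the specific methods cited above): the observed statistic must be recompared against the statistic recomputed on each of many simulated reference datasets.

```python
import numpy as np

def mean_nn_distance(X):
    """Mean distance from each point to its nearest neighbour; small
    values indicate clumping relative to spatially uniform data."""
    d = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

def null_pvalue(X, stat, n_sim=200, seed=1):
    """Monte Carlo test of 'no cluster structure': simulate reference
    datasets uniformly over the bounding box of X, recompute the
    statistic on each, and return the fraction of null statistics at
    least as extreme (here: as small) as the observed one."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    obs = stat(X)
    null = [stat(rng.uniform(lo, hi, X.shape)) for _ in range(n_sim)]
    return np.mean([s <= obs for s in null])

# Strongly clustered data should yield a small p-value under this null.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(5, 0.2, (20, 2))])
print(null_pvalue(X, mean_nn_distance) < 0.05)  # → True
```

The expense the text refers to is visible here: the statistic is evaluated `n_sim + 1` times, and in realistic settings each evaluation involves re-running the full clustering algorithm on a simulated dataset.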