This invention relates generally to data mining.
Data mining involves the statistical analysis of complex data. In one application, data mining technology may be utilized to cluster data into similar groups. Clustering of data is used in many areas, such as video, imaging and audio compression and scientific applications, among many others.
A data set may include a collection of data points, each of which has a set of features. For example, a data set may include a collection of “N” data points, each of which has “M” features. Supervised data contains labels or predictors, while unsupervised data lacks such labels or predictors. That is, a supervised data set contains, for each data point, a collection of features together with a label or predictor for those features. As an example, a supervised data set may include a collection of features about mushrooms, such as cap type, color, texture, and so on, and a label such as edible, poisonous, medicinal, and so on, or a predictor, such as a numerical value representing the toxicity of a mushroom. A related unsupervised data set may include the collection of features without the labels or predictors.
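By way of illustration only, the distinction between supervised and unsupervised data may be sketched as follows. The feature names and values are hypothetical and are chosen merely to mirror the mushroom example; they are not taken from any actual data set.

```python
# Hypothetical supervised data set: N = 3 data points, each with M = 3 features
# and one label. All values are invented for illustration.
supervised = [
    ({"cap_type": "convex", "color": "brown", "texture": "smooth"}, "edible"),
    ({"cap_type": "flat", "color": "red", "texture": "scaly"}, "poisonous"),
    ({"cap_type": "bell", "color": "white", "texture": "fibrous"}, "medicinal"),
]

# The related unsupervised data set keeps the features and drops the labels.
unsupervised = [features for features, _label in supervised]

n_points = len(unsupervised)       # N
n_features = len(unsupervised[0])  # M
```

A predictor could be represented the same way by replacing each label string with a numerical value.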
Hierarchical clustering techniques can be used to cluster data, particularly unsupervised data. Such techniques are usually performed as two-way merges (i.e., bottom-up) or as two-way splits (i.e., top-down) of a data set. Each merger or split represents a branching point. That is, each merger or split is a pair-wise clustering of data. While such techniques are used to cluster data, their strictly pair-wise branching does not reflect the natural structure of many data sets. Further, clustering typically requires pre-specification of parameters for the clustering, such as a desired number of clusters.
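By way of illustration only, a conventional bottom-up technique of the kind described above may be sketched as follows. The sketch assumes one-dimensional data points and a single-linkage distance (the smallest gap between any two members of two clusters); the function name agglomerate and the parameter num_clusters are hypothetical. It shows both limitations noted above: every branching point merges exactly two clusters, and the desired number of clusters must be specified in advance.

```python
def agglomerate(points, num_clusters):
    """Bottom-up clustering by repeated two-way merges (single linkage, 1-D)."""
    clusters = [[p] for p in points]  # start with one cluster per data point
    merges = []                       # record of each pair-wise branching point
    while len(clusters) > num_clusters:
        best = None
        # Find the two clusters whose closest members are nearest each other.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))  # always exactly two clusters
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# The number of clusters (3 here) must be chosen before clustering begins.
clusters, merges = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], num_clusters=3)
```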
Thus, a need exists to cluster data more efficiently.