Hierarchical clustering is an important tool for understanding the relationships (e.g., similarities and differences) between samples in a dataset, and is routinely used in the analysis of relatively small datasets (e.g., when the number of samples is less than 200). Hierarchical clustering organizes a set of samples into a hierarchy of clusters, based on the distances of the clusters from one another in the variable or measurement space. This hierarchy is represented in the form of a tree or dendrogram.
FIG. 1A illustrates a dataset 100 composed of six samples A-F, where each sample is characterized by two variables or dimensions X and Y. The samples A-F have been plotted in the two-dimensional variable space 105. In other words, the plotted position of each sample A-F within space 105 is representative of that sample's measured values for the variables X and Y. FIG. 1B illustrates a dendrogram 110 with the individual samples A-F at one end, such that each sample forms its own cluster (LEVEL 0), and a single cluster C5 containing every sample at the other end (LEVEL 5). Each successive level of dendrogram 110 illustrates the relative proximity of clusters formed from samples A-F within space 105, using Euclidean distance measured in the X-Y variable space. At LEVEL 0, each sample forms its own cluster; at LEVEL 1, the two closest samples are clustered together (i.e., samples B and C in cluster C1). Dendrogram 110 continues until all samples A-F are grouped into the single cluster C5.
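The level-by-level merging described for dendrogram 110 can be illustrated with a minimal sketch of agglomerative clustering. The coordinates below are hypothetical, since the values plotted in FIG. 1A are not given, and single linkage (the distance between the closest members of two clusters) is an assumption; the text above specifies only Euclidean distance. The names `samples`, `single_linkage`, and `agglomerate` are illustrative.

```python
import math
from itertools import combinations

# Hypothetical (X, Y) values for six samples; not taken from FIG. 1A.
samples = {
    "A": (1.0, 1.0),
    "B": (4.0, 4.0),
    "C": (4.5, 4.2),
    "D": (8.0, 1.5),
    "E": (8.5, 3.0),
    "F": (1.5, 6.0),
}

def single_linkage(c1, c2):
    """Smallest Euclidean distance between a member of c1 and a member of c2."""
    return min(math.dist(samples[a], samples[b]) for a in c1 for b in c2)

def agglomerate(names):
    """Return one merge per dendrogram level above LEVEL 0."""
    clusters = [frozenset([n]) for n in names]  # LEVEL 0: one cluster per sample
    merges = []
    while len(clusters) > 1:
        # Find the closest pair among the current clusters (assumed single linkage).
        c1, c2 = min(combinations(clusters, 2), key=lambda p: single_linkage(*p))
        dist = single_linkage(c1, c2)
        # Replace the two clusters with their union.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((sorted(c1), sorted(c2), dist))
    return merges

merges = agglomerate(sorted(samples))
for level, (left, right, dist) in enumerate(merges, start=1):
    print(f"LEVEL {level}: merge {left} + {right} at distance {dist:.2f}")
```

With six samples the loop performs five merges, one per level from LEVEL 1 to LEVEL 5. Note that this naive version recomputes every inter-cluster distance at each level, which is acceptable only for very small datasets.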
Hierarchical clustering, however, is typically not applied to hyperspectral images or other large data sets due to computational and computer storage limitations. Hyperspectral image sets are characterized by a large number of samples or pixels (for example, typically greater than 10,000) and a large number of variables or spectral channels (for example, greater than 100). Conventional hierarchical clustering techniques require the calculation and updating of a pairwise cluster dissimilarity matrix. The cluster dissimilarity matrix stores the distance between each pair of clusters in a data set, and can be used to facilitate hierarchical clustering.
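As a rough illustration of the pairwise dissimilarity matrix described above, the following sketch builds the initial matrix, in which every sample is its own cluster. The pixel values are hypothetical, three spectral channels are used only to keep the example short, and the function name `dissimilarity_matrix` is illustrative.

```python
import math

def dissimilarity_matrix(points):
    """Return the full n-by-n matrix of Euclidean distances between points."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            matrix[i][j] = d
            matrix[j][i] = d  # the matrix is symmetric
    return matrix

# Hypothetical pixels, each described by three spectral channels.
pixels = [
    (0.1, 0.4, 0.9),
    (0.2, 0.5, 0.8),
    (0.9, 0.1, 0.3),
    (0.8, 0.2, 0.2),
]
matrix = dissimilarity_matrix(pixels)
for row in matrix:
    print(" ".join(f"{d:.3f}" for d in row))
```

The number of entries grows with the square of the number of samples, and each entry costs work proportional to the number of channels, so both dimensions of a hyperspectral image set contribute to the cost.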
A problem arises, however, in calculating and storing the cluster dissimilarity matrix for a large data set. As a case in point, for a hyperspectral image set composed of 10,000 pixels, the corresponding cluster dissimilarity matrix would initially be of dimensions 10,000 by 10,000 (100 million entries), resulting in out-of-memory errors on a standard desktop computer. For data sets where the number of samples ranges from approximately 2,000 to 8,000, conventional hierarchical clustering techniques require anywhere from several hours to days to complete the desired dendrogram, due to the high computational overhead in calculating and updating the cluster dissimilarity matrix.
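The storage requirement can be estimated with simple arithmetic. The sketch below assumes 8 bytes per entry (double-precision floating point), which is an assumption rather than a figure from the text; the function name `dissimilarity_matrix_bytes` and the `symmetric_only` option are illustrative.

```python
def dissimilarity_matrix_bytes(n_samples, bytes_per_entry=8, symmetric_only=False):
    """Estimate the storage needed for an n-by-n dissimilarity matrix.

    bytes_per_entry=8 assumes double-precision values (an assumption).
    symmetric_only=True counts only the n*(n-1)/2 entries above the diagonal.
    """
    if symmetric_only:
        entries = n_samples * (n_samples - 1) // 2
    else:
        entries = n_samples * n_samples
    return entries * bytes_per_entry

for n in (200, 2_000, 8_000, 10_000):
    full = dissimilarity_matrix_bytes(n)
    half = dissimilarity_matrix_bytes(n, symmetric_only=True)
    print(f"{n:>6} samples: {full / 1e6:>8.1f} MB full, {half / 1e6:>8.1f} MB upper triangle")
```

Under these assumptions, a full matrix for 10,000 samples occupies 800 million bytes before any working copies are made, and storing only the upper triangle roughly halves that figure without changing the quadratic growth.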
It is desirable in view of the foregoing to provide for improvements in analysis of large data sets with hierarchical clustering.