The term “clustering” refers to the task of partitioning a collection of data items into different groups (referred to as “clusters”), such that the data items in each group share certain properties or characteristics that are less common, or absent, among the data items in other groups.
The clusters resulting from clustering a collection of data items (referred to as a “dataset”) should capture the natural structures present in the dataset, facilitating a better understanding of the data. Clustering is often challenging because datasets usually contain outliers and noise, which can be difficult to identify and remove.
Clustered data has various applications, such as image processing, pattern discovery and market research. Compared with manual sorting, clustering can reduce the labour and time required to sort or label a dataset.
The term “distance” refers to a measurable degree of dissimilarity between data items, such that data items having a small distance between one another have a high degree of similarity, and data items having a larger distance between one another have less similarity.
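As an illustration of the notion of distance described above, the following minimal sketch computes pairwise distances between data items. It assumes the data items are numeric coordinate tuples and uses the Euclidean distance, which is only one of many possible distance measures; the function name `pairwise_distances` and the sample items are hypothetical.

```python
import math

def pairwise_distances(items):
    """Return the symmetric matrix of Euclidean distances between items."""
    n = len(items)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(items[i], items[j])  # Euclidean distance (assumed measure)
            dist[i][j] = dist[j][i] = d
    return dist

# Hypothetical 2-D data items, chosen only for illustration.
items = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
dist = pairwise_distances(items)

# Items 0 and 1 are close (high similarity); items 0 and 2 are far apart.
print(dist[0][1], dist[0][2])
```

A matrix of this kind is the typical input to the distance-based clustering methods discussed in this section.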
A good clustering solution should provide robustness to both intra- and inter-class variations. That is, items which belong to the same known class should have small distances between one another and therefore be grouped in the same cluster, and items in different known classes should have larger distances between one another and as a result fall into different clusters.
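One simple way to check this property on labelled data is to compare the mean distance between items of the same class with the mean distance between items of different classes. The sketch below is illustrative only: it assumes Euclidean distance, assumes that at least one same-class pair and one different-class pair exist, and uses a hypothetical helper name and sample data.

```python
import math
from itertools import combinations

def intra_inter_distances(items, classes):
    """Return (mean intra-class distance, mean inter-class distance)."""
    intra, inter = [], []
    for (p, cp), (q, cq) in combinations(zip(items, classes), 2):
        # Same known class -> intra-class pair, otherwise inter-class pair.
        (intra if cp == cq else inter).append(math.dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical labelled items: two known classes, "a" and "b".
items = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
classes = ["a", "a", "b", "b"]
mean_intra, mean_inter = intra_inter_distances(items, classes)

# A distance measure suited to clustering should give mean_intra < mean_inter.
print(mean_intra, mean_inter)
```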
One type of cluster analysis is called “connectivity-based clustering”. Some methods of connectivity-based clustering take pairwise distances between data items as inputs and cluster the data generally according to the principle that items having a low distance between one another (i.e. high similarity) tend to be clustered together. One example of this type of clustering is referred to as “hierarchical clustering”, wherein different clusters are formed at various levels of distance values, resulting in a dendrogram representation of the data.
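The principle of hierarchical clustering can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes single linkage (the distance between two clusters is the smallest distance between any of their members), which is only one of several linkage criteria, and it stops merging at a hypothetical `threshold` distance. The recorded merges correspond to the levels of a dendrogram.

```python
def single_linkage(dist, threshold):
    """Agglomerative clustering on a pairwise distance matrix.

    Repeatedly merges the two closest clusters until the smallest
    inter-cluster distance exceeds `threshold`.
    Returns (clusters, merges); merges lists (distance, left, right).
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single linkage: closest pair of members across clusters.
                d = min(dist[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > threshold:
            break  # remaining clusters are too far apart to merge
        merges.append((d, sorted(clusters[x]), sorted(clusters[y])))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return sorted(sorted(c) for c in clusters), merges

# Hypothetical 1-D data items; distance is the absolute difference.
values = [0.0, 1.0, 10.0, 11.0, 50.0]
dist = [[abs(a - b) for b in values] for a in values]

clusters, merges = single_linkage(dist, threshold=2.0)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

Raising the threshold yields fewer, larger clusters, which corresponds to cutting the dendrogram at a higher distance level.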
Another clustering method is called “affinity propagation”, wherein message-passing inference is performed on pairwise similarity inputs (for example, negated distances). It is capable of selecting representative items (referred to as “exemplars”) from a dataset and of determining the number of clusters automatically, without that number having to be specified in advance.
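The message passing used in affinity propagation can be sketched as below. This is a simplified illustration under stated assumptions rather than a reference implementation: similarity is taken to be the negative squared Euclidean distance, the self-similarity ("preference") defaults to the median similarity, messages are damped by a fixed factor, and a fixed number of iterations is run without any convergence check. The function name, parameters and sample points are hypothetical.

```python
import statistics

def affinity_propagation(points, preference=None, damping=0.5, iterations=200):
    """Return, for each point, the index of the representative item it selects."""
    n = len(points)
    # Similarity: negative squared Euclidean distance (an assumed choice).
    sim = [[-sum((p - q) ** 2 for p, q in zip(points[i], points[k]))
            for k in range(n)] for i in range(n)]
    if preference is None:
        preference = statistics.median(
            sim[i][k] for i in range(n) for k in range(n) if i != k)
    for k in range(n):
        sim[k][k] = preference  # higher preference tends to give more clusters

    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iterations):
        # Responsibility: how well suited k is to represent i,
        # relative to the best competing candidate.
        for i in range(n):
            scores = [a[i][k] + sim[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(scores[j] for j in range(n) if j != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (sim[i][k] - best_other)
        # Availability: how much support k has gathered from other items.
        for k in range(n):
            positive = [max(0.0, r[i][k]) for i in range(n)]
            positive[k] = 0.0
            total = sum(positive)
            for i in range(n):
                if i == k:
                    new = total
                else:
                    new = min(0.0, r[k][k] + total - positive[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    # Each item selects the candidate with the largest combined message.
    return [max(range(n), key=lambda k: a[i][k] + r[i][k]) for i in range(n)]

# Hypothetical data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = affinity_propagation(points)
print(labels)
```

Note that the number of clusters is not passed in; it emerges from the messages and is influenced by the preference value. For instance, with a preference of zero (higher than every pairwise similarity here), each item selects itself as its own representative. Practical implementations typically add convergence checks and tie-breaking, which this sketch omits.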
Other clustering methods include centroid-based (e.g., K-means), distribution-based (e.g., Gaussian Mixture Models) and graph-based (e.g., Spectral Clustering) methods.