Clustering is a widely used method for grouping data entities into subsets, called clusters, such that the entities in each cluster are similar in some way. A powerful feature of clustering algorithms is that they can generate clusters without any pre-defined labels or categories, which makes them well suited to analyzing data with little or no a priori information. Unlike classification, in which categories with clear semantic meanings are pre-defined, clustering by definition works without such initial constraints on how data entities should be grouped. Users are only required to choose a distance function (e.g., Euclidean distance) that measures how similar two data items are in a feature space, along with other parameters such as the number of clusters or a maximum cluster diameter. The clustering algorithm then partitions the data automatically.
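By way of illustration only, the two user-supplied inputs described above, a distance function and a maximum cluster diameter, can be sketched in Python as follows. The greedy grouping rule, the function names `euclidean` and `cluster_by_diameter`, and the sample points are assumptions introduced for this example and do not correspond to any particular clustering algorithm referenced herein.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_by_diameter(points, max_diameter, distance=euclidean):
    """Greedy grouping (illustrative assumption, not a named algorithm).

    A point joins the first existing cluster in which it lies within
    max_diameter of every member; otherwise it starts a new cluster.
    No labels or categories are supplied at any stage.
    """
    clusters = []
    for p in points:
        for members in clusters:
            if all(distance(p, q) <= max_diameter for q in members):
                members.append(p)
                break
        else:
            # No existing cluster can accept p without exceeding the diameter.
            clusters.append([p])
    return clusters


# Hypothetical two-dimensional feature vectors.
points = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.0), (5.3, 4.8), (0.1, 0.4)]
clusters = cluster_by_diameter(points, max_diameter=1.0)
```

With a maximum diameter of 1.0, the five sample points fall into two groups. The outcome of this greedy rule depends on the order in which the points are visited, which is one reason practical algorithms use more elaborate procedures.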
While clustering is powerful, users often have difficulty understanding the semantic meaning of the resulting clusters and evaluating the quality of the results, especially for high-dimensional data. Several issues make understanding and evaluating clustering results difficult. First, for high-dimensional data, the entities that are grouped together are close in a high-dimensional feature space. However, their similarity may stem mainly from closeness on a subset of the dimensions rather than on all of them. Understanding these abstract relationships can be challenging. Moreover, a cluster may contain several sub-clusters that have different semantic meanings for users. This sub-cluster structure is usually hard to detect.
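The point that similarity may rest on only a subset of dimensions can be illustrated with a short Python sketch. The heuristic used here, the ratio of within-cluster variance to whole-dataset variance for each dimension, is an assumption chosen for this example rather than a technique described above; the function name `tightness_by_dimension` and the sample data are likewise hypothetical.

```python
from statistics import pvariance


def tightness_by_dimension(cluster, dataset):
    """Per-dimension ratio of within-cluster variance to dataset variance.

    A ratio near 0 suggests the cluster members are close on that
    dimension; a ratio near or above 1 suggests the dimension contributes
    little to their similarity. A dimension that is constant over the
    whole dataset is reported as 0.0.
    """
    ratios = []
    for d in range(len(dataset[0])):
        overall = pvariance([row[d] for row in dataset])
        within = pvariance([row[d] for row in cluster])
        ratios.append(within / overall if overall > 0 else 0.0)
    return ratios


# Hypothetical three-dimensional data. The first three rows form one
# cluster that is tight on dimensions 0 and 1 but spread out on dimension 2.
dataset = [
    (1.0, 2.0, 0.0),
    (1.1, 2.1, 5.0),
    (0.9, 1.9, 10.0),
    (5.0, 7.0, 1.0),
    (5.2, 7.1, 6.0),
    (4.8, 6.9, 9.0),
]
cluster = dataset[:3]
ratios = tightness_by_dimension(cluster, dataset)
```

In this sample, the ratios for the first two dimensions are close to zero while the ratio for the third exceeds one, so the members are similar on only two of the three dimensions. A table of such numbers becomes hard to read as the number of dimensions grows.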
Second, because clustering is an unsupervised learning process that uses no semantic knowledge or pre-defined categories, clustering algorithms often require users to supply parameters in advance. For example, users must provide the number of clusters (i.e., k) for the well-known K-means algorithm. However, it is challenging to select a proper k value for the underlying data. As a result, algorithms such as K-means might group together entities that are semantically different (when k is smaller than the true number of clusters) or separate entities that are semantically similar (when k is larger than the true number of clusters). Thus, users need some way to evaluate and refine the clustering results.
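The effect of choosing k too small or too large can be sketched as follows. This is a minimal pure-Python version of the standard K-means iteration under stated assumptions: the deterministic farthest-point initialization, the fixed iteration count, the helper names, and the nine sample points (three visibly separate groups) are all introduced for this example only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic seeding (assumption): start from the first point, then
    repeatedly add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iterations=20):
    """Standard K-means loop: assign to nearest center, then recompute means."""
    centers = farthest_point_init(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    labels = [min(range(k), key=lambda i: sq_dist(p, centers[i])) for p in points]
    return labels, centers


def within_cluster_ss(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))


# Hypothetical data: three well-separated groups of three points each.
points = [
    (0, 0), (0, 1), (1, 0),        # group A
    (10, 0), (10, 1), (11, 0),     # group B
    (0, 10), (0, 11), (1, 10),     # group C
]
results = {k: kmeans(points, k) for k in (2, 3, 4)}
wss = {k: within_cluster_ss(points, *results[k]) for k in results}
```

On this sample, k = 2 places groups A and C in the same cluster, k = 3 recovers the three groups, and k = 4 splits group B in two. The within-cluster sum of squares keeps decreasing as k grows, so that number alone does not indicate which k is appropriate.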
Information visualization can be of great value in addressing these issues. For example, techniques such as scatter plot matrices, parallel coordinates, and RadViz have been used to visually explain the results of clustering algorithms. Some of these techniques focus on revealing the multi-attribute values of clusters to help users understand their semantic meaning, while others provide visual cues for cluster quality. However, none of these techniques offers a complete solution for cluster interpretation, evaluation, and refinement.
A need therefore exists for a visualization technique that allows users to understand the semantic meaning of various clusters, evaluate their quality, compare different clusters, and refine clustering results as necessary. A further need exists for a visualization technique that can be embedded into various visual displays or presentations.