1. Field of Invention
Embodiments of the present invention relate generally to methods and systems adapted to cluster categorical data. More specifically, embodiments of the present invention relate to methods and systems adapted to identify an optimal cluster set in a hierarchy of clusters.
2. Discussion of the Related Art
Data is often organized in a clustering process by separating an arbitrary dataset into a plurality of subsets (i.e., clusters) according to some inherent similarity measure between data within the dataset, wherein data within a particular cluster is characterized by some common trait or attribute. Subsequently, category labels are generated using the clusters and a classifier for the dataset is constructed using the category labels. Clustering processes can be characterized according to the manner in which they form clusters. Two common clustering techniques include partitional and hierarchical techniques.
Partitional clustering techniques organize a dataset into a single collection of clusters that usually do not overlap, wherein data within each cluster is uniformly similar. Hierarchical clustering algorithms, on the other hand, create a hierarchy of clusters representing a range (e.g., from coarse to fine) of intra-cluster similarity. Hierarchical clustering algorithms are generally classified according to the manner in which they construct the cluster hierarchy. Agglomerative hierarchical clustering algorithms build the cluster hierarchy from the bottom up by progressively merging smaller clusters into larger clusters, whereas divisive hierarchical clustering algorithms build the hierarchy from the top down by progressively dividing larger clusters to form smaller clusters.
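By way of illustration only, the bottom-up construction described above can be sketched as follows. The sketch assumes one-dimensional numerical data and a single-linkage merge rule; the function name agglomerative_layers and the sample values are hypothetical and are not taken from any described embodiment.

```python
def agglomerative_layers(points):
    """Build a cluster hierarchy bottom-up using single-linkage merging.

    Returns a list of layers. Layer 0 holds one cluster per point and the
    final layer holds a single cluster containing every point.
    """
    clusters = [[p] for p in points]
    layers = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage (an assumption): the distance between two
                # clusters is the distance between their closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        # Replace the two closest clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        layers.append([list(c) for c in clusters])
    return layers


sample = [1.0, 1.2, 5.0, 5.1, 9.0]  # hypothetical data
layers = agglomerative_layers(sample)
for depth, layer in enumerate(layers):
    print(depth, layer)
```

Each successive layer contains one fewer cluster than the layer before it, so the layers taken together span the range from fine to coarse. A divisive algorithm would produce a comparable hierarchy by starting from the single all-inclusive cluster and splitting it.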
Many clustering algorithms work well when the dataset is numerical (i.e., when data within the dataset are all related by some inherent similarity metric or natural order). Numerical datasets often describe a single attribute or category. Categorical datasets, on the other hand, describe multiple attributes or categories that are often discrete and that lack a natural similarity measure between them. From the clustering perspective, this also implies that the centroid of a cluster in a categorical dataset is undefined. Therefore, categorical data is usually not effectively clustered using partitional clustering techniques. Hierarchical clustering is somewhat more effective than partitional clustering for such data, but its usefulness is limited to simple pattern-matching applications, and it does not provide meaningful numerical quantities from the categorical dataset.
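The difficulty with centroids can be made concrete with a short sketch. The records, attribute values, and helper names below (try_mean, attribute_modes, mismatches) are hypothetical. The per-attribute mode and the mismatch count shown at the end are common substitutes that must be imposed on the data; they are included only for contrast and are not asserted to be the measure used by any described embodiment.

```python
from collections import Counter

# A numerical cluster has a well-defined centroid: the arithmetic mean.
numeric_cluster = [2.0, 4.0, 6.0]
centroid = sum(numeric_cluster) / len(numeric_cluster)

# A categorical cluster (hypothetical records: color, body style, gearbox).
categorical_cluster = [
    ("red", "sedan", "manual"),
    ("red", "coupe", "automatic"),
    ("blue", "sedan", "automatic"),
]


def try_mean(records):
    """Attempt an arithmetic mean; return None when it is not defined."""
    try:
        return sum(records) / len(records)
    except TypeError:
        # Discrete labels cannot be added or divided.
        return None


def attribute_modes(records):
    """Most frequent value of each attribute (a substitute, not a mean)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def mismatches(a, b):
    """Number of attributes on which two records disagree."""
    return sum(x != y for x, y in zip(a, b))


print(centroid)
print(try_mean(categorical_cluster))
print(attribute_modes(categorical_cluster))
print(mismatches(categorical_cluster[0], categorical_cluster[1]))
```

In this sample, the per-attribute mode is a record that does not occur anywhere in the cluster, which illustrates why such a substitute cannot simply be treated as a centroid.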
Moreover, in many clustering applications, it is desirable to identify a specific layer within the cluster tree that best describes the underlying distribution of patterns within the dataset. However, it is often difficult to identify such an optimal layer, that is, a layer containing a unique cluster set with an optimal number of clusters. Further, it is known that different selection criteria converge to different values of model cardinality. Accordingly, it would be beneficial to provide a system and method capable of selecting a unique cluster set containing an optimal number of clusters.