1. Field of Invention
Embodiments of the present invention relate generally to methods and systems adapted to cluster categorical data. More specifically, embodiments of the present invention relate to methods and systems adapted to cluster categorical data using a seed based clustering technique.
2. Discussion of the Related Art
Data is often organized in a clustering process by separating an arbitrary dataset into a plurality of subsets (i.e., clusters) according to some inherent similarity measure between data within the dataset, wherein data within a particular cluster is characterized by some common trait or attribute. Subsequently, category labels are generated using the clusters and a classifier for the dataset is constructed using the category labels. Clustering processes can be characterized according to the manner in which they form clusters. Two common classes of clustering techniques are partitional techniques and hierarchical techniques.
Partitional clustering techniques organize a dataset into a single collection of clusters that usually do not overlap, wherein data within each cluster is uniformly similar. Unconstrained hierarchical clustering algorithms, on the other hand, create a hierarchy of clusters representing a range (e.g., from coarse to fine) of intra-cluster similarity. Such hierarchical clustering algorithms are generally classified according to the manner in which they construct the cluster hierarchy. Agglomerative hierarchical clustering algorithms build the cluster hierarchy from the bottom up by progressively merging smaller clusters into larger clusters, while divisive hierarchical clustering algorithms build the hierarchy from the top down by progressively dividing larger clusters to form smaller clusters.
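By way of illustration only, the bottom-up merging performed by an agglomerative hierarchical algorithm may be sketched as follows. The function name `agglomerate`, the use of one-dimensional numeric points, and the single-linkage (closest-pair) merge criterion are assumptions chosen for brevity; they are not drawn from any particular conventional system.

```python
def agglomerate(points, target_clusters):
    """Merge the two closest clusters until `target_clusters` remain.

    Hypothetical sketch: distance between clusters is the smallest
    absolute difference between any two members (single linkage).
    """
    clusters = [[p] for p in points]  # start with one cluster per point
    merges = []                       # record of the hierarchy, bottom up
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters, merges


clusters, merges = agglomerate([1, 2, 10, 11, 30], target_clusters=3)
print(clusters)
```

Each entry in `merges` records one level of the hierarchy; a divisive algorithm would produce a comparable hierarchy in the opposite order, starting from a single cluster and splitting it.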
Many clustering algorithms work well when the dataset is numerical (i.e., when data within the dataset are all related by some inherent similarity metric or natural order). Numerical datasets often describe a single attribute or category. Categorical datasets, on the other hand, describe multiple attributes or categories that are often discrete and lack a natural similarity measure between them. From the clustering perspective, this also implies that appropriate exemplar codebook vectors (i.e., seeds) are, at best, difficult to obtain. Therefore, categorical data is usually not effectively clustered using partitional clustering techniques. Conventional hierarchical clustering techniques do not require codebook vectors and are somewhat more effective than partitional clustering techniques, but their usefulness is limited to simple pattern-matching applications, and they do not provide meaningful numerical quantities from the categorical dataset. In some cases, however, a user of a hierarchical agglomerative clustering system may have some previous knowledge of a record and wish to retrieve records within the dataset that are similar to the previously known record.
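As a hedged illustration of the difficulty noted above, the following sketch imposes a similarity measure on categorical records, since none arises naturally, and uses it to retrieve the records most similar to a previously known record. The simple-matching (attribute overlap) measure, the function names `overlap_similarity` and `most_similar`, and the example attributes are all assumptions made for this sketch; the measure reflects only exact matches and is one of many that could be chosen.

```python
def overlap_similarity(a, b):
    """Fraction of attributes on which two categorical records agree.

    Hypothetical measure: records are dicts mapping attribute name to
    a discrete value; an attribute missing from either record counts
    as a mismatch.
    """
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    matches = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return matches / len(keys)


def most_similar(known, records, top_n=1):
    """Return the `top_n` records ranked by similarity to `known`."""
    ranked = sorted(records,
                    key=lambda r: overlap_similarity(known, r),
                    reverse=True)
    return ranked[:top_n]


records = [
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "red", "shape": "square", "size": "small"},
    {"color": "blue", "shape": "square", "size": "large"},
]
known = {"color": "red", "shape": "round", "size": "large"}
print(most_similar(known, records))
```

Note that such a measure treats "red" versus "blue" exactly as it treats any other mismatch, which is why it supports pattern matching but yields little meaningful numerical structure.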
Accordingly, it would be beneficial to provide a system and method capable of clustering a categorical dataset in a manner that can meaningfully and numerically quantify the dataset. Moreover, it would be beneficial to provide a system and method of merging data points/clusters of data points in such a manner as to exploit prior information known to the user (e.g., represented within a codebook).