1. Field of Invention
Embodiments of the present invention relate generally to methods and systems adapted to cluster categorical data. More specifically, embodiments of the present invention relate to methods and systems adapted to cluster categorical data using an order invariant clustering technique.
2. Discussion of the Related Art
Data is often organized in a clustering process by separating an arbitrary dataset into a plurality of subsets (i.e., clusters) according to some inherent similarity measure between data within the dataset, wherein data within a particular cluster is characterized by some common trait or attribute. Subsequently, category labels are generated using the clusters and a classifier for the dataset is constructed using the category labels. Clustering processes can be characterized according to the manner in which they form clusters. Two common clustering techniques include partitional and hierarchical techniques.
Partitional clustering techniques organize a dataset into a single collection of clusters that usually do not overlap, wherein data within each cluster is uniformly similar. Hierarchical clustering algorithms, on the other hand, create a hierarchy of clusters representing a range (e.g., from coarse to fine) of intra-cluster similarity. Hierarchical clustering algorithms are generally classified according to the manner in which they construct the cluster hierarchy. Thus, agglomerative hierarchical clustering algorithms build the cluster hierarchy from the bottom up by progressively merging smaller clusters into larger clusters while divisive hierarchical clustering algorithms build the hierarchy from the top down by progressively dividing larger clusters to form smaller clusters.
Generally, many clustering algorithms work well when the dataset is numerical (i.e., when data within the dataset are all related by some inherent similarity metric or natural order). Numerical datasets often describe a single attribute or category. Categorical datasets, on the other hand, describe multiple attributes or categories that are often discrete, lacking a natural similarity measure between them. From the clustering perspective, this also implies that the centroid of a cluster in a categorical dataset is an undefinable entity. Therefore, categorical data is usually not effectively clustered using partitional clustering techniques. Hierarchical clustering techniques are somewhat more effective than partitional clustering techniques, but they are useful only for simple pattern-matching applications and do not yield meaningful numerical quantities from the categorical dataset.
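By way of illustration only, the difficulty can be sketched in Python as follows. The attribute values, the function names `mismatch_distance` and `attribute_modes`, and the choice of a simple mismatch count as a stand-in dissimilarity are all assumptions made for this example and are not taken from any particular prior-art system. The sketch shows that an arithmetic mean (i.e., a centroid) cannot be formed from discrete labels, so that a substitute such as the per-attribute mode must be assumed instead.

```python
from collections import Counter


def mismatch_distance(a, b):
    """Count the attributes on which two categorical records differ.

    This is an assumed stand-in dissimilarity; categorical data has no
    natural similarity measure of its own.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


def attribute_modes(records):
    """Return the most frequent value of each attribute.

    A mean of labels such as "red" and "blue" is undefined, so the
    per-attribute mode is used here as a hypothetical substitute
    for a centroid.
    """
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


# Hypothetical records: (color, shape, size)
records = [
    ("red", "round", "small"),
    ("red", "square", "small"),
    ("blue", "round", "large"),
]

d01 = mismatch_distance(records[0], records[1])  # differ on shape only
d02 = mismatch_distance(records[0], records[2])  # differ on color and size
representative = attribute_modes(records)
```

The mismatch count merely tallies disagreements and treats every pair of distinct labels as equally dissimilar, which is one reason such a measure may not yield meaningful numerical quantities from a categorical dataset.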
Moreover, many agglomerative hierarchical clustering techniques merge data points (or clusters of data points) together according to some predefined rule of convergence until all data points (or clusters of data points) are merged into a single cluster. For example, many agglomerative hierarchical clustering algorithms take a conservative approach to merging data points/clusters of data points in that only one pair of data points/clusters of data points is merged into a single cluster (or only a few pairs of data points/clusters of data points are merged into a few clusters) in a single cycle. Such conservative logic can be summarized as follows: 1) select all pairs of data points that are separated by the minimum distance; 2) if the number of selected data point-pairs is 1, merge the pair of data points into a single cluster; 3) if the number of selected data point-pairs is greater than 1, then select the very first pair of data points and merge that pair of data points into a single cluster; and 4) map the merged cluster into a new layer and isomorphically map all other data points/clusters of data points into the new layer.
As shown above, the crux of the conservative logic lies in step 3, the outcome of which is heavily dependent upon the order in which the data points are initially received into the system. As a result, it is possible that the same data points/clusters of data points will be merged together in different cycles for differently permuted datasets. The effects of the conservative logic are particularly strong in the early phase of the clustering process. From the perspective of the end-user, variation in the order in which the same data points/clusters of data points are merged across different browsing sessions can be confusing.
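The order dependence described above can be illustrated with a minimal Python sketch of a single cycle of the conservative logic. The function name `conservative_merge_cycle`, the mismatch-count distance, and the three sample data points are assumptions made for this example only; the sketch operates on individual data points and is not a complete clustering implementation.

```python
from itertools import combinations


def mismatch_distance(a, b):
    """Assumed distance: number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def conservative_merge_cycle(points, distance):
    """Perform one cycle of the conservative merge logic on a list of points.

    Returns the merged pair and the new layer.
    """
    index_pairs = list(combinations(range(len(points)), 2))
    d_min = min(distance(points[i], points[j]) for i, j in index_pairs)

    # Step 1: select every pair separated by the minimum distance.
    selected = [
        (i, j) for i, j in index_pairs
        if distance(points[i], points[j]) == d_min
    ]

    # Steps 2 and 3: merge the only pair, or the very first pair on a tie.
    # "First" is determined purely by the order of the input list.
    i, j = selected[0]
    merged = (points[i], points[j])

    # Step 4: carry the merged cluster and all other points into a new layer.
    new_layer = [merged] + [p for k, p in enumerate(points) if k not in (i, j)]
    return merged, new_layer


# Hypothetical data points: (color, shape)
a = ("red", "round")
b = ("red", "square")
c = ("blue", "square")

# The pairs (a, b) and (b, c) are tied at the minimum distance of 1.
merged_abc, _ = conservative_merge_cycle([a, b, c], mismatch_distance)
merged_cba, _ = conservative_merge_cycle([c, b, a], mismatch_distance)
```

In this sketch, the dataset received as a, b, c merges a with b in the first cycle, whereas the same dataset received as c, b, a merges c with b, even though the underlying data points and distances are identical.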
Accordingly, it would be beneficial to provide a system and method capable of clustering a categorical dataset in a manner that can meaningfully and numerically quantify the dataset. Moreover, it would be beneficial to provide a system and method of merging data points/clusters of data points in a manner that does not depend on the order in which the data points are received.