1. Field of Invention
Embodiments of the present invention relate generally to methods and systems adapted to cluster categorical data. More specifically, embodiments of the present invention relate to methods and systems adapted to cluster categorical data using subspace bounded recursive clustering.
2. Discussion of the Related Art
Data is often organized in a clustering process by separating an arbitrary dataset into a plurality of subsets (i.e., clusters) according to some inherent similarity measure between data within the dataset, wherein data within a particular cluster is characterized by some common trait or attribute. Subsequently, category labels are generated using the clusters and a classifier for the dataset is constructed using the category labels. Clustering processes can be characterized according to the manner in which they form clusters. Two common clustering techniques include partitional and hierarchical techniques.
Partitional clustering techniques organize a dataset into a single collection of clusters that usually do not overlap, wherein data within each cluster is uniformly similar. Hierarchical clustering algorithms, on the other hand, create a hierarchy of clusters representing a range (e.g., from coarse to fine) of intra-cluster similarity. Hierarchical clustering algorithms are generally classified according to the manner in which they construct the cluster hierarchy. Thus, agglomerative hierarchical clustering algorithms build the cluster hierarchy from the bottom up by progressively merging smaller clusters into larger clusters while divisive hierarchical clustering algorithms build the hierarchy from the top down by progressively dividing larger clusters to form smaller clusters.
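By way of illustration only, the bottom-up merging performed by an agglomerative hierarchical clustering algorithm may be sketched as follows. The sketch assumes one-dimensional numerical data, absolute difference as the distance measure, and single-linkage merging; the function name `agglomerate` and the sample values are hypothetical and are not taken from any particular implementation.

```python
def agglomerate(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain.

    Assumptions (illustrative only): 1-D numerical points, absolute
    difference as distance, single-linkage cluster distance.
    """
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > target_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters


# Hypothetical sample data: two tight pairs and one outlying point.
result = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], 3)
print(result)
```

Stopping the merging at different cluster counts yields the coarse-to-fine range of clusters described above; a divisive algorithm would traverse the same range in the opposite direction, starting from a single cluster and splitting.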
Generally, clustering algorithms work well when the dataset is numerical (i.e., when data within the dataset are all related by some inherent similarity metric or natural order). Numerical datasets often describe a single attribute or category. Categorical datasets, on the other hand, describe multiple attributes or categories that are often discrete, lacking a natural similarity measure between them. From the clustering perspective, this also implies that the centroid of a cluster in a categorical dataset is undefinable. Therefore, categorical data is usually not effectively clustered using partitional clustering techniques. Hierarchical clustering is somewhat more effective than partitional clustering, but its usefulness is limited to simple pattern-matching applications due to the inherent sparsity of categorical datasets. Moreover, because categorical datasets are often highly sparse, measures of intra-cluster similarity are often negligible, as intra-cluster dissimilarity is significantly more predominant, thereby preventing hierarchical clustering algorithms from deriving meaningful numerical quantities from the categorical dataset.
For example, one type of categorical data (e.g., electronic program guide (EPG) data) contains an attribute (e.g., a descriptor field) that contains text from an unrestricted vocabulary. If text from this attribute is used in projecting the data onto a vector space, then the vector space can quickly become high-dimensional (e.g., with O(1000) features) and sparse, in that vectors within the dataset typically have more than 99% of their components equal to zero. For example, a typical EPG dataset may include 2,154 records, wherein the descriptor fields of the records collectively contain 2,694 unique terms. The average number of terms per record is 4.3, but this average is skewed upwards by a small number of records (e.g., 2%) having a large number (e.g., 30 or more) of terms (i.e., nonzero features in the term vector). 56% of the records have 3 or fewer terms, resulting in a sparsity of at least 1−3/2694≈99.9% for those records. 76% of the records have 5 or fewer terms, giving a sparsity of at least 1−5/2694>99.8%.
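The sparsity figures in the EPG example can be checked with a short calculation. The sketch below assumes that sparsity is taken as the fraction of zero components in a record's term vector (i.e., one minus the ratio of nonzero terms to vocabulary size); the function name `sparsity` is hypothetical, and the vocabulary size is the 2,694 unique terms stated in the example.

```python
def sparsity(nonzero_terms: int, vocabulary_size: int) -> float:
    """Fraction of zero components in a term vector (assumed definition)."""
    return 1.0 - nonzero_terms / vocabulary_size


VOCABULARY_SIZE = 2694  # unique terms across all descriptor fields (from the example)

# A record with at most 3 (or 5) terms has at least this sparsity.
for terms in (3, 5):
    value = sparsity(terms, VOCABULARY_SIZE)
    print(f"{terms} nonzero terms: sparsity of at least {value:.2%}")
```

Under this assumed definition, both the 3-term and the 5-term cases come out above 99.8%, consistent with the observation that more than 99% of the vector components are zero.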
Accordingly, it would be beneficial to organize categorical datasets according to a process that: 1) reduces the degree of discreteness between attributes or categories; and 2) reduces the sparsity of the dataset that is ultimately organized.