1. Field of Invention
Embodiments of the present invention relate generally to methods and systems adapted to cluster categorical data. More specifically, embodiments of the present invention relate to methods and systems adapted to augment a categorical dataset by imputation.
2. Discussion of the Related Art
Data is often organized in a clustering process by separating an arbitrary dataset into a plurality of subsets (i.e., clusters) according to some inherent similarity measure between data within the dataset, wherein data within a particular cluster is characterized by some common trait or attribute. Subsequently, category labels are generated using the clusters and a classifier for the dataset is constructed using the category labels. Clustering processes can be characterized according to the manner in which they form clusters. Two common clustering techniques include partitional and hierarchical techniques.
Partitional clustering techniques organize a dataset into a single collection of clusters that usually do not overlap, wherein data within each cluster is uniformly similar. Hierarchical clustering algorithms, on the other hand, create a hierarchy of clusters representing a range (e.g., from coarse to fine) of intra-cluster similarity. Hierarchical clustering algorithms are generally classified according to the manner in which they construct the cluster hierarchy. Thus, agglomerative hierarchical clustering algorithms build the cluster hierarchy from the bottom up by progressively merging smaller clusters into larger clusters while divisive hierarchical clustering algorithms build the hierarchy from the top down by progressively dividing larger clusters to form smaller clusters.
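By way of illustration only, the bottom-up merging performed by an agglomerative hierarchical clustering algorithm may be sketched as follows. The sketch assumes that each record is represented as a set of terms, that similarity between records is measured by the Jaccard coefficient, and that similarity between clusters is the highest similarity between any pair of their members (single linkage). These choices, and the names `jaccard`, `cluster_similarity`, and `agglomerate`, are assumptions made for this example and are not prescribed by the techniques discussed above.

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two term sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_similarity(c1: tuple, c2: tuple, records: list) -> float:
    """Single-linkage similarity: the best similarity over all member pairs."""
    return max(jaccard(records[i], records[j]) for i in c1 for j in c2)


def agglomerate(records: list) -> list:
    """Merge the two most similar clusters until one cluster remains.

    Returns the merged clusters in the order in which they were formed,
    i.e., the hierarchy from the bottom (fine) to the top (coarse).
    """
    # Start with every record in its own singleton cluster.
    clusters = [(i,) for i in range(len(records))]
    merges = []
    while len(clusters) > 1:
        # Select the pair of clusters having the highest similarity.
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(pair[0], pair[1], records),
        )
        clusters.remove(a)
        clusters.remove(b)
        merged = tuple(sorted(a + b))
        clusters.append(merged)
        merges.append(merged)
    return merges


if __name__ == "__main__":
    example = [
        frozenset({"news", "local"}),
        frozenset({"news", "local", "weather"}),
        frozenset({"cartoon", "kids"}),
    ]
    for step, cluster in enumerate(agglomerate(example), start=1):
        print(f"merge {step}: records {cluster}")
```

A divisive algorithm would proceed in the opposite direction, starting from a single cluster containing every record and progressively splitting it.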
Generally, clustering algorithms work well when the dataset is numerical (i.e., when data within the dataset are all related by some inherent similarity metric or natural order). Numerical datasets often describe a single attribute or category. Categorical datasets, on the other hand, describe multiple attributes or categories that are often discrete, lacking a natural similarity measure between them. From the clustering perspective this also implies that the centroid of a cluster in a categorical dataset is an undefinable entity. Therefore, categorical data is usually not effectively clustered using partitional clustering techniques.
It has been repeatedly observed that dissimilarity or keyword-wise mismatch among records is as useful as their similarity or keyword-wise match. In information-theoretic calculations, similarity produces an “attractive force” and dissimilarity generates a “repulsive force”—both of which are equally required to generate a clustering that accurately represents the underlying pattern in the dataset. In hierarchical agglomerative clustering, when intra-cluster similarity is predominant, clusters having highest similarity are merged. This behavior is in perfect alignment with the intuition of the user. However, when intra-cluster dissimilarity is predominant, clusters having lowest dissimilarity are merged. This behavior is somewhat counter-intuitive because users tend to look for similarity and ignore dissimilarity.
The problem outlined above is especially noticeable with respect to highly sparse categorical data, where dissimilarity is predominant most of the time. For example, one type of categorical data (e.g., electronic program guide (EPG) data) contains an attribute (e.g., a descriptor field) that contains text from an unrestricted vocabulary. If text from this attribute is used in projecting the data onto a vector space, then the vector space can quickly become high-dimensional (e.g., with O(1000) features) and sparse, in that vectors within the dataset typically have more than 99% of their components equal to zero. For example, a typical EPG dataset may include 2,154 records, wherein the descriptor fields of the records collectively contain 2,694 unique terms. The average number of terms per record is 4.3. But this average is skewed upwards by a small number of records (e.g., 2%) having a large number (e.g., 30 or more) of terms (i.e., nonzero features in the term vector). 56% of the records have 3 or fewer terms, resulting in a sparsity of at least 1 − 3/2694 ≈ 99.9% for those records. 76% of the records have 5 or fewer terms, giving a sparsity of at least 1 − 5/2694 > 99.8%.
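The sparsity bounds for the exemplary EPG dataset follow from subtracting the fraction of nonzero components of a term vector from one. A minimal sketch of this arithmetic is given below; the function name `sparsity` and the constant `VOCABULARY_SIZE` are hypothetical names introduced only for this illustration, and the vocabulary size of 2,694 is taken from the example above.

```python
def sparsity(nonzero_terms: int, vocabulary_size: int) -> float:
    """Return the fraction of components of a term vector that are zero."""
    if vocabulary_size <= 0:
        raise ValueError("vocabulary_size must be positive")
    if not 0 <= nonzero_terms <= vocabulary_size:
        raise ValueError("nonzero_terms must lie between 0 and vocabulary_size")
    return 1.0 - nonzero_terms / vocabulary_size


# Number of unique terms in the exemplary EPG dataset.
VOCABULARY_SIZE = 2694

if __name__ == "__main__":
    # A record with at most 3 (or 5) terms has at least this sparsity.
    for terms in (3, 5):
        bound = sparsity(terms, VOCABULARY_SIZE)
        print(f"records with <= {terms} terms: sparsity >= {bound:.2%}")
```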
Considering both similarity and dissimilarity simultaneously across records in a categorical dataset, while at the same time producing a clustering that is in alignment with a user's similarity-biased intuition, has conventionally been viewed as imposing two apparently contradictory requirements. A clustering procedure that satisfied both requirements would nevertheless be useful. Accordingly, it would be beneficial to reduce the inherent sparsity of categorical datasets while increasing the overall quality of the categorical dataset to aid hierarchical agglomerative clustering processes in creating high-quality clustering solutions that are in alignment with a user's similarity-biased intuition.