Clustering divides data into groups of similar objects so that the diversity within each cluster is minimized. For numeric data, such a grouping can be obtained by computing the distance or similarity between objects (points or vectors) and/or clusters. Categorical (or nominal) data, however, demands special treatment because the distance between objects is not always computable.
A common approach to clustering categorical data is to assume that there is no similarity between the features of an object and that all the features of an object are equally important. This assumption allows the entropy of the feature proportions in each cluster to be used as a criterion to segregate objects based on feature counts.
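To make the entropy criterion concrete, the following is a minimal sketch rather than the exact criterion of any particular method. It assumes each object is a list of feature labels, that feature counts are pooled within a cluster, and that clusters are combined by a size-weighted sum of Shannon entropies; the names `cluster_entropy` and `weighted_entropy` are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(objects):
    """Shannon entropy (in bits) of the pooled feature proportions in one cluster.

    Assumption: every feature is treated as equally important and as
    completely dissimilar from every other feature.
    """
    counts = Counter(feature for obj in objects for feature in obj)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def weighted_entropy(clusters):
    """Size-weighted sum of per-cluster entropies (lower means purer clusters).

    Assumption: cluster size is measured by its total feature count.
    """
    sizes = [sum(len(obj) for obj in cluster) for cluster in clusters]
    total = sum(sizes)
    if total == 0:
        return 0.0
    return sum(s / total * cluster_entropy(c) for s, c in zip(sizes, clusters))


# Two candidate partitions of the same four objects.
pure = [[["a", "a"], ["a"]], [["b"], ["b", "b"]]]
mixed = [[["a", "b"], ["a"]], [["b"], ["a", "b"]]]

pure_score = weighted_entropy(pure)
mixed_score = weighted_entropy(mixed)
```

Under these assumptions, the partition that keeps `a` and `b` apart scores lower than the one that mixes them, which is the sense in which entropy can segregate objects by feature counts.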
In practice, however, features of objects are often similar to one another. Existing methods deal with this by explicitly merging similar features, but merging introduces problems of its own, such as the need for relatively arbitrary thresholds to decide which features are similar enough to be treated as identical. Merging can also produce a chaining effect, in which a pair of dissimilar features, say features 1 and 3, ends up merged as a result of merging features 1 and 2 and features 2 and 3.
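The chaining effect can be illustrated with a small sketch. It assumes a hypothetical pairwise similarity table and a simple rule that merges any pair whose similarity meets a threshold (implemented here with a union-find structure); the function `merge_features`, the similarity values, and the thresholds are illustrative only and are not taken from any specific existing method.

```python
def merge_features(similarity, threshold):
    """Merge every feature pair whose similarity is at least `threshold`.

    Merges are transitive, so two features can end up in the same group
    without ever being compared directly against the threshold.
    """
    parent = {}

    def find(x):
        # Register unseen features, then follow parents with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in similarity.items():
        root_a, root_b = find(a), find(b)
        if score >= threshold and root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for feature in parent:
        groups.setdefault(find(feature), set()).add(feature)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical similarities: 1~2 and 2~3 are similar, 1 and 3 are not.
similarity = {(1, 2): 0.8, (2, 3): 0.8, (1, 3): 0.2}

chained = merge_features(similarity, threshold=0.7)
separate = merge_features(similarity, threshold=0.9)
```

With the threshold at 0.7, features 1 and 3 land in the same group despite a direct similarity of only 0.2; raising the threshold to 0.9 prevents every merge, which shows how sensitive the outcome is to an essentially arbitrary cut-off.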