Clustering is the process of grouping objects so that objects in the same cluster are similar and objects in different clusters are dissimilar. Community discovery is the process of identifying groups of similar objects. These related processes have been studied in numerous areas of application, including the finding of product communities, data mining, pattern recognition, and machine learning. By and large, research in these areas has focused on partitioning objects so that some clustering measure, such as the k-median measure (where the objective is to minimize the average distance from a point to its nearest center), is optimized. It should be appreciated that the objective of the conjunctive clustering process is to find a predetermined number of clusters of at least a minimum size that do not overlap by more than a predetermined amount.
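By way of non-limiting illustration, the k-median measure mentioned above may be sketched as follows. The sketch assumes that the objects are points in Euclidean space and that a set of candidate centers is already given; the function name k_median_cost and the sample data are hypothetical and are not taken from any particular conventional system.

```python
import math

def k_median_cost(points, centers):
    """Average Euclidean distance from each point to its nearest center."""
    total = 0.0
    for p in points:
        # Distance from p to whichever center is closest.
        total += min(math.dist(p, c) for c in centers)
    return total / len(points)

# Hypothetical sample data: two tight groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]
one_center = [(5.0, 1.0)]

cost_two = k_median_cost(points, two_centers)
cost_one = k_median_cost(points, one_center)
print(cost_two, cost_one)
```

In this example every point lies at distance 1 from its nearest center when two centers are used, so the measure is lower than when a single center must serve all four points.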
In many applications, clustering may be needed to obtain meaningful conjunctive descriptions of groups of objects. For example, in a customer segmentation application, the objective is to identify clusters of customers that have similar buying behavior. The cluster descriptions that emerge may then be used to facilitate operations such as target marketing. Similarly, in a text clustering application, the objective is to find descriptions of groups of documents that contain similar content. The words that are typical of a cluster may then be used to describe that cluster. For instance, a cluster of documents that discuss how to print in landscape mode may be described by the conjunction of keywords “laserjet and ‘landscape mode’ and printing”. It should be appreciated that such clusters may be used as a basis for constructing a topic hierarchy.
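By way of non-limiting illustration, a conjunctive description of the kind discussed above may be sketched as a set of keywords, all of which must be present in a document for that document to belong to the described cluster. The sketch assumes each document has already been reduced to a set of terms; the document identifiers and their terms are hypothetical.

```python
def matches(conjunction, doc_terms):
    """True when every keyword in the conjunction appears in the document."""
    return conjunction <= doc_terms  # subset test on sets

# Hypothetical documents, each reduced to a set of terms.
docs = {
    "d1": {"laserjet", "landscape mode", "printing", "driver"},
    "d2": {"laserjet", "landscape mode", "printing"},
    "d3": {"laserjet", "toner"},
    "d4": {"printing", "landscape mode"},
}

# The conjunction "laserjet and 'landscape mode' and printing".
conjunction = frozenset({"laserjet", "landscape mode", "printing"})

cluster = sorted(name for name, terms in docs.items()
                 if matches(conjunction, terms))
print(cluster)  # ['d1', 'd2']
```

Documents d3 and d4 each lack at least one keyword of the conjunction and therefore fall outside the described cluster.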
Conventionally, clusters may be found in a two-step process that involves: (1) clustering (e.g., the grouping of objects) and (2) obtaining descriptions of the clusters. It should be appreciated that step (1) may be effectuated by optimizing the k-median or the k-center quality measure of the objects that are grouped. Step (2) may be effectuated by assigning the same class label to points in the same cluster and by performing operations that generate descriptions which separate the clusters.
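By way of non-limiting illustration, the two-step process described above may be sketched as follows. The sketch makes several assumptions that are not drawn from any particular conventional system: objects are vectors of 0/1 attributes, distance is the Hamming distance, step (1) uses a greedy farthest-first selection of centers (a well-known heuristic for the k-center measure, used here as a stand-in for an optimizer), and step (2) describes each labeled cluster by the conjunction of attributes shared by all of its members. The attribute names and the data are hypothetical.

```python
def hamming(a, b):
    """Number of attribute positions in which two 0/1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def farthest_first_centers(points, k):
    """Step (1a): greedily pick k centers, each far from those already chosen."""
    centers = [points[0]]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(hamming(p, c) for c in centers))
        centers.append(nxt)
    return centers

def assign_labels(points, centers):
    """Step (1b): give each point the class label of its nearest center."""
    return [min(range(len(centers)), key=lambda i: hamming(p, centers[i]))
            for p in points]

def shared_attributes(points, labels, label, names):
    """Step (2): conjunction of attributes set to 1 in every cluster member."""
    members = [p for p, l in zip(points, labels) if l == label]
    return [n for j, n in enumerate(names) if all(m[j] == 1 for m in members)]

# Hypothetical customer data: one row per customer, one column per attribute.
names = ["buys_toner", "buys_paper", "buys_games", "buys_consoles"]
points = [
    (1, 1, 0, 0),
    (1, 1, 0, 0),
    (1, 0, 0, 0),
    (0, 0, 1, 1),
    (0, 1, 1, 1),
]

centers = farthest_first_centers(points, 2)
labels = assign_labels(points, centers)
descriptions = [shared_attributes(points, labels, c, names) for c in range(2)]
print(labels)        # [0, 0, 0, 1, 1]
print(descriptions)  # [['buys_toner'], ['buys_games', 'buys_consoles']]
```

Because the grouping in step (1) is performed without regard to the descriptions produced in step (2), nothing in this sketch guarantees that a short conjunction will fit the clusters that are found.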
Conventionally, clustering methodologies try to optimize a cost function, while learning methodologies seek to find (possibly complex) descriptions that best fit each cluster. Therefore, a common byproduct of such processes is that the resulting descriptions may be difficult to understand. In addition, if the learning method is required to output conjunctions, the conjunctions may serve as poor descriptions of the clusters (since the clusters found are inherently more complex). Consequently, by performing the aforementioned steps (1) and (2) separately, as some conventional methodologies do, one may sacrifice the descriptive quality of the final clusters (either because the descriptions are too complicated to understand, or because they are too simple to describe the clusters accurately).
It should be appreciated that, in the process of finding clusters, conventional methodologies may generate both a collection of points and a corresponding vector representation of these points in Euclidean space. Such methodologies characterize attributes in order to generate the vectors. From the generated vectors, a collection of points emerges which may be subjected to a clustering algorithm to find the k-median clusters.
A drawback of this approach is that it may not be meaningful to represent given data in Euclidean space (e.g., an attempted vector representation of a machine). In addition, it may be problematic to define the distance between two points, since there must actually be a way of computing that distance. Moreover, in utilizing such conventional methodologies, a notion of distance is needed in order for an appropriate minimization function to be formulated. Another drawback of conventional methodologies is that each point is assigned to exactly one cluster and all of the points found may be required to be clustered (such methodologies may not allow the exclusion of some points in the clustering process).
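By way of non-limiting illustration, the last drawback noted above may be sketched as follows. The sketch assumes a nearest-center assignment in Euclidean space; the function name nearest_center_assignment and the sample data, including the outlying point, are hypothetical.

```python
import math

def nearest_center_assignment(points, centers):
    """Assign every point to exactly one cluster: that of its nearest center."""
    return [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points]

# Hypothetical data: two tight groups plus one point far from both.
points = [(0, 0), (1, 0), (10, 0), (11, 0), (500, 500)]
centers = [(0.5, 0), (10.5, 0)]

labels = nearest_center_assignment(points, centers)
print(labels)  # [0, 0, 1, 1, 1]
```

The outlying point (500, 500) receives a cluster label even though it is far from both groups, because an assignment of this kind provides no way to leave a point unclustered or to place a point in more than one cluster.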