The problem of data mining or knowledge discovery has become increasingly important in recent years. There is an enormous wealth of information embedded in large corporate databases and data warehouses maintained by retailers, telecom service providers and credit card companies that contain information related to customer purchases and customer calls. Corporations could benefit immensely in the areas of marketing, advertising and sales if interesting and previously unknown customer buying and calling patterns could be discovered from the large volumes of data.
Clustering is a useful technique for grouping data points such that points within a single group/cluster have similar characteristics, while points in different groups are dissimilar. For example, consider a market basket database containing one transaction per customer with each transaction containing the set of items purchased by the customer. The transaction data can be used to cluster the customers such that customers with similar buying patterns are in a single cluster.
For example, one cluster may consist of predominantly married customers with infants that buy diapers, baby food, toys, etc. (in addition to necessities like milk, sugar, butter, etc.), while another may consist of high-income customers that buy imported products like French and Italian wine, Swiss cheese and Belgian chocolate. The clusters can then be used to characterize the different customer groups, and these characterizations can be used in targeted marketing and advertising such that specific products are directed toward specific customer groups.
The characterization can also be used to predict buying patterns of new customers based on their profiles. For example, it may be possible to conclude that high-income customers buy imported foods, and then mail customized catalogs for imported foods to these high-income customers.
The above market basket database, containing customer transactions, is an example of data points with attributes that are non-numeric. Transactions in the database can be viewed as records with boolean attributes, each attribute corresponding to a single item. Further, in a record for a transaction, the attribute corresponding to an item is "true" if and only if the transaction contains the item; otherwise, it is "false." Boolean attributes themselves are a special case of categorical attributes.
The domain of categorical attributes is not limited to simply true and false values, but could be any arbitrary finite set of values. An example of a categorical attribute is color whose domain includes values such as brown, black, white, etc. A proper method of clustering in the presence of such categorical attributes is desired.
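The boolean view of a transaction described above can be sketched in a few lines. This is only an illustration: the item catalogue, the sample transaction and the helper name `to_boolean_record` are assumptions introduced here and are not part of the original description.

```python
def to_boolean_record(transaction, items):
    """Return one boolean attribute per item: True iff the transaction contains it."""
    return [item in transaction for item in items]

# Hypothetical item catalogue and transaction, for illustration only.
items = ["milk", "sugar", "butter", "diapers", "wine"]
transaction = {"milk", "diapers"}

record = to_boolean_record(transaction, items)
print(record)  # [True, False, False, True, False]
```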
The problem of clustering can be defined as follows: given n data points in a d-dimensional space, partition the data points into k clusters such that the data points in a cluster are more similar to each other than data points in different clusters. Existing clustering methods can be broadly classified into partitional and hierarchical methods. Partitional clustering attempts to determine k partitions that optimize a certain criterion function. The most commonly used criterion function is:
$$E = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, m_i)$$
In the above equation, $m_i$ is the centroid of cluster $C_i$ while $d(x, m_i)$ is the Euclidean distance between $x$ and $m_i$. The Euclidean distance between two points $(x_1, x_2, \ldots, x_d)$ and $(y_1, y_2, \ldots, y_d)$ is $\left(\sum_{i=1}^{d} (x_i - y_i)^2\right)^{1/2}$. Thus, minimizing the criterion function E minimizes the distance of every point from the mean of the cluster to which the point belongs. A common approach is to minimize the criterion function using an iterative, hill-climbing technique. That is, starting with an initial set of k partitions, data points are moved from one cluster to another to improve the value of the criterion function.
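As a rough sketch of the criterion function, the following code assumes that E is the sum, over all clusters, of the Euclidean distances from each point to the centroid of its cluster. The two-dimensional points and the function names are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(p[j] for p in cluster) / n for j in range(len(cluster[0])))

def criterion(clusters):
    """Assumed form of E: sum of distances from each point to its cluster centroid."""
    return sum(euclidean(x, centroid(c)) for c in clusters for x in c)

# Hypothetical 2-D points forming two well-separated groups.
left = [(0.0, 0.0), (0.0, 2.0)]
right = [(10.0, 0.0), (10.0, 2.0)]

e_natural = criterion([left, right])
e_mixed = criterion([[left[0], right[0]], [left[1], right[1]]])
print(e_natural, e_mixed)  # 4.0 20.0
```

A hill-climbing step would accept a move of a point between clusters only if it lowers this value, which is why the natural grouping is preferred over the mixed one here.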
While the use of the above criterion function could yield satisfactory results for numeric attributes, it is not appropriate for data sets with categorical attributes. For example, consider the above market basket database. Typically, the number of items, and hence the number of attributes, in such a database is very large (a few thousand) while the size of an average transaction is much smaller (less than a hundred). Furthermore, customers with similar buying patterns and, therefore, belonging to the same cluster, may buy a small subset of items from a much larger set that defines the cluster.
For instance, consider the cluster defined by the set of imported items like French wine, Swiss cheese, Italian pasta sauce, Belgian beer, etc. Every transaction within the cluster contains only a small subset of the above items. Thus, it is possible that a pair of transactions in a cluster will have few items in common, but remain linked by a number of other transactions within the cluster (each having a substantial number of items in common with the two transactions).
In addition, the set of items defining the clusters may not have uniform sizes. A cluster including common items such as diapers, baby food and toys, for example, will typically involve a large number of items and customer transactions, while the cluster defined by imported products will be much smaller. In the larger cluster, since the transactions are spread out over a larger number of items, most transaction pairs will have few items in common. Consequently, a smaller percentage of these transaction pairs will have a large number of items in common. Thus, the distances of the transactions from the mean of the larger cluster will be much larger.
Since the criterion function of the partitional clustering method is defined in terms of distance from the mean of a cluster, splitting large clusters generally occurs. By splitting larger clusters the distance between a transaction and the mean of the cluster is reduced and, accordingly, the criterion function is also reduced. Therefore, the partitional clustering method favors splitting large clusters. This, however, is not desirable since the large cluster is split even though transactions in the cluster are well connected and strongly linked.
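The tendency to split a large, well-connected cluster can be illustrated with a small sketch. It assumes the criterion function is the sum of Euclidean distances from each point to its cluster centroid, and it uses a hypothetical cluster made of every 3-item transaction over items 1 to 5; neither the data nor the helper names come from the original text.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def criterion(clusters):
    """Assumed form of E: sum of distances from each point to its cluster centroid."""
    return sum(euclidean(x, centroid(c)) for c in clusters for x in c)

def to_point(transaction, n_items):
    """Encode a transaction as a 0/1 vector over items 1..n_items."""
    return tuple(1.0 if i in transaction else 0.0 for i in range(1, n_items + 1))

# Hypothetical cluster: all 3-item transactions over items 1..5 (10 transactions).
transactions = [set(c) for c in combinations(range(1, 6), 3)]
whole = [to_point(t, 5) for t in transactions]
with_item_1 = [to_point(t, 5) for t in transactions if 1 in t]
without_item_1 = [to_point(t, 5) for t in transactions if 1 not in t]

e_whole = criterion([whole])                        # roughly 10.95
e_split = criterion([with_item_1, without_item_1])  # roughly 9.46
print(e_split < e_whole)  # True
```

Every pair of transactions in this hypothetical cluster shares at least one item, yet the split lowers the criterion value, so a method that only minimizes E would prefer it.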
Hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence. Current hierarchical clustering methods, however, are unsuitable for clustering data sets containing categorical attributes. For instance, consider a centroid-based hierarchical clustering method in which, initially, each point is treated as a separate cluster. Pairs of clusters whose centroids or means are the closest are then successively merged until the desired number of clusters remain. For categorical attributes, however, distances between centroids of clusters are a poor estimate of the similarity between them, as is illustrated by the following example.
Consider a market basket database containing the following 4 transactions concerning items 1, 2, 3, 4, 5 and 6: (a) {1, 2, 3, 5}, (b) {2, 3, 4, 5}, (c) {1, 4}, (d) {6}. The transactions can be viewed as points with boolean attributes (where 0 indicates that an item is missing while 1 indicates that an item is present in the transaction) corresponding to the items 1, 2, 3, 4, 5 and 6. The four points (a, b, c, d) thus become (1,1,1,0,1,0), (0,1,1,1,1,0), (1,0,0,1,0,0) and (0,0,0,0,0,1). Using the Euclidean distance metric to measure the closeness between points/clusters, the distance between the first two points (a and b) is $\sqrt{2}$, which is the smallest distance between any pair of the four points. As a result, points a and b are merged by the centroid-based hierarchical approach. The centroid of the new merged cluster is (0.5,1,1,0.5,1,0). In the next step, the third and fourth points (c and d) are merged since the distance between them is $\sqrt{3}$, which is less than the distance between the centroid of the merged cluster and either point c or point d. However, this leads to merging transactions {1, 4} and {6}, which do not have a single item in common. Thus, using distances between the centroids of clusters while making decisions about the clusters to merge could cause points belonging to different clusters to be assigned to the same cluster.
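The four-transaction example above can be replayed with a minimal sketch of centroid-based agglomerative merging. The function names and the stopping point of two clusters are assumptions made for illustration; the transactions are the ones given in the text.

```python
import math
from itertools import combinations

def to_point(transaction, n_items=6):
    """Encode a transaction as a 0/1 vector over items 1..n_items."""
    return tuple(1.0 if i in transaction else 0.0 for i in range(1, n_items + 1))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def centroid_clustering(transactions, k):
    """Repeatedly merge the two clusters with the closest centroids until k remain."""
    points = {name: to_point(t) for name, t in transactions.items()}
    clusters = [[name] for name in transactions]

    def centre(cluster):
        return centroid([points[name] for name in cluster])

    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: euclidean(centre(clusters[ij[0]]),
                                            centre(clusters[ij[1]])))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

transactions = {"a": {1, 2, 3, 5}, "b": {2, 3, 4, 5}, "c": {1, 4}, "d": {6}}
result = centroid_clustering(transactions, k=2)
print(sorted(sorted(c) for c in result))  # [['a', 'b'], ['c', 'd']]
```

The second merge joins c and d because their distance of about 1.73 is smaller than the distance of about 1.87 from c to the centroid of {a, b}, even though {1, 4} and {6} share no item.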
Once points belonging to different clusters are merged, the situation gets progressively worse as the clustering progresses. What typically happens is a ripple effect: as the cluster size grows, the number of attributes appearing in the mean goes up, and their individual values in the mean decrease. This makes it very difficult to distinguish between two points that differ on a few attributes and two points that differ on every attribute by small amounts. The following example illustrates this issue.
Consider the means of two clusters (1/3, 1/3, 1/3, 0, 0, 0) and (0, 0, 0, 1/3, 1/3, 1/3) with roughly the same number of points. Although the two clusters have no attributes in common, the Euclidean distance between their means is less than the distance from the point (1, 1, 1, 0, 0, 0) to the mean of the first cluster. This is undesirable since the point shares common attributes with the first cluster. A method based on distance will merge the two clusters and will generate a new cluster with mean (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).
An interesting side effect of this merger is that the distance of the point (1, 1, 1, 0, 0, 0) to the new cluster is even larger than the original distance of the point to the first of the merged clusters. In effect, what is happening is that the center of the cluster is spreading over more and more attributes. As this tendency starts, the cluster center becomes closer to other centers which also span a large number of attributes. Thus, these centers tend to spread across all attributes and lose the information about the points in the cluster that they represent.
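The distances in this example can be checked directly. The sketch below assumes the two clusters hold exactly the same number of points, so that the mean of the merged cluster is the simple average of the two means; the variable names are illustrative only.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

mean_first = (1/3, 1/3, 1/3, 0, 0, 0)
mean_second = (0, 0, 0, 1/3, 1/3, 1/3)
point = (1, 1, 1, 0, 0, 0)

# Assumption: equal cluster sizes, so the merged mean is the average of the two means.
mean_merged = tuple((a + b) / 2 for a, b in zip(mean_first, mean_second))

d_means = euclidean(mean_first, mean_second)    # about 0.816
d_point_first = euclidean(point, mean_first)    # about 1.155
d_point_merged = euclidean(point, mean_merged)  # about 1.472

print(d_means < d_point_first < d_point_merged)  # True
```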
"Set theoretic" similarity measures such as the Jaccard coefficient have often been used, instead of the Euclidean distance, for clustering data contained within databases. The Jaccard coefficient for similarity between transactions $T_1$ and $T_2$ is
$$\frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$$
With the Jaccard coefficient as the similarity measure between clusters, centroid-based hierarchical clustering schemes cannot be used since the measure is non-metric and is defined only for points in the cluster, not for its centroid. Therefore, either a minimum spanning tree (MST) hierarchical clustering method or a hierarchical clustering method using a group average technique must be employed instead. The MST method merges, at each step, the pair of clusters containing the most similar pair of points, while the group average method merges the pair for which the average similarity between pairs of points is the highest.
The MST method is known to be very sensitive to outliers (an outlier is a noise impulse which is locally inconsistent with the rest of the data) while the group average method has a tendency to split large clusters (since the average similarity between two subclusters of a large cluster is small). Furthermore, the Jaccard coefficient is a measure of the similarity between only the two points in question. Thus it does not reflect the properties of the neighborhood of the points. Consequently, the Jaccard coefficient fails to capture the natural clustering of data sets forming clusters with categorical attributes that are close to each other as illustrated in the following example.
FIG. 1 illustrates two transaction clusters 10, 12 of a market basket database containing items 1, 2, . . . 8, 9. The first cluster 10 is defined by 5 items while the second cluster 12 is defined by 4 items. These items are shown at the top of the two clusters 10, 12. Note that items 1 and 2 are common to both clusters 10, 12. Each cluster 10, 12 contains transactions of size 3, one for every 3-item subset of the set of items that defines the cluster.
The Jaccard coefficient between an arbitrary pair of transactions belonging to the first cluster 10 ranges from 0.2 (e.g., {1, 2, 3} and {3, 4, 5}) to 0.5 (e.g., {1, 2, 3} and {1, 2, 4}). Note that even though {1, 2, 3} and {1, 2, 7} share two common items and have a high coefficient (0.5), they belong to different clusters 10, 12. In contrast, {1, 2, 3} and {3, 4, 5} have a lower coefficient (0.2), but belong to the same cluster 10.
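The coefficients quoted above can be reproduced with a short sketch. FIG. 1 is not reproduced here, so the defining item sets are an assumption inferred from the transactions named in the text: {1, 2, 3, 4, 5} for cluster 10 and {1, 2, 6, 7} for cluster 12. The Jaccard coefficient is taken to be the size of the intersection divided by the size of the union.

```python
from itertools import combinations

def jaccard(t1, t2):
    """Jaccard coefficient: |T1 ∩ T2| / |T1 ∪ T2|."""
    return len(t1 & t2) / len(t1 | t2)

# Assumed defining item sets, inferred from the transactions cited in the text.
cluster_10 = [frozenset(c) for c in combinations([1, 2, 3, 4, 5], 3)]
cluster_12 = [frozenset(c) for c in combinations([1, 2, 6, 7], 3)]

within_10 = [jaccard(a, b) for a, b in combinations(cluster_10, 2)]
print(min(within_10), max(within_10))   # 0.2 0.5
print(jaccard({1, 2, 3}, {1, 2, 7}))    # 0.5 (different clusters)
print(jaccard({1, 2, 3}, {3, 4, 5}))    # 0.2 (same cluster)
```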
The MST method would first merge transactions {1, 2, 3} and {1, 2, 7} since the Jaccard coefficient has the maximum value (0.5). Once this merger occurs, the new cluster may subsequently merge with transactions from both clusters 10, 12, for example, {1, 3, 4} or {1, 6, 7}, since these are very similar to transactions in the new cluster. This is an expected result since it is well known that the MST method is fragile when clusters are not well-separated.
The use of a group average technique for merging clusters ameliorates some of the problems with the MST method. However, it may still fail to discover the correct clusters. For example, similar to the MST method, the group average method would first merge a pair of transactions containing items 1 and 2, but belonging to the different clusters 10, 12. The group average of the Jaccard coefficient between the new cluster and every other transaction containing both 1 and 2 is still the maximum (0.5). Consequently, every transaction containing both 1 and 2 may get merged together into a single cluster in subsequent steps of the group average method. Therefore, transactions {1, 2, 3} and {1, 2, 7} from the two separate clusters 10, 12 may be assigned to the same cluster by the time the method completes.
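The group average behavior described above can be checked with a brief sketch. It assumes that {1, 2, 3} and {1, 2, 7} have already been merged, and that the clusters of FIG. 1 are defined by the item sets {1, 2, 3, 4, 5} and {1, 2, 6, 7} (an inference from the transactions named in the text, since the figure is not reproduced here).

```python
import math
from itertools import combinations

def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

def group_average(cluster, transaction):
    """Average Jaccard coefficient between a transaction and the members of a cluster."""
    return sum(jaccard(member, transaction) for member in cluster) / len(cluster)

# Assumed transactions: every 3-item subset of each defining item set.
all_transactions = {frozenset(c) for c in combinations([1, 2, 3, 4, 5], 3)}
all_transactions |= {frozenset(c) for c in combinations([1, 2, 6, 7], 3)}

merged = [frozenset({1, 2, 3}), frozenset({1, 2, 7})]  # assumed first merge
remaining = all_transactions - set(merged)

scores = {t: group_average(merged, t) for t in remaining}
top = max(scores.values())
closest = {t for t, s in scores.items() if math.isclose(s, top)}

print(top)                              # 0.5
print(sorted(sorted(t) for t in closest))
# [[1, 2, 4], [1, 2, 5], [1, 2, 6]]
```

Under these assumptions, the transactions with the highest group average are exactly the remaining ones that contain both items 1 and 2, regardless of which cluster they come from.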
The problem of clustering related customer transactions in a market basket database has recently been addressed by using a hypergraph approach. Frequent itemsets used to generate association rules are used to construct a weighted hypergraph. Each itemset is a hyperedge in the weighted hypergraph and the weight of the hyperedge is computed as the average of the confidences for all possible association rules that can be generated from the itemset. Then, a hypergraph partitioning procedure is used to partition the items such that the sum of the weights of hyperedges that are cut due to the partitioning is minimized. The result is a clustering of items (not transactions) that occur together in the transactions. The item clusters are then used as the description of the cluster and a scoring metric is used to assign customer transactions to the best item cluster. For example, a transaction T may be assigned to the item cluster $C_i$ for which the ratio $\frac{|T \cap C_i|}{|C_i|}$ is the highest.
The rationale for using item clusters to cluster transactions is questionable. For example, the approach makes the assumption that itemsets that define the clusters are disjoint and have no overlap among them. This may not be true in practice since transactions in different clusters may have a few common items. For instance, consider the market basket database in the above example (FIG. 1). With minimum support set to 2 transactions, the hypergraph partitioning scheme generates two item clusters of which one is {7} and the other contains the remaining items (since 7 has the fewest hyperedges to the other items). However, this results in transactions {1, 2, 6} (from cluster 12) and {3, 4, 5} (from cluster 10) being assigned to the same cluster since both have the highest score with respect to the big item cluster.
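The assignment step in this example can be sketched as follows. Two things are assumed here rather than taken from the text: the score of a transaction T against an item cluster C is computed as |T ∩ C| / |C|, and the "remaining items" in the big item cluster are items 1 through 6. The names `score` and `assign` are hypothetical.

```python
def score(transaction, item_cluster):
    """Assumed scoring ratio: fraction of the item cluster covered by the transaction."""
    return len(transaction & item_cluster) / len(item_cluster)

# Item clusters from the example; the contents of the big cluster are an assumption.
item_clusters = {
    "small": frozenset({7}),
    "big": frozenset({1, 2, 3, 4, 5, 6}),
}

def assign(transaction):
    """Return the name of the item cluster with the highest score for the transaction."""
    return max(item_clusters, key=lambda name: score(transaction, item_clusters[name]))

print(assign({1, 2, 6}))  # big  (a transaction from cluster 12)
print(assign({3, 4, 5}))  # big  (a transaction from cluster 10)
```

Both transactions score 0.5 against the big item cluster and 0 against {7}, so they land together even though they come from different transaction clusters.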
Accordingly, a clustering method that can correctly identify clusters containing data with categorical attributes, while efficiently handling outliers, is still needed.