1. Technical Field
The present invention relates to assigning data to clusters. More particularly, some examples of the invention concern assigning categorical data to clusters and/or identifying outliers and/or anomalies in the categorical data.
2. Description of Related Art
The problem of clustering concerns finding groupings of data such that the data gathered in each group are similar to one another and, at the same time, different from the data in other groups. Clustering has received a great deal of attention for numeric data, where it is easy to construct mathematical formulas that measure the degree of similarity and separation between data points. One such method is known as k-means, in which the person who wants to cluster the data chooses the number of clusters (k) ahead of time and assigns each data point to one of the k clusters, with the objective of finding the assignment that minimizes:
$$\sum_{i=1}^{k} \sum_{j=1}^{n_i} \lVert x_j - \mu_i \rVert_2^2$$
where $\lVert x_j - \mu_i \rVert_2^2$ is the squared Euclidean distance between vectors $x_j$ and $\mu_i$, $\mu_i$ is the hypothesized mean of the ith cluster, $x_j$ is the jth point assigned to cluster i, $n_i$ is the number of such points, j iterates from 1 to $n_i$, and i iterates from 1 to the number of clusters k. In the case where x and μ are not vectors but are instead scalar numbers, the distance is calculated as the square of the difference between the two numbers. The k-means approach thus seeks to minimize the total within-cluster distance from every point to the mean of the cluster to which it is assigned.
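As a minimal sketch of the k-means objective described above, the following Python evaluates the within-cluster sum of squared distances for a given assignment. The function names and the toy data are hypothetical, and the sketch assumes that each cluster mean is taken as the component-wise average of the points currently assigned to that cluster. It only scores an assignment; it does not search for the minimizing one.

```python
from math import fsum


def squared_distance(x, mu):
    """Squared Euclidean distance between two equal-length vectors."""
    return fsum((a - b) ** 2 for a, b in zip(x, mu))


def cluster_mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [fsum(column) / n for column in zip(*points)]


def kmeans_objective(clusters):
    """Sum, over all clusters, of squared distances from each point to its cluster mean.

    `clusters` is a list of k non-empty lists of points (an assumed input layout).
    """
    total = 0.0
    for points in clusters:
        mu = cluster_mean(points)  # assumption: mean of the assigned points
        total += fsum(squared_distance(x, mu) for x in points)
    return total


# Hypothetical toy data: two candidate assignments of the same four points.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

good_score = kmeans_objective(good)
bad_score = kmeans_objective(bad)
```

Under these assumptions the assignment that groups nearby points together scores lower than the one that pairs distant points, which is the behavior the objective is meant to reward.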
While k-means and other methods have been developed for clustering numeric data, categorical data present significant difficulties for these methods. Categorical data are data in which the data elements are nonnumeric. For example, within a category of fruit, there may be apples, bananas, pears, and so forth. Within another category of colors, there may be red, yellow, and green. A clustering problem might require grouping data consisting of these fruits and colors, rather than a numeric characteristic associated with the fruits (e.g., length, volume) or colors (e.g., intensity, wavelength).
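One way to see the difficulty is to force categorical values into numbers and apply the squared-difference measure used for scalars. The sketch below is only an illustration: the two codebooks are hypothetical, arbitrary labelings, and neither is suggested by the categories themselves.

```python
def squared_difference(a, b):
    """Scalar distance used by k-means: the square of the difference."""
    return (a - b) ** 2


# Two equally arbitrary (hypothetical) integer encodings of the same fruit category.
codebook_a = {"apple": 0, "banana": 1, "pear": 2}
codebook_b = {"apple": 0, "pear": 1, "banana": 2}

# "Distance" between apple and pear under each encoding.
d_a = squared_difference(codebook_a["apple"], codebook_a["pear"])
d_b = squared_difference(codebook_b["apple"], codebook_b["pear"])
```

Here `d_a` is 4 and `d_b` is 1, so the apparent separation between the same two fruits depends entirely on the labeling chosen, not on the data.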
After data are clustered, it is often of interest to identify those data that are not well associated with any cluster. When treating numeric data, this is often accomplished by determining the minimum distance from any particular data point to the center of mass (center) of a cluster (for example, the mean of all points assigned to the cluster). If a point is not sufficiently close to the center of any cluster, then it can be regarded as an “outlier” or “anomaly.” Distance is typically calculated in terms of the common Euclidean metric. For two points $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$, each having n components, the distance is:
$$\lVert x - y \rVert_2 = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$$
Although these techniques are useful for numeric data, known techniques are inadequate for clustering categorical data and for identifying categorical data that are not well associated with any cluster.
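The numeric outlier test described above can be sketched as follows. The function names, the toy data, and the fixed `threshold` are assumptions made for illustration; the text says only that a point must be "sufficiently close" to some center and does not specify how that cutoff is chosen.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def nearest_center_distance(point, centers):
    """Minimum Euclidean distance from `point` to any cluster center."""
    return min(dist(point, center) for center in centers)


def find_outliers(points, centers, threshold):
    """Return the points farther than `threshold` from every cluster center.

    `threshold` is a hypothetical cutoff supplied by the caller.
    """
    return [p for p in points if nearest_center_distance(p, centers) > threshold]


# Hypothetical toy data: two cluster centers and three points to check.
centers = [(0.0, 0.0), (10.0, 0.0)]
points = [(0.0, 1.0), (9.0, 0.0), (5.0, 5.0)]

outliers = find_outliers(points, centers, threshold=3.0)
```

With these values only the point (5.0, 5.0) is flagged, because it lies about 7.07 units from both centers while the other two points each lie 1 unit from their nearest center. The sketch relies on a numeric distance, so it does not carry over to categorical data.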