1. Field of the Invention
This invention relates to the field of data analysis. In particular, the invention relates to the use of probabilistic clustering to produce a decision tree.
2. Description of the Related Art
Clustering
Identifying clusters helps in the identification of patterns in a set of data. As the size of sets of data such as databases, data warehouses, and data marts has grown, this type of knowledge discovery, or data mining, has become increasingly important. Data mining allows patterns to be identified and predictions to be made about these large sets of data.
Additionally, for many decision making processes, it is also important that the results be interpretable, or understandable. Complex formulas or graphical relationships may not be suited to allowing a human to gain insight as to the trends and patterns in a set of data.
For example, consider a financial institution that wants to evaluate trends in its loan practices. The actual set of data with loan information and lending decisions may have millions of data points. By identifying clusters, it is possible to identify groups of records that exhibit patterns, or strong internal consistencies, to one another. For example, one cluster of people who were approved for loans might be those with high incomes and low debts. While this is not a tremendous surprise, clustering can also identify non-obvious patterns in the data. The results may also have predictive value about future loans. For example, one cluster might reveal a high number of approved loans in one region, but not another similar region. This information may be useful in making future lending decisions.
When the clusters are interpretable, they can be used to drive decision making processes. However, if the resulting clusters are described in terms of complex mathematical formulas, graphs, or cluster centroid vectors, the usefulness of the clusters is diminished. An example of an interpretable cluster is residents of zip code 94304 (Palo Alto, Calif.). This cluster is easily understood without additional explanation. The cluster can be used to make decisions, adjust company strategies, etc. An example of a non-interpretable cluster is one defined mathematically, e.g. all data points within a given Euclidean distance from a centroid vector.
Several techniques have been used for clustering. The underlying concept behind clustering is that each of the data points in a set of data can be viewed as a vector in a high dimensional space. The vector for a data point is composed of attributes, sometimes called features. For example, consider a set of data with two attributes for each data element: time and temperature. Thus, a data point X can be written as a 2-vector: X=(x1, x2), where x1 is time and x2 is temperature. As the number of attributes increases, the vector increases in length. For n attributes, a data point X can be represented by an n-vector:
X=(x1,x2, . . . ,xn).
In database terminology, the set of data could be a table, or a combination of tables. The data points are records, also called entries or rows. The attributes are fields, or columns.
k-means
One common technique for identifying clusters is the k-means technique. (See Krishnaiah, P. R. and Kanal, L. N., Classification Pattern Recognition, and Reduction in Dimensionality, Amsterdam: North Holland, 1982.) The k-means technique is iterative. The process starts with the placement of k centroids in the domain space. Then, the centroids are adjusted in an iterative process until their positions stabilize. The result is that the clusters are defined in terms of the placement of the centroids. FIG. 1 shows a set of clusters defined by centroids through the k-means technique. The data points are indicated with “.” in the two dimensional data domain space. The centroids are indicated with “x”. The resulting clusters are formed by those data points within a certain distance of the centroids as indicated by the ellipsoids.
In order to position the centroids and define the clusters, the k-means technique relies on the existence of a similarity, or distance, function for the domain. For example, in a set of data with a domain comprising time and temperature data points, Euclidean distance can be used. In other cases, the Hamming distance is used. However, if the data set comprises discrete attributes, e.g. eye color, race, etc., no clear similarity function is available. This lack of domain independence for the k-means technique limits its application to data domains for which there are well defined similarity functions.
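As an illustration of the two similarity functions named above, the following is a minimal Python sketch. The function names `euclidean` and `hamming` and the sample values are assumptions chosen for this example and do not come from the cited references.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    """Hamming distance: count of positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

# Two hypothetical (time, temperature) data points.
p = (0.0, 0.0)
q = (3.0, 4.0)
print(euclidean(p, q))

# Two hypothetical bit strings of equal length.
print(hamming("10110", "10011"))
```

Neither function gives a meaningful notion of closeness for an unordered discrete attribute such as eye color, which is the limitation described above.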
The clusters that result from the k-means technique are difficult to interpret because each cluster is defined only by a centroid and a distance from the centroid. Returning to the example of loan approval data for a bank, the resulting report would be a list of centroids and distances from the centroids for bank loan data points. The contents of the clusters would not be apparent. This type of information is not easily used to drive decision making processes, except perhaps after further computer analysis.
The k-means technique is also fairly computationally expensive, especially given that additional computational resources will have to be used if any analysis of the clusters is required. In big-O notation, the k-means algorithm is O(knd), where k is the number of centroids, n is the number of data points, and d is the number of iterations.
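The iterative centroid adjustment described above can be sketched as follows. This is a simplified illustration that assumes Euclidean distance and caller-supplied starting centroids; the name `kmeans`, its parameters, and the sample points are hypothetical.

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Plain k-means on tuples of floats; returns (centroids, assignments)."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:  # positions have stabilized
            break
        centroids = new_centroids
    return centroids, assignments

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, assignments = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids, assignments)
```

Each iteration evaluates k × n distances, which is consistent with the O(knd) cost stated above. The output is a list of centroid coordinates and cluster indices, which shows why the result is hard to read as a business rule.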
Hierarchical Agglomerative Clustering
Another prior art technique is hierarchical agglomerative clustering (HAC). (See Rasmussen, E. Clustering Algorithms. In Information Retrieval: Data Structures and Algorithms, 1992.) The basic idea behind HAC is that the clusters can be built in a tree-like fashion starting from clusters of one data point and then combining the clusters until a single cluster with all of the data points is constructed. FIG. 2 illustrates the clusters generated by HAC. The process is as follows: each data point is placed into a cluster by itself, shown by circles surrounding the single data point in FIG. 2. Then, at the next step, a similarity, or distance, function is used to find the closest pair of smaller clusters, which are then merged into a larger cluster. The resulting clusters are junctions in the dendrogram shown in FIG. 2. The process of combining clusters is continued as the tree is built from the bottom to the top as indicated by the arrow in FIG. 2 showing the flow of time.
As in the k-means technique, a similarity, or distance, function is needed. Therefore, HAC cannot be used on data domains with discrete attributes without a suitable distance function. Also, as in the k-means technique, the resulting clusters are not interpretable, other than by their centroids. For example, turning to the clusters developed in FIG. 2, if the user decided that they wanted to consider four clusters, they would select the stage of the process where four clusters existed. Those clusters, though, are not susceptible to meaningful interpretation except perhaps through further computer analysis. Also, HAC is computationally expensive, O(n^2), where n is the number of data points.
Returning to the example of loan approval data for a financial institution, knowing that there are two clusters, one with five million data points and the other with seven million, does not convey much, or perhaps any, meaningful information to a human. That is because the clusters produced by HAC are defined in terms of centroids, as in k-means.
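The bottom-up merging performed by HAC can be sketched as follows. This is an unoptimized illustration that assumes Euclidean distance between cluster centroids as the similarity function; the names `centroid` and `hac`, the `target_clusters` parameter, and the sample points are hypothetical.

```python
import math

def centroid(cluster):
    """Mean of the points in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def hac(points, target_clusters=1):
    """Merge the closest pair of clusters until target_clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per data point
    merges = []                       # record of merged clusters, in order
    while len(clusters) > target_clusters:
        # Find the pair of clusters whose centroids are closest.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: math.dist(centroid(clusters[ab[0]]), centroid(clusters[ab[1]])),
        )
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        # Replace the two clusters with their union (j > i, so remove j first).
        clusters.pop(j)
        clusters.pop(i)
        clusters.append(merged)
    return clusters, merges

points = [(1.0, 1.0), (1.5, 1.0), (8.0, 8.0), (8.0, 9.0), (20.0, 20.0)]
clusters, merges = hac(points, target_clusters=2)
print(clusters)
```

This naive version rescans every pair of clusters on each merge, so it is slower than the O(n^2) figure quoted above; it is meant only to show the order of the merges. Stopping at `target_clusters=2` corresponds to selecting the stage of the dendrogram at which two clusters exist.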
AutoClass
Another prior art technique is AutoClass, developed by NASA. (See Cheeseman, P. and Stutz, J. Bayesian Classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining. AAAI Press 1996.) Unlike k-means and HAC, AutoClass can work on domains with discrete attributes and is domain independent because no domain specific similarity functions are required. The concept behind AutoClass is to identify k distributions, e.g. the n-dimensional Gaussian distribution, and fit those k distributions to the data points. The model is built up using multiple values of k in successive loops through the process until the fit of the distributions to the data set cannot be improved by adding additional distributions. During each pass, every record in the set of data must be accessed. Further, during each pass, data must be maintained for each data point about which of the distributions the data point is in.
FIG. 3 shows a possible mixture model that may be found after applying the AutoClass technique. The data set is shown as a solid line distributed across the domain. The dashed lines indicate the three distributions currently fit to the data. The number of distributions is the number of clusters. In FIG. 3, there are three clusters.
The results of AutoClass can be extremely difficult to interpret. Although FIG. 3 shows a clean separation between the three distributions, the distributions actually extend in both directions at very low levels. Thus, questions about the contents of a cluster are answered with conditional probabilities: P(blue eyes|cluster 1)=0.9, etc. However, even in this simple one-dimensional data set of eye colors, P(blue eyes|cluster 2) will be non-zero as well. For higher dimensional data sets, the results are even more difficult to interpret. This lack of interpretability reduces the usefulness of AutoClass in understanding data sets.
Thus, like k-means and HAC, AutoClass results are difficult to interpret because the clusters are not easily defined by a logical rule, but are rather expressed in terms of conditional probabilities. This makes it extremely difficult to use the generated results to drive decision making, make predictions, or identify patterns without further analysis.
AutoClass is more computationally expensive than either k-means or HAC. AutoClass is O(nkdv), where n is the number of data points, k is the number of distributions, d is the number of iterations for each model, and v is the number of models, or different k values considered. Further, this big-O notation does not take into account the heavy data access costs imposed by AutoClass or the additional storage requirements.
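The overlap between fitted distributions can be illustrated with a small sketch. It is not the AutoClass implementation; it assumes a one-dimensional mixture of three Gaussian distributions whose weights, means, and standard deviations are invented for this example, and the names `gaussian_pdf` and `memberships` are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def memberships(x, components):
    """Return P(cluster | x) for each (weight, mean, std) component."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]

# Three assumed distributions, loosely mirroring the three clusters of FIG. 3.
components = [(0.3, 0.0, 1.0), (0.4, 5.0, 1.0), (0.3, 10.0, 1.0)]
probs = memberships(1.0, components)
print(probs)
```

Even for a point that lies close to the first distribution, every cluster receives a non-zero probability, so cluster membership cannot be stated as a simple logical rule.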
COBWEB
The previously discussed techniques were all oriented towards clustering entire sets of data. COBWEB is an online, or incremental approach to clustering. FIG. 4 shows a COBWEB tree structure with clusters. The clusters are the nodes of the tree. FIG. 4 shows a new data point, X, to be added to the data set. COBWEB is based on a probability distribution between the nodes of the tree. Because of the incremental nature of the technique, there are a number of special cases to handle merging and splitting of tree nodes based on subsequently received data.
Like AutoClass results, the clusters are defined by conditional probabilities that are not easily interpreted. Also, the performance of the COBWEB algorithm is sensitive to tree depth. Thus, if the initial data inserted into the tree is not representative of the whole tree, the algorithm performance may degrade. The predicted big-O time to add a single object is O(B^2 log_B n × AV), where n is the number of data points, B is the average branching factor of the tree, A is the number of attributes, and V is the average number of values per attribute.
COBWEB results are not easily used to drive decision making processes, or to identify patterns in a set of data, or to predict trends in the set of data. The conditional probabilities make the results particularly difficult to interpret. Also, because COBWEB has a certain sensitivity to the initial data points, the developed clusters may reflect clusters that are not formed on the most significant attributes. For example, if the initial thousand points in a set of data with ten million points reflect mostly rejected loans, the tree structure may become imbalanced as the remainder of the data points are added. Thus, in addition to being difficult to interpret, the nature of the identified clusters may be skewed based on the initial data.
The prior art systems do not provide a clustering technique that produces interpretable clusters, is scalable to large data sets, i.e. fast, and has domain independence. Accordingly, what is needed is a clustering technique that produces interpretable results, scales to handle large data sets well, and is domain independent. Further, what is needed is the ability to apply the clustering technique to data marts and data warehouses to produce results usable in decision making by the identification of meaningful clusters.
Some embodiments of the invention include a method for scalable probabilistic clustering using a decision tree. The method runs in time that is linear with respect to the number of data points in the set of data being clustered. Some embodiments of the invention can be run against a set of data such as a database, data warehouse, or data mart without creating a significant performance impact. Some embodiments of the invention access the set of data only a single time.
Some embodiments of the invention produce interpretable clusters that can be described in terms of a set of attributes and attribute values for that set of attributes. In some embodiments, the cluster can be interpreted by reading the attribute values and attributes on the path from the root node to the node of the decision tree corresponding to the cluster.
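Reading a cluster description from the path between the root node and a cluster node can be sketched as follows. The `Node` class, the `describe` function, and the attribute names are hypothetical and are used only to illustrate the idea.

```python
class Node:
    """Decision tree node; (attribute, value) labels the edge from its parent."""

    def __init__(self, attribute=None, value=None, parent=None):
        self.attribute = attribute
        self.value = value
        self.parent = parent
        self.children = []

    def add_child(self, attribute, value):
        child = Node(attribute, value, parent=self)
        self.children.append(child)
        return child

def describe(node):
    """Read the attribute=value pairs on the path from the root to `node`."""
    conditions = []
    while node.parent is not None:
        conditions.append(f"{node.attribute}={node.value}")
        node = node.parent
    return " AND ".join(reversed(conditions)) or "(all data)"

root = Node()
palo_alto = root.add_child("zip_code", "94304")
high_income = palo_alto.add_child("income", "high")
print(describe(high_income))
```

The resulting description is a conjunction of attribute values, in contrast to the centroid vectors and conditional probabilities produced by the prior art techniques.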
In some embodiments, it is not necessary for there to be a domain specific similarity, or distance function, for the attributes.
In some embodiments, a cluster is determined by identifying an attribute with the highest influence on the distribution of the other attributes. Each of the values assumed by the identified attribute corresponds to a cluster, and a node in the decision tree. For example, an attribute for gender might have the values “male”, “female”, “no response”. Thus, if the gender attribute has the highest influence on the distribution of the remaining attributes, e.g. number of dresses purchased attribute, etc., then three clusters would be determined: a gender=“male” cluster, a gender=“female” cluster, and a gender=“no response” cluster.
In some embodiments, these clusters are further refined by recursively applying the method to the clusters. This can be done without additional data retrieval and with minimal computation.
In some embodiments, subsetting can be used to combine clusters. For example, the income attribute might have two values, one for “$10,000 to $20,000” and another for “$20,000 to $30,000”. However, the similarities between data points with those two distinct values for income might be great enough that instead of treating the two values as two separate clusters, they instead are subsetted into a single cluster.
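The selection of a splitting attribute can be sketched as follows. This passage does not define how influence is measured, so the sketch assumes, purely for illustration, that the influence of an attribute is the sum of its pairwise mutual information with every other attribute. The function names, the `razors` attribute, and the sample records are likewise hypothetical.

```python
import math
from collections import Counter

def mutual_information(records, a, b):
    """Mutual information (in bits) between attributes a and b."""
    n = len(records)
    joint = Counter((r[a], r[b]) for r in records)
    count_a = Counter(r[a] for r in records)
    count_b = Counter(r[b] for r in records)
    return sum(
        (c / n) * math.log2((c * n) / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )

def most_influential(records, attributes):
    """Assumed influence measure: summed mutual information with the other attributes."""
    def influence(a):
        return sum(mutual_information(records, a, b) for b in attributes if b != a)
    return max(attributes, key=influence)

def split(records, attribute):
    """One cluster per value assumed by the chosen attribute."""
    clusters = {}
    for r in records:
        clusters.setdefault(r[attribute], []).append(r)
    return clusters

records = (
    [{"gender": "male", "dresses": "none", "razors": "some"}] * 2
    + [{"gender": "female", "dresses": "some", "razors": "none"}] * 2
    + [{"gender": "no response", "dresses": "none", "razors": "none"}] * 2
)
attributes = ["gender", "dresses", "razors"]
best = most_influential(records, attributes)
clusters = split(records, best)
print(best, sorted(clusters))
```

In this toy data the gender attribute is selected, and splitting on it yields one cluster for each of its three values, matching the example above.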
In some embodiments, feature elimination can be used to eliminate from consideration attributes that have highly uniform distributions. In some embodiments, the entropy of each attribute is computed to determine the uniformity of the distributions of the attributes. For example, in a set of data with several hundred attributes, feature elimination can be used to eliminate those features that would not play significant factors in clustering.
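The entropy computation used for feature elimination can be sketched as follows. The sketch assumes that a "highly uniform" distribution means one whose entropy is close to the maximum possible for its number of distinct values; the 0.95 threshold, the function names, and the sample columns are hypothetical choices made for this example.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def normalized_entropy(values):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    distinct = len(set(values))
    if distinct < 2:
        return 0.0  # a constant attribute has zero entropy
    return entropy(values) / math.log2(distinct)

# Hypothetical columns: one spread evenly over its values, one skewed.
columns = {
    "record_digit": ["0", "1", "2", "3"],
    "approved": ["yes", "yes", "yes", "no"],
}

# Assumed rule: drop attributes whose normalized entropy exceeds 0.95.
kept = [name for name, values in columns.items() if normalized_entropy(values) <= 0.95]
print(kept)
```

Under this assumed rule, the evenly spread attribute is eliminated and the skewed attribute is retained for clustering.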
In some embodiments, the CUBE operation is used to access the set of data a single time.
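The effect of a CUBE-style aggregation can be approximated in a short sketch: a single pass over the records that counts every combination of attribute values for every subset of attributes. This is a Python analogue written for illustration, not the database operator itself, and the name `cube_counts` and the sample records are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cube_counts(records, attributes):
    """Count value combinations for every subset of attributes in one pass."""
    subsets = [
        subset
        for size in range(len(attributes) + 1)
        for subset in combinations(attributes, size)
    ]
    counts = Counter()
    for record in records:  # the set of data is read exactly once
        for subset in subsets:
            counts[tuple((a, record[a]) for a in subset)] += 1
    return counts

records = [
    {"gender": "male", "approved": "yes"},
    {"gender": "female", "approved": "yes"},
    {"gender": "male", "approved": "no"},
]
counts = cube_counts(records, ["gender", "approved"])
print(counts[(("gender", "male"),)])
```

Once such counts are held in memory, the frequencies needed for further analysis can be read from them without returning to the underlying data. The number of subsets grows exponentially with the number of attributes, so this direct approach is practical only for a modest number of attributes.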