1. Description of the Prior Art
The proliferation of computerized database management systems has resulted in the accumulation of large amounts of data by the users of such systems. To be able to use these huge storehouses of data to best advantage, a process called "data mining" has emerged. Data mining involves the development of tools that can extract patterns from the data and the utilization of the patterns to establish trends which can be analyzed and used.
Classification is an important aspect of data mining and involves the grouping of the data to make it easier to work with. U.S. Pat. No. 5,787,274 to Agrawal et al. provides extensive discussion of prior art classification methods used in data mining and is incorporated herein by reference.
Decision tree classification is the classification method of choice when time cost is an issue. As described in more detail below, prior art decision-tree classification methods utilize a two-phase process: a "building" phase and a "pruning" phase. In the building phase, a decision tree is built by recursively partitioning a training data set until all the records in a partition belong to the same class. For every partition, a new node is added to the decision tree. A partition in which all the records have identical class labels is not partitioned further, and the leaf corresponding to it is labeled with the class label. Those records that do not fall into the class of a particular node are passed off as a new branch until they fall within a subsequent class.
A number of algorithms for inducing decision trees have been proposed over the years. See, for example, L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont, 1984; B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, 1996; Manish Mehta, Rakesh Agrawal, and Jorma Rissanen, SLIQ: A Fast Scalable Classifier For Data Mining, In EDBT 96, Avignon, France, March 1996; J. R. Quinlan, Induction of Decision Trees, Machine Learning, 1:81-106, 1986; J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993; John Shafer, Rakesh Agrawal, and Manish Mehta, SPRINT: A Scalable Parallel Classifier for Data Mining, In Proc. of the VLDB Conference, Bombay, India, September 1996; all of which are incorporated herein by reference.
The building phase constructs a perfect tree (a tree that classifies, with precision, every record from the training set). However, one often achieves greater accuracy in the classification of new objects by using an imperfect, smaller decision tree rather than one which perfectly classifies all known records. The reason is that a decision tree which precisely classifies every record may be overly sensitive to statistical irregularities and idiosyncrasies of the training set. Thus, all of the prior art algorithms known to the applicant perform a pruning phase after the building phase in which nodes are iteratively pruned to prevent "overfitting" and to obtain a tree with higher accuracy.
For pruning, the Minimum Description Length ("MDL") principle (or other known pruning method) is applied to prune the tree built in the growing phase and make it more general. The MDL principle states that the "best" tree is the one that can be encoded using the fewest number of bits. Thus, under the MDL principle, during the pruning phase the subtree of the tree that can be encoded with the least number of bits is identified. The cost C (in bits) of communicating classes using a decision tree comprises (1) the bits to encode the structure of the tree itself, and (2) the number of bits needed to encode the classes of records in each leaf of the tree.
MDL pruning (1) leads to accurate trees for a wide range of data sets, (2) produces trees that are significantly smaller in size, and (3) is computationally efficient and does not use a separate data set for pruning. For the above reasons, the pruning algorithms of the present invention employ MDL pruning. A detailed explanation of MDL-based pruning can be found in MDL-based Decision Tree Pruning, Manish Mehta, Jorma Rissanen and Rakesh Agrawal, International Conference of Knowledge Discovery in Databases and Data Mining (KDD-95), Montreal, Canada, August 1995, incorporated herein by reference.
2. The Process of Decision-Tree Based Classification
To understand the present invention it is necessary to have a basic understanding of the process involved in building and pruning a decision tree. An initial data set comprises a group of sample records called the "training set" and is used to establish a model or description of each class, which model is then used to classify future records as they are input to the database.
Each sample record has multiple attributes and is identified or "tagged" with a special classifying attribute which indicates a class to which the record belongs. Attributes can be continuous (e.g., a series of numerical values which can be placed in numerical order and which can be grouped based on their numerical value) or categorical (a series of categorical values which can only be grouped based on their category). For example, as shown in FIG. 1, a training set might comprise sample records identifying the salary level (a continuous attribute) and education level (a categorical attribute) of a group of applicants for loan approval. In this example, each record is tagged with either an "accept" classifying attribute or a "reject" classifying attribute, depending upon the parameters for acceptance or rejection set by the user of the database. The goal of the classification step is to generate a concise and meaningful description for each class that can be used to classify subsequent records.
As shown in FIG. 1, there is a single record corresponding to each loan request, each of which is tagged with either the "accept" label if the loan request is approved or the "reject" label if the loan request is denied. Each record is characterized by each of the two attributes, salary (e.g., $10,000 per year) and education level completed (e.g. high-school, undergraduate, graduate).
1. Building Phase
FIG. 2 is an example of a decision tree for the training data in FIG. 1. Each internal node of the decision tree (denoted by a circle in FIG. 2) has a "test" involving an attribute, and an outgoing branch for each possible outcome. For example, at the root node 10 the test is "is the salary level of the applicant less than $20,000?" If the answer to this inquiry is "no," the loan application is automatically accepted, ending the inquiry and establishing a "leaf" 20 (a leaf is the ultimate conclusion of a partition after no further inquiry is to be made, and is denoted by a square in FIG. 2) for the acceptance. Thus, in the example, an applicant who has a salary of $20,000 or more is classified in a class for those applicants who qualify for a loan based on their salary alone.
If the answer to the test at root node 10 is "yes" (i.e., the applicant's salary is less than $20,000), further inquiry is made to determine if the applicant can pass the test at internal node 30, namely "does the applicant possess at least a graduate level of education?" If the answer to this inquiry is "yes," then the loan is accepted, even though the salary level of the applicant is below the $20,000 threshold, and a leaf 40 is established for the acceptance. This places the applicant in a class comprising applicants who do not qualify for a loan based on salary but who do qualify based on their education level.
If the answer to the inquiry at node 30 is "no," then the loan is rejected and a leaf 50 is established for the rejection.
The outcome of the test at an internal node determines the branch traversed and thus the next node visited. The class for the record is simply the class of the final leaf node (e.g., accept or reject). Thus, the conjunction of all the conditions for the branches from the root to a leaf constitutes one of the conditions for the class associated with the leaf.
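The traversal just described can be sketched in a few lines of Python. This is an illustrative sketch only: the `Node` class, the `classify` function, and the record field names `salary` and `education` are hypothetical names chosen here, and the tree mirrors the description of FIG. 2 given above.

```python
class Node:
    """Decision-tree node: internal nodes carry a test, leaves carry a label."""

    def __init__(self, test=None, yes=None, no=None, label=None):
        self.test = test    # function record -> bool; None for a leaf
        self.yes = yes      # child visited when the test is true
        self.no = no        # child visited when the test is false
        self.label = label  # class label, set only on leaves


def classify(node, record):
    """Follow the branch selected by each test until a leaf is reached."""
    while node.test is not None:
        node = node.yes if node.test(record) else node.no
    return node.label


# Tree as described for FIG. 2 (field names are assumptions of this sketch).
node_30 = Node(test=lambda r: r["education"] == "graduate",
               yes=Node(label="accept"),   # leaf 40
               no=Node(label="reject"))    # leaf 50
root_10 = Node(test=lambda r: r["salary"] < 20_000,
               yes=node_30,
               no=Node(label="accept"))    # leaf 20
```

For instance, `classify(root_10, {"salary": 15_000, "education": "graduate"})` follows the "yes" branch at node 10 and the "yes" branch at node 30, ending at leaf 40.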
An example of an algorithm for building a prior art decision tree is shown in FIG. 3. The root node of the tree is initialized in Step 1, while in Step 2, the queue Q (which keeps track of nodes that still need to be split) is initialized to contain the root node. In the "while" loop spanning steps 3 to 11, nodes in Q are recursively split until there are no further nodes remaining to be split (that is, until Q becomes empty). For each node that can be split, the attribute and split that results in minimum entropy (i.e., the split that can be described in the least number of bits) is determined in steps 6 and 7. The children of the split node are then appended to Q to be split further in Step 9. The algorithm of FIG. 3 is one example of a program for building a decision tree; decision-tree building is a known process and one of ordinary skill in the art could develop numerous other examples of programs for building a decision tree.
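A minimal sketch of such a queue-driven build loop follows. It is not the algorithm of FIG. 3 itself: it handles numeric attributes only, rescans the records at every node instead of maintaining attribute lists, and all names (`build_tree`, `best_split`, the dictionary-based node layout) are assumptions made for illustration.

```python
import math
from collections import Counter, deque


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_split(records):
    """Return (entropy, attribute index, threshold) of the best test A < v, or None."""
    best, n = None, len(records)
    for a in range(len(records[0][0])):
        for v in sorted({x[a] for x, _ in records})[1:]:  # candidate thresholds
            left = [y for x, y in records if x[a] < v]
            right = [y for x, y in records if x[a] >= v]
            e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            if best is None or e < best[0]:
                best = (e, a, v)
    return best


def build_tree(records):
    """records: list of (numeric attribute tuple, class label)."""
    root = {"records": records}
    queue = deque([root])                  # nodes that still need to be split
    while queue:
        node = queue.popleft()
        recs = node.pop("records")
        labels = [y for _, y in recs]
        split = None if len(set(labels)) == 1 else best_split(recs)
        if split is None:                  # pure or unsplittable: make a leaf
            node["label"] = Counter(labels).most_common(1)[0][0]
            continue
        _, a, v = split
        node["attr"], node["threshold"] = a, v
        node["left"] = {"records": [r for r in recs if r[0][a] < v]}
        node["right"] = {"records": [r for r in recs if r[0][a] >= v]}
        queue.extend([node["left"], node["right"]])  # children are split later
    return root
```

Using a first-in, first-out queue means nodes are processed level by level, which matches a breadth-first construction.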
The tree is built breadth-first by recursively partitioning the data until each partition is pure, that is, until the classes of all records in the node are identical. The splitting condition for partitioning the data is either of the form A < v if A is a numeric attribute (v is a value in the domain of A) or A ∈ V if A is a categorical attribute (V is a set of values from A's domain). For example, in the loan application example described with respect to FIGS. 1 and 2, at the first node, the test is "is the applicant's salary less than $20,000?" If we assume that the first applicant has a salary of $15,000, then A=$15,000 and v=$20,000. Thus, at root node 10, the condition A < v (i.e., is $15,000 < $20,000) yields a YES; thus, the attributes of this first applicant are passed on to the left branch (internal node 30) where an additional test takes place. If the condition A < v resulted in a NO answer, the attributes of this applicant would have been passed to the right branch and leaf 20 would have been formed, classifying this applicant in the class of applicants whose loan request is accepted.
Likewise, in the event of a categorical attribute (e.g., education level), if the attribute A of the first applicant is in the set V (e.g., if the applicant in the loan application example does have at least a graduate level education), then a YES result is achieved and the attributes of this applicant are passed on to leaf 40, which classifies the applicant in the class of those who have not met the salary minimum but have met the education requirement. It is noted that a leaf is formed at node 40 because there are no further conditions that need to be met. If, on the other hand, at node 30 it is determined that the applicant does not possess at least a graduate education (i.e., the test "is A in the set V?" yields a NO response), then a leaf at node 50 is established, which is where all applicants who have met neither the salary requirement nor the education requirement are classified. Thus, each split is binary.
Each node of the decision tree maintains a separate list for every attribute. Each attribute list contains a single entry for every record in the partition for the node. The attribute list entry for a record contains three fields--the value for the attribute in the record, the class label for the record, and the record identifier. Attribute lists for the root node are constructed at the start using the input data, while for other nodes they are derived from their parent's attribute lists when the parent nodes are split. Attribute lists for numeric attributes at the root node are sorted initially and this sort order is preserved for other nodes by the splitting procedure. Also a histogram is maintained at each node that captures the class distribution of the records at the node. Thus, the initialization of the root node in Step 1 of the build algorithm of FIG. 3 involves (1) constructing the attribute lists, (2) sorting the attribute lists for numeric attributes, and (3) constructing the histogram for the class distribution.
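The root-node initialization described above might look as follows in Python. The function name `init_root`, the use of the record's list index as its identifier, and the sample records are all assumptions of this sketch (the contents of FIG. 1 are not reproduced here).

```python
from collections import Counter


def init_root(records, numeric_attrs, categorical_attrs):
    """Build attribute lists and the class histogram for the root node.

    records: list of (attribute dict, class label); the list index serves
    as the record identifier in this sketch.
    """
    attr_lists = {}
    for a in numeric_attrs + categorical_attrs:
        # One (value, class label, record id) entry per record.
        entries = [(attrs[a], label, rid)
                   for rid, (attrs, label) in enumerate(records)]
        if a in numeric_attrs:
            entries.sort(key=lambda e: e[0])   # numeric lists are sorted once
        attr_lists[a] = entries
    histogram = Counter(label for _, label in records)
    return attr_lists, histogram


# Hypothetical training records in the spirit of the loan example.
sample = [
    ({"salary": 40000, "education": "high-school"}, "accept"),
    ({"salary": 15000, "education": "undergraduate"}, "reject"),
    ({"salary": 18000, "education": "graduate"}, "accept"),
]
lists, hist = init_root(sample, ["salary"], ["education"])
```

Only the numeric list is sorted; the categorical list keeps the input order.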
For a set of records S, the entropy E(S) equals:

$$E(S) = -\sum_j p_j \log p_j$$

where $p_j$ is the relative frequency of class j in S. Thus, the more homogeneous a set is with respect to the classes of records in the set, the lower is its entropy. The entropy of a split that divides S with n records into sets $S_1$ with $n_1$ records and $S_2$ with $n_2$ records is

$$E(S_1, S_2) = \frac{n_1}{n} E(S_1) + \frac{n_2}{n} E(S_2)$$
Consequently, the split with the least entropy best separates classes, and is thus chosen as the best split for a node.
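As a worked illustration, assume base-2 logarithms (so that entropy is measured in bits), the usual weighted-average form of the split entropy, and a hypothetical set S of four records, two labeled "accept" and two labeled "reject":

$$E(S) = -\left(\tfrac{2}{4}\log_2\tfrac{2}{4} + \tfrac{2}{4}\log_2\tfrac{2}{4}\right) = 1$$

A split that places both "reject" records in $S_1$ and both "accept" records in $S_2$ gives two pure partitions:

$$E(S_1, S_2) = \tfrac{2}{4}\cdot 0 + \tfrac{2}{4}\cdot 0 = 0$$

A split that places one "reject" record in $S_1$ and the remaining three records in $S_2$ gives:

$$E(S_1, S_2) = \tfrac{1}{4}\cdot 0 + \tfrac{3}{4}\left(-\tfrac{1}{3}\log_2\tfrac{1}{3} - \tfrac{2}{3}\log_2\tfrac{2}{3}\right) \approx 0.689$$

The first split has the lower entropy and would therefore be preferred.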
To compute the best split point for a numeric attribute, the (sorted) attribute list is scanned from the beginning and, for each split point, the class distribution in the two partitions is determined using the class histogram for the node. The entropy for each split point can thus be efficiently computed since the lists are stored in a sorted order. For categorical attributes, the attribute list is scanned to first construct a histogram containing the class distribution for each value of the attribute. This histogram is then utilized to compute the entropy for each split point.
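The single-pass scan for a numeric attribute can be sketched as below. The function names, the choice of the next distinct value as the threshold v in the test A < v, and the base-2 logarithm are assumptions of this sketch.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Entropy in bits of a class histogram `counts` covering n records."""
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)


def best_numeric_split(attr_list, node_hist):
    """Scan a sorted list of (value, label, rid) entries once.

    Returns (split entropy, v) for the best test A < v, or None if the
    attribute has a single distinct value.
    """
    n = sum(node_hist.values())
    left = Counter()             # classes of entries already scanned (A < v)
    right = Counter(node_hist)   # classes of entries not yet scanned (A >= v)
    best = None
    for i, (value, label, _rid) in enumerate(attr_list[:-1]):
        left[label] += 1         # move one record from right to left
        right[label] -= 1
        next_value = attr_list[i + 1][0]
        if next_value == value:  # no split point between equal values
            continue
        n_left = i + 1
        e = (n_left / n) * entropy(left, n_left) \
            + ((n - n_left) / n) * entropy(right, n - n_left)
        if best is None or e < best[0]:
            best = (e, next_value)
    return best
```

Because the two histograms are updated incrementally, each candidate split point is evaluated without rescanning the list.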
Once the best split for a node has been found, it is used to split the attribute list for the splitting attribute amongst the two child nodes. Each record identifier along with information about the child node that it is assigned to (left or right) is then inserted into a hash table. The remaining attribute lists are then split using the record identifier stored with each attribute list entry and the information in the hash table. Class distribution histograms for the two child nodes are also computed during this step.
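A minimal sketch of this splitting step, assuming attribute lists held as Python lists of (value, class label, record identifier) tuples and a dictionary standing in for the hash table; the function name and the `goes_left` predicate are hypothetical.

```python
from collections import Counter


def split_attribute_lists(attr_lists, split_attr, goes_left):
    """Divide all attribute lists of a node between its two children.

    attr_lists: {attribute name: [(value, label, rid), ...]}
    goes_left:  predicate applied to the splitting attribute's value
    """
    side = {}                                   # hash table: rid -> "L" / "R"
    for value, _label, rid in attr_lists[split_attr]:
        side[rid] = "L" if goes_left(value) else "R"

    left, right = {}, {}
    for attr, entries in attr_lists.items():
        # One pass in list order keeps numeric lists sorted in both children.
        left[attr] = [e for e in entries if side[e[2]] == "L"]
        right[attr] = [e for e in entries if side[e[2]] == "R"]

    # Class distribution histograms for the two children.
    left_hist = Counter(label for _, label, _ in left[split_attr])
    right_hist = Counter(label for _, label, _ in right[split_attr])
    return left, right, left_hist, right_hist
```

Only the splitting attribute's list is tested against the split condition; every other list is divided purely by record identifier lookup.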
2. Pruning Phase
To prevent overfitting, the MDL principle can be applied to prune the tree built in the growing phase and make it more general.
First, the cost of encoding the data records must be determined. If it is assumed that a set S contains n records, each belonging to one of k classes, with $n_i$ being the number of records with class i, the cost of encoding the classes for the n records is given by the equation:

$$\log \binom{n+k-1}{k-1} + \log \frac{n!}{n_1! \cdots n_k!}$$
In the above equation, the first term is the number of bits to specify the class distribution, that is, the number of records with classes 1 . . . k. The second term is the number of bits required to encode the class for each record once it is known that there are $n_i$ records with class label i. The above equation is not very accurate when some of the $n_i$ are either close to zero or close to n. Thus, a better equation for determining the cost C(S) of encoding the classes for the records in set S is:

$$C(S) = \sum_i n_i \log \frac{n}{n_i} + \frac{k-1}{2} \log \frac{n}{2} + \log \frac{\pi^{k/2}}{\Gamma(k/2)} \qquad (1)$$
In Equation 1, the first term is simply n*E(S), where E(S) is the entropy of the set S of records. Also, since k ≤ n, the sum of the last two terms is always non-negative. This property is used in the present invention when computing a lower bound on the cost of encoding the records in a leaf, as discussed more fully below.
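The following sketch evaluates a cost of this kind numerically. It assumes that Equation 1 takes the form commonly used in MDL-based pruning, C(S) = Σ n_i log(n/n_i) + ((k-1)/2) log(n/2) + log(π^(k/2) / Γ(k/2)), with base-2 logarithms; both the formula and the function name `encoding_cost` are assumptions of this example rather than text taken from the equation itself.

```python
import math


def encoding_cost(class_counts):
    """Cost in bits of encoding the classes of the records in a set S.

    class_counts: one count n_i per class (k = number of entries),
    so classes with no records in S are passed as 0.
    """
    n = sum(class_counts)
    k = len(class_counts)
    # First term: equals n * E(S); empty classes contribute nothing.
    entropy_term = sum(c * math.log2(n / c) for c in class_counts if c > 0)
    # Last two terms: non-negative whenever k <= n.
    model_term = (k - 1) / 2 * math.log2(n / 2) \
        + math.log2(math.pi ** (k / 2) / math.gamma(k / 2))
    return entropy_term + model_term
```

For a set with five records in each of two classes, the first term is 10 bits (n = 10, entropy of 1 bit per record), and the remaining terms add a little under 3 bits.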
Next, the cost of encoding the tree must be determined. The cost of encoding the tree comprises three separate costs:
1. The cost of encoding the structure of the tree;
2. The cost of encoding, for each split, the attribute and the value for the split; and
3. The cost of encoding the classes of data records in each leaf of the tree.
The structure of the tree can be encoded by using a single bit in order to specify whether a node of the tree is an internal node (1) or leaf (0). Thus, assuming that the left branches are "grown" first, the bit string 11000 encodes the tree in FIG. 2.
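The bit string can be reproduced with a short pre-order traversal. In this sketch a tree is represented as nested tuples, an assumption made purely for illustration: an internal node is a `(left, right)` pair and anything else is a leaf.

```python
def encode_structure(tree):
    """Pre-order structure bits: 1 for an internal node, 0 for a leaf."""
    if isinstance(tree, tuple):       # internal node: (left subtree, right subtree)
        left, right = tree
        return "1" + encode_structure(left) + encode_structure(right)
    return "0"                        # any non-tuple value is treated as a leaf


# FIG. 2 as described: root 10 -> (node 30 -> (leaf 40, leaf 50), leaf 20)
fig2 = (("accept", "reject"), "accept")
bits = encode_structure(fig2)
```

Visiting the left subtree first yields 1 (node 10), 1 (node 30), 0 (leaf 40), 0 (leaf 50), 0 (leaf 20).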
The cost of encoding each split involves specifying the attribute that is used to split the node and the value for the attribute. The splitting attribute can be encoded using $\log a$ bits (since there are a attributes), while specifying the value depends on whether the attribute is categorical or numeric. Let v be the number of distinct values for the splitting attribute in records at the node. If the splitting attribute is numeric, then since there are v-1 different points at which the node can be split, $\log(v-1)$ bits are needed to encode the split point. On the other hand, for a categorical attribute, there are $2^v$ different subsets of values, of which the empty set and the set containing all the values are not candidates for splitting. Thus, the cost of the split is $\log(2^v - 2)$. For an internal node N, the cost of describing the split is denoted $C_{split}(N)$. The cost of encoding the data records in each leaf is as described in Equation (1).
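A small sketch of the split cost, assuming base-2 logarithms and at least two distinct values at the node; the function and parameter names are hypothetical.

```python
import math


def split_cost(num_attributes, distinct_values, categorical):
    """Bits needed to describe a split: the attribute plus the split value.

    Assumes distinct_values >= 2, since a node with a single value for the
    attribute cannot be split on it.
    """
    attr_bits = math.log2(num_attributes)
    if categorical:
        # 2**v subsets, minus the empty set and the full set.
        value_bits = math.log2(2 ** distinct_values - 2)
    else:
        # v - 1 possible split points between v sorted distinct values.
        value_bits = math.log2(distinct_values - 1)
    return attr_bits + value_bits
```

With two attributes, a numeric split over five distinct values costs 1 + log2(4) = 3 bits, while a categorical split over three distinct values costs 1 + log2(6) bits.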
The simple recursive algorithm in FIG. 4 computes the minimum cost subtree rooted at an arbitrary node N and returns its cost. Let S be the set of records associated with N. If N is a leaf, then the minimum cost subtree rooted at N is simply N itself. Furthermore, the cost of the cheapest subtree rooted at N is C(S)+1 (since 1 bit is required in order to specify that the node is a leaf).
If N is an internal node in the tree with children $N_1$ and $N_2$, then there are the following two choices for the minimum cost subtree: (1) the node N itself with no children (this corresponds to pruning its two children from the tree, thus making node N a leaf), or (2) node N along with children $N_1$ and $N_2$ and the minimum cost subtrees rooted at $N_1$ and $N_2$. Of the two choices, the one with the lower cost results in the minimum cost subtree for N.
The cost for choice (1) is C(S)+1. In order to compute the cost for choice (2), in Steps 2 and 3 of FIG. 4, the procedure recursively invokes itself in order to compute the minimum cost subtrees for its two children. The cost for choice (2) is then $C_{split}(N) + 1 + \mathit{minCost}_1 + \mathit{minCost}_2$. Thus, the cost of the cheapest subtree rooted at N is given by $\mathit{minCost}_N$ as computed in Step 4 of FIG. 4. Note that if choice (1) has a smaller cost, then the children of node N must be pruned from the tree. Stated alternatively, the children of a node N are pruned if the cost of directly encoding the data records at N does not exceed the cost of encoding the minimum cost subtree rooted at N.
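The recursion can be sketched as follows. This is an illustrative reading of the description above, not a transcription of FIG. 4: the dictionary-based node layout and the function name are assumptions, and the two cost functions are passed in as parameters so that any choice of C(S) and split cost can be used.

```python
def compute_cost_and_prune(node, leaf_cost, split_cost):
    """Return the cost of the cheapest subtree rooted at `node`, pruning in place.

    node:       dict with "records" and, for internal nodes, "left" and "right"
    leaf_cost:  function records -> C(S)
    split_cost: function node -> cost of describing the node's split
    """
    cost_as_leaf = leaf_cost(node["records"]) + 1      # choice (1)
    if "left" not in node:                             # already a leaf
        return cost_as_leaf
    min_cost_1 = compute_cost_and_prune(node["left"], leaf_cost, split_cost)
    min_cost_2 = compute_cost_and_prune(node["right"], leaf_cost, split_cost)
    cost_with_children = split_cost(node) + 1 + min_cost_1 + min_cost_2  # choice (2)
    if cost_as_leaf <= cost_with_children:
        del node["left"], node["right"]                # prune both children
        return cost_as_leaf
    return cost_with_children


# Toy cost functions, hypothetical and for demonstration only.
toy_leaf_cost = lambda recs: 2.0 * min(recs.count("a"), recs.count("b"))
toy_split_cost = lambda node: 3.0

small = {"records": ["a", "a", "a", "b"],
         "left": {"records": ["a", "a", "a"]},
         "right": {"records": ["b"]}}
large = {"records": ["a"] * 5 + ["b"] * 5,
         "left": {"records": ["a"] * 5},
         "right": {"records": ["b"] * 5}}
small_cost = compute_cost_and_prune(small, toy_leaf_cost, toy_split_cost)
large_cost = compute_cost_and_prune(large, toy_leaf_cost, toy_split_cost)
```

Under these toy costs the split in `small` is not worth its description cost and is pruned, whereas the split in `large` separates enough records to be kept.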