The present invention relates to computer techniques for developing binary decision trees from a training database, such decision trees used for classifying records according to probabilities derived from the training database. Specifically, the present invention provides a way of preparing or updating binary decision trees from very large training databases held in slow memory such as disk drives, the method reducing the necessary access to the slow memory.
Referring to FIG. 1, a large training database 10 has records 12 including a record identifier 14, record attributes 16, and a classification 18. For example, the record identifier 14 may be the name of a customer and the attributes may be the customer""s AGE, INCOME, and number of CHILDREN. The classification 18 may be, for example, whether the customer responded to a promotional coupon for children""s toys.
Desirably, the classification 18 could be determined for existing customers in a unclassified data 26 whose attributes 16 are known but who have not yet responded to the promotional coupon and thus cannot be classified. xe2x80x9cData miningxe2x80x9d seeks to establish a predictive classification of records based on the record""s attributes 16.
Referring to FIG. 2, the classification of records from their attributes may be accomplished by preparing a binary decision tree 24 from the training database 20 using any of a number of tree constructors 22 executed on an electronic computer as are well known in the art. The binary decision tree 24 is then used to sort the unclassified data 26 to produce as results 32 the appropriate classification.
Referring to FIG. 3, the binary decision tree 24 follows general tree topology including a root node 28a (shown at the top of FIG. 3), a number of intermediate nodes 28, and leaf nodes 30 (shown at the bottom of FIG. 3). Each intermediate node 28 is assigned to a particular attribute 16 and a split point in the domain of the attribute 16 which defines how records are to be sorted or passed to the nodes below. Each leaf node 30 is assigned to a particular classification.
The unclassified data 26 are sorted by comparing their attributes and the values of those attributes against the attributes and split points of each node starting at root node 28a and then passing the record according to that split point to the next lower node 28b. Thus, for example, the root node 28a may relate to the AGE attribute and have a splitting of AGE 30 (and a xe2x80x9csplitting predicatexe2x80x9d) that AGE must be less than or equal to 30). The records 12 of FIG. 1 are thus sorted at the root node 28a so that if their AGE attribute 16 has a value of less than 30, the record 12 proceeds down the right branch of the tree from root node 28a, but if the AGE attribute has a value greater than 30, the record 12 proceeds down the left branch of the tree from root node 28a. The branches from node 28a lead to additional nodes 28b and 28c, each also having an attribute and a splitting predicate and this process is repeated until the records arrive at a leaf node 30 where a category may be assigned. Note that the attributes for 28b and 28c need not be the same and in this case are AGE and INCOME, respectively.
The attributes 16 need not be numerical but may be categorical, for example, male or female, in which case the splitting predicate is a subset of the attributes"" domain.
Referring to FIG. 4, the tree constructor 22 which creates the binary decision tree 24 from the training database 20 may operate according to a number of well known algorithms to determine the attributes, their order within the binary decision tree 24, and the appropriate splitting predicates. A general model of a tree constructor 22 includes a sorter 35 receiving the records 12 and at each node 28 dividing them into left and right groups 38 and 40 according to a trial splitting predicate 36. The left and right groups 38 and 40 are provided to a goodness evaluator 42 which determines how effective the trial splitting predicate 36 is according to some predetermined criteria related to the classifications of the records of the left and right groups 38 and 40, for example, an impurity function.
The trial splitting predicate 36 is adjusted appropriately based on this determination and the records 12 reviewed again for evaluation. Ultimately, after possibly many reviews of the records, final splitting predicate 45 is produced (being an attribute, split point and relationship) for the node 28 and the process is repeated for other nodes 28. A goodness value 43 may be derived for each splitting predicate 45.
While particular tree construction algorithms vary, it can be seen that this process of determining splitting predicates 45 requires repeated access of the records 12. For large databases where the records 12 are held in relatively slow electronic memories, such as magnetic disk drives, constructing the binary decision tree 24 may be prohibitively time consuming. Even in cases where this investment in time is warranted for an initial generation of a binary decision tree 24, the time investment may discourage frequent updating of the binary decision tree 24 as additional data comes in.
One solution to the problem of slow memory access is to place the training database 20 in a high-speed memory such as those principally constructed of solid state transistors also known as random access memory (RAM). Such memories will be termed herein xe2x80x9chigh-accessxe2x80x9d memories distinguishing them from disk drives and other similar mass storage devices (xe2x80x9clow accessxe2x80x9d), both in the speed of memory access and in the flexibility of that access (random vs. sequential) which may affect the time required to access the necessary data of the training database 20. These categories are not absolute but reflect the inevitable differences between accessibility and capacity of current and foreseeable memory systems.
Unfortunately, the solution of using high access memory exclusively is not available for many commercially valuable training databases 20 which are too large for this to be practical. What is needed is a method of constructing and updating training databases 20 that overcomes the time limitation inherent in the use of low-access memory.
The present inventors have recognized that a binary decision tree constructed from a small subset of the training database (sized to fit entirely in high access memory) will nevertheless be close to the binary decision tree that would have been constructed with the entire training database. This xe2x80x9csmall-samplexe2x80x9d binary decision tree constructed from the subset may be then used to coordinate an efficient review of the entire training database that reduces accesses to the memory in which it is stored.
Specifically, the present invention provides a method of data mining using a computer system having a first low-access memory holding a training database of a plurality of records having attributes and a second high-access memory smaller than the first memory. A subset of the training database is loaded into the second memory and the computer operates on that subset to prepare an initial binary decision tree having nodes associated with confidence intervals defining ranges of the attributes expected in the final binary decision tree for the entire training database. The entire training database is then read from the first memory against the confidence intervals of the binary decision tree to collect split point statistics related to the location of a split point within the confidence intervals. Using the split point statistics, a split point is assigned to each node.
Thus it is one object of the invention to speed the construction or updating of binary decision trees from large training databases. By using a subset of the training database to develop an initial binary decision tree, access to the first memory is substantially reduced. The initial binary decision tree may provide a confidence interval at each node indicating a probable location of a split point. Records within the confidence interval (needed to determine the exact split point) are small in number and may be stored in high access memory after a single scan of low access memory.
The method may include the step of reviewing the entire training database of the first memory against a plurality of bucket intervals outside the confidence interval.
This it is another object of the invention to collect error statistics which confirm that the confidence interval is correct or to catch those few situations where the initial confidence interval is erroneously selected and thus to provide a truly deterministic method of obtaining a binary decision tree from an arbitrarily large training database.
The binary decision tree may assign classification of records at its leaf nodes and the method may include the further step of applying an unclassified record to the binary decision tree to determine its classification.
Thus it is another object of the invention to provide a binary decision tree useful for data mining such as establishes the classification of records for which classification is not known.
The first memory may be a disk drive and the second memory a solid state memory system and the review of the entire training database may sequentially access each record of the training database only once.
Thus it is another object of the invention to provide a method of building binary decision trees that work efficiently with present day high and low access memory systems.
The initial binary decision tree may be generated by first generating a plurality of binary decision trees from samples of the subset, each binary decision tree having nodes associated with attributes. These multiple binary decision trees may be combined to form the initial binary decision tree by overlaying the multiple binary decision trees and discarding subtrees of the multiple binary decision trees of nodes having different attributes.
Thus it is another object of the invention to provide a method of creating an initial binary decision tree suitable both for records with numerical attributes and categorical attributes recognizing that most attribute databases will be mixed categorical and numerical attributes.
The foregoing and other objects and advantages of the invention will appear from the following description. In the description, reference is made to the accompanying drawings which form a part hereof and in which there is shown by way of illustration a preferred embodiment of the invention. Such embodiment does not necessary represent the full scope of the invention, however, and reference must be made to the claims herein for interpreting the scope of the invention.