Data mining encompasses a broad collection of computer-intensive methods for discovering patterns in data for a variety of purposes, including classification, accurate prediction, and gaining insight into a causal process. Typically, data mining requires historical data in the form of a table consisting of rows (also known as instances, records, cases, or observations) and columns (also known as attributes, variables, or predictors). One of the attributes in the historical data is nominated as a “target” and the process of data mining allows the target to be predicted. A predictive model is created using data, often called the “training” or “learning” data.
Data mining produces models that can be used to predict the target for new, unseen, or future data (i.e., data other than the training data) with satisfactory accuracy, as well as models that help decision makers and researchers understand the causal processes generating the target. When a database attribute records membership in a class, such as “good vs. bad” or “A vs. B vs. C”, it is known as a nominal attribute, and when the objective of the data mining is to be able to predict this class (the “class label”), the activity is known as classification. When the target attribute of the data mining process records a quantity such as the monetary value of a default, or the amount charged to a credit card in a month, the modeling activity is called regression.
In banking, data mining classification models are used to learn the patterns which can help predict if an applicant for a loan is likely to default, make late payments, or repay the loan in full in a timely manner (default vs. slow pay vs. current), and regression models are used to predict quantities such as the balances a credit card holder accumulates in a specified time period. In marketing, classification models are used to predict whether a household is likely to respond to an offer of a new product or service (responder vs. non-responder), and regression models are used to predict how much a person will spend in a specific product category. Models are learned from historical data, such as prior bank records where the target attribute has already been observed, and the model is then used to make predictions for new instances for which the value of the target has not yet been observed. To be able to support predictions based on the model, the new data must contain at least some of the same predictive attributes found in the historical training data. Data mining can be applied to virtually any type of data once it has been appropriately organized, and it has been used extensively in credit risk analysis; database marketing; fraud detection in insurance claims, loan applications, and credit card transactions; pharmaceutical drug discovery; computer network intrusion detection; and fault analysis in manufacturing.
FIG. 1 illustrates a prior art process for learning a data mining model. In the first step, training data, including the target attribute and potential predictor attributes, is organized 110. Once organized, the training data is provided to modeling algorithms for classification or regression 120. From the training data, the modeling algorithms produce one or more classifiers or predictive regression models 130. After the models have been produced, they are embedded in decision support systems 140 which are used to classify, predict, and help guide decisions. A typical use of the decision support system is illustrated in FIG. 2. Data records that do not contain values for the target attribute 210 are provided to the decision support system 220, which produces predicted values for the target attribute 230. FIG. 3 illustrates an extract from a sample training data set (or table) appropriately organized for data mining. The original data used to create this table may have been stored in a different form in one of many database management systems.
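By way of illustration only, the learn-then-score flow of FIGS. 1 and 2 might be sketched as follows. The sketch assumes a deliberately trivial model (the most common target value for each value of a single predictor); the function names `learn` and `score`, the attribute names, and the sample records are hypothetical and are not taken from the figures.

```python
from collections import Counter, defaultdict

def learn(training_rows, predictor, target):
    """Learn a lookup model: the most common target value per predictor value."""
    counts = defaultdict(Counter)
    for row in training_rows:
        counts[row[predictor]][row[target]] += 1
    # Fallback prediction for predictor values never seen in training.
    default = Counter(row[target] for row in training_rows).most_common(1)[0][0]
    table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return {"predictor": predictor, "table": table, "default": default}

def score(model, new_rows):
    """Predict the target for records that do not contain it."""
    key = model["predictor"]
    return [model["table"].get(row[key], model["default"]) for row in new_rows]

# Organized training data: the target attribute "status" has been observed.
training = [
    {"employment": "salaried", "status": "current"},
    {"employment": "salaried", "status": "current"},
    {"employment": "salaried", "status": "default"},
    {"employment": "self-employed", "status": "default"},
    {"employment": "self-employed", "status": "default"},
]
model = learn(training, predictor="employment", target="status")

# New records: no value for the target attribute.
new_records = [
    {"employment": "salaried"},
    {"employment": "self-employed"},
    {"employment": "student"},
]
predictions = score(model, new_records)
print(predictions)
```

The third record carries a predictor value absent from the training data, so the sketch falls back to the overall most common target value.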
Decision trees are one of the major data mining methods and they are prized for their interpretability and comprehensibility. Typically, decision trees are built by recursive partitioning, which begins by incorporating the entire training data set into a starting or root node. Once incorporated, an attempt is made to partition this node into at least two mutually exclusive and collectively exhaustive sub-nodes (child nodes) using a single attribute X (the splitter), with the goal of separating instances with different target attribute values into different child nodes. A child node contains all instances corresponding to a given region of values of the splitter. A region is a layer, layers or parts of layers of a decision tree. For continuous attributes, such regions are contiguous and defined by upper and lower bounds, for example, L<=X1<U, which defines a region in which the attribute X1 is greater than or equal to L and strictly less than U. For nominal attributes the region is defined by a list of attribute values (for example, a region could be defined by {X2=“AA” or X2=“BB” or X2=“CC”}). Typically, an exhaustive search is made over all possible ways in which an attribute can be used to split the node and each partition is evaluated using some goodness of split criterion, such as the Gini index, entropy, or a statistical measure such as an F-statistic or chi-squared statistic.
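As a minimal sketch of such a search, the following assumes a single continuous attribute, binary splits of the form X < t, and the Gini impurity (one minus the sum of squared class proportions) as the goodness of split criterion. The function names, the use of midpoints between consecutive distinct values as candidate thresholds, and the sample data are assumptions made for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Exhaustively search binary splits X < t on one continuous attribute.

    Returns (threshold, weighted_impurity). The threshold is None when no
    candidate split lowers the impurity of the unsplit node.
    """
    n = len(values)
    best_t, best_score = None, gini(labels)
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        # Impurity of the partition, weighted by child node size.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

ages = [22, 25, 31, 38, 45, 52]
outcome = ["bad", "bad", "bad", "good", "good", "good"]
threshold, impurity = best_threshold(ages, outcome)
print(threshold, impurity)
```

In this sample the split at 34.5 separates the two classes completely, so the weighted impurity of the resulting partition is zero.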
The best split for a given attribute, as evaluated on the goodness of split measure, is saved for future reference. The search for the best split is repeated for every attribute and the attribute yielding the best overall partition is declared the splitter of the node. The data is then partitioned in accordance with the best split. Some decision trees, such as CART®, split nodes into no more than two child nodes, whereas other decision trees, such as CHAID and C4.5, permit a node to be split into more than two child nodes. Once the root node has been split, the splitting process is repeated separately in each child node, so that the child nodes become parents producing “grandchildren”. The process is repeated again and again until a termination criterion is met. For some decision trees, the termination criterion is a statistical “stopping rule”, whereas for others, such as CART®, the splitting process is stopped only when it is not possible to continue, for example, due to running out of data to split, or when it is impractical to continue due to resource limitations such as computing time or disk storage. It is impossible to split a node containing only one instance, or a node all of whose instances have the same value for the target. It may be impractical to split a node containing a small number of instances, for example, fewer than 10. Once a tree has been grown, it may be subjected to a “pruning” process in which some splits at the bottom of the tree are removed to yield a smaller tree containing fewer nodes. Pruning may be applied repeatedly, progressively making the tree smaller, and may be continued until the entire tree is pruned away. The purpose of pruning can be to produce a tree which performs satisfactorily on unseen data or to produce an interpretable tree which can be used as a component of a decision support system.
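The recursive growing process described above might be sketched as follows. This is an illustrative sketch rather than the procedure of any particular product: it assumes numeric attributes, binary splits, the Gini impurity as the goodness of split measure, and a hypothetical minimum node size of 2 (chosen only because the sample is tiny). Pruning is not shown.

```python
from collections import Counter

MIN_NODE_SIZE = 2  # hypothetical: nodes smaller than this are not split

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def find_best_split(rows, labels):
    """Search every attribute and threshold; return (attr, threshold) or None."""
    best, best_score = None, gini(labels)
    for attr in rows[0]:
        distinct = sorted({r[attr] for r in rows})
        for lo, hi in zip(distinct, distinct[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[attr] < t]
            right = [y for r, y in zip(rows, labels) if r[attr] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:  # keep the best split seen so far
                best, best_score = (attr, t), score
    return best

def grow(rows, labels):
    """Recursively partition the data until no node can be split further."""
    majority = Counter(labels).most_common(1)[0][0]
    # Termination: too few instances, or every instance has the same target.
    if len(rows) < MIN_NODE_SIZE or len(set(labels)) == 1:
        return {"predict": majority}
    split = find_best_split(rows, labels)
    if split is None:  # no split improves on the unsplit node
        return {"predict": majority}
    attr, t = split
    left = [i for i, r in enumerate(rows) if r[attr] < t]
    right = [i for i, r in enumerate(rows) if r[attr] >= t]
    return {
        "attr": attr,
        "threshold": t,
        "left": grow([rows[i] for i in left], [labels[i] for i in left]),
        "right": grow([rows[i] for i in right], [labels[i] for i in right]),
    }

rows = [
    {"age": 25, "income": 20}, {"age": 50, "income": 25},
    {"age": 30, "income": 30}, {"age": 55, "income": 35},
    {"age": 28, "income": 60}, {"age": 35, "income": 80},
    {"age": 22, "income": 70},
]
labels = ["bad", "good", "bad", "good", "good", "good", "good"]
tree = grow(rows, labels)
print(tree)
```

On this sample the root is split on income, the low-income child is split again on age, and the high-income child becomes a terminal node because all of its instances share the same target value.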
Turning to FIG. 4, the process by which a prior art decision tree is grown is illustrated. First, appropriate data is made available, including the identification of the target attribute and the eligible splitter attributes, and the current region of the tree is set to zero 410. Once the appropriate data is made available, a search of all available attributes to find the best splitter for every node at the current region is performed 420. If any node at the current region is splittable 430, then the data in such nodes are partitioned, and the region of the tree is incremented by 1 440. If there are no nodes that are splittable at the current region, then the tree growing process terminates 450. The tree generated by this growing process may or may not be the final tree. Some decision tree methods follow the tree growing process by a tree pruning process (for example, C4.5), and some follow the tree growing process by tree pruning, tree testing and selection processes (for example, CART®).
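The control flow of FIG. 4 might be sketched as a loop over regions, here treated as a simple layer index. To keep the sketch short, the search for the best splitter at step 420 is replaced by a stand-in, `split_at_mean`, which is hypothetical and is not a goodness of split search; the comments map each line to the reference numerals above.

```python
def split_at_mean(rows):
    """Hypothetical stand-in for step 420: split a mixed node at the mean of x.

    Each row is an (x, label) pair. Returns (left_rows, right_rows), or None
    when the node is not splittable.
    """
    if len(rows) < 2 or len({label for _, label in rows}) == 1:
        return None
    mean = sum(x for x, _ in rows) / len(rows)
    left = [r for r in rows if r[0] < mean]
    right = [r for r in rows if r[0] >= mean]
    if not left or not right:
        return None
    return left, right

def grow_by_region(root_rows, find_split):
    """Grow a tree one region (layer) at a time.

    Returns a list whose entry k holds the nodes created at region k.
    """
    region = 0                                # 410: current region set to zero
    regions = [[root_rows]]
    while True:
        children = []
        for node in regions[region]:          # 420: search every node
            split = find_split(node)
            if split is not None:             # 430: this node is splittable
                children.extend(split)        # 440: partition its data
        if not children:                      # 450: nothing splittable, stop
            break
        regions.append(children)
        region += 1                           # 440: increment the region by 1
    return regions

data = [(1, "A"), (2, "A"), (3, "B"), (4, "A"), (10, "B"), (12, "B")]
levels = grow_by_region(data, split_at_mean)
print([len(level) for level in levels])
```

Nodes that cannot be split simply produce no children, so each entry of the returned list contains only the nodes newly created at that region.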
The final decision tree, whether produced by growing only, or by growing and pruning, has the form of a flow chart or decision diagram, as illustrated in FIG. 5. The “root node” appears at the top of the diagram 510 and is the starting point from which the diagram is read. A determination of which child node to proceed to is based on the result of a logical test. In the example shown in FIG. 5, each logical test admits of a “yes” or “no” answer, but tests allowing branches to more than two child nodes are permitted in some decision trees. A record moving to the right arrives at a terminal node at which a prediction or classification is made by the tree 530. A record moving to the left is subjected to another logical test 520 leading either to a terminal node on the left 550 or to another logical test 540. The final logical test leads to the terminal nodes 560 and 570. The decision tree thus reflects a set of ordered logical tests. The terminal node 570 specifies that if a record satisfies the root node condition on attribute 1, and does not satisfy the condition 520 on attribute 2, and does not satisfy the condition on attribute 3 at 540, then a specific prediction will be made. For regression trees, these predictions will be real numbers, and for classification trees, the predictions may be either class labels (for example, “this record is a responder”) or a set of probabilities (for example, “this record is a responder with probability p and a non-responder with probability q”), or both class labels and probabilities. The logic displayed in the decision tree is often taken to reveal the underlying causal process by which the outcome is produced.
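Read as code, a tree of the kind shown in FIG. 5 amounts to a sequence of ordered logical tests. The sketch below follows the branching described above (a record failing the root test goes right to node 530, and so on); the attribute names, thresholds, and class labels are hypothetical placeholders, since the actual tests of FIG. 5 are not reproduced in the text.

```python
def classify(record):
    """Apply the ordered tests of a FIG. 5-style tree; return (node, label)."""
    if not record["attribute1"] < 50:   # root test 510 answered "no": go right
        return 530, "non-responder"
    if record["attribute2"] < 10:       # test 520 answered "yes": go left
        return 550, "responder"
    if record["attribute3"] < 3:        # test 540 answered "yes"
        return 560, "non-responder"
    # 510 satisfied, 520 not satisfied, 540 not satisfied.
    return 570, "responder"

record = {"attribute1": 20, "attribute2": 30, "attribute3": 5}
print(classify(record))
```

A classification tree that reports probabilities rather than labels would return, for each terminal node, the class proportions observed there instead of a single label.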
The predictive model produced by a prior art decision tree may be unsatisfactory for a variety of reasons. First, the decision tree may use splitters in an order which appears to be illogical or contrary to the causal order in which factors are believed to operate. Second, the decision tree may be difficult to understand and interpret because causal factors of different types are mixed in what appears to be an arbitrary order. Third, the decision tree may appear to reflect a decision logic that is in conflict with accepted scientific belief or the convictions of experienced decision makers. What is needed is a constrained or structured tree method that controls which attributes may be used as splitters and that specifies the conditions under which an attribute is allowed to act as a splitter, surrogate, or alternate splitter.