Data mining encompasses a broad collection of computer-intensive methods for discovering patterns in data for a variety of purposes, including classification, accurate prediction, and gaining insight into a causal process. Typically, data mining requires historical data in the form of a table consisting of rows, also known as instances, records, cases, or observations, and columns, also known as attributes, variables, or predictors. One of the attributes in the historical data is nominated as a “target,” and the process of data mining allows the target to be predicted. A predictive model is created using data, often called the “training” or “learning” data.
From data mining, models are created that can be used to predict the target for new, unseen, or future data, namely data other than the training or learning data, with satisfactory accuracy. Such models can assist decision makers and researchers to understand the causal processes generating the target. When a database attribute records membership in a class, such as “good vs. bad” or “A vs. B vs. C,” it is known as a nominal attribute, and when the objective of the data mining is to be able to predict this class membership (the “class label”), the activity is known as classification. When the target attribute of the data mining process records a quantity such as the monetary value of a default, or the amount charged to a credit card in a month, the modeling activity is called regression.
Banking is an example of an application of data mining and regression trees. In banking, data mining classification models may be used to learn the patterns that can help predict, for example, whether an applicant for a loan is likely to default, make late payments, or repay the loan in full in a timely manner (default vs. slow pay vs. current). Regression trees may be used to predict quantities such as the balances a credit card holder accumulates in a specified period of time.
In marketing, another example application of data mining and regression trees, classification models may be used to predict whether a household is likely to respond to an offer of a new product or service (responder vs. non-responder). Regression models may be used to predict how much a person will spend in a specific product category. Models are learned from historical data, such as prior bank records where the target attribute has already been observed, and the model may be used to make predictions for new instances for which the value of the target has not yet been observed. To support predictions based on the model, the new data contain at least some of the same predictive attributes found in the historical training data. Data mining can be applied to virtually any type of data that has been appropriately organized. Examples of areas in which data mining has been used include credit risk analysis, database marketing, fraud detection in credit card transactions, insurance claims and loan applications, pharmaceutical drug discovery, computer network intrusion detection, and fault analysis in manufacturing.
The performance of a data mining or predictive model can be measured in a variety of ways. Several standard criteria have become established in the literature. For classification models these criteria include: Cross-Entropy or Log-Likelihood; Area Under the Receiver Operating Characteristic Curve (“ROC”); Classification Accuracy; Lift in a specified percentile of a dataset, such as the top decile; and weighted measures of the costs of misclassification, where different classification mistakes have different consequences, such as when rejecting a reliable borrower is not as serious a mistake as accepting an unreliable borrower. Of these, Classification Error has been the criterion most commonly used in the prior art. Most discussions of optimal tree selection have referenced this criterion because of its simplicity. For regression, the measures include: Sum of Squared Errors, including R-squared and Mean Squared Error; Sum of Absolute Deviations, including Mean and/or Median Absolute Deviation; and the Huber-M class of measures, which are a hybrid of both squared and absolute deviations. For the purposes of this disclosure, the method selected for measuring model performance does not matter. The described methods and systems are applicable using any of these common methods. In general, for decision making purposes, it is the performance of the model on test data or on previously unseen data that is considered definitive.
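For illustration only, several of the criteria named above can be sketched in a few lines of Python. The function names, the 0/1 coding of the target, and the default Huber threshold `delta` are assumptions made for this sketch and are not drawn from any particular library or from this disclosure.

```python
def classification_accuracy(actual, predicted):
    """Fraction of cases whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def top_decile_lift(actual, scores):
    """Response rate among the top-scoring 10% of cases divided by the
    overall response rate. `actual` is assumed to be coded 0/1."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, len(ranked) // 10)
    top_rate = sum(a for _, a in ranked[:n_top]) / n_top
    overall_rate = sum(actual) / len(actual)
    return top_rate / overall_rate


def mean_squared_error(actual, predicted):
    """Average squared deviation between actual and predicted quantities."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_deviation(actual, predicted):
    """Average absolute deviation between actual and predicted quantities."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def huber_loss(actual, predicted, delta=1.0):
    """Mean Huber loss: squared for residuals up to `delta` (an assumed
    threshold), linear beyond it."""
    total = 0.0
    for a, p in zip(actual, predicted):
        r = abs(a - p)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(actual)
```

Any of these functions can be evaluated on training data or on test data; the comparison between the two is what the remainder of this section is concerned with.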
Models that perform substantially better on the training data than on appropriate unseen data are said to be overfit, and a key to building reliable data mining models is finding models that are not overfit. The data mining literature, as well as the statistical literature that preceded it, contains a number of scientific studies addressing the issue of the avoidance of overfitting. The field of decision trees contains certain techniques intended to avoid overfitting.
One older prior art method still in use for decision trees is the stopping rule. Stopping rules are applied during the tree-growing process and are used to decide whether a branch of the tree should be allowed to grow further or whether it should be forced to terminate. When all branches have been forced to terminate, the tree and the model are considered complete. The stopping rule is intended to detect when the evidence in the training data for further elaboration of the tree is so weak that stopping is preferable to further growing of the tree. Weakness may be measured by a statistical criterion, such as a chi-square test. The expectation is that the stopping rule will yield trees that are not overfit.
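A stopping rule of this kind might be sketched as follows, assuming a binary split and a Pearson chi-square test on the class counts in the two candidate child nodes. The function names are hypothetical, and the default critical value of 3.841 (the 5% point of the chi-square distribution with one degree of freedom, appropriate for a two-class target) is an assumption chosen for the example rather than a value taken from any specific prior art system.

```python
def chi_square_statistic(left_counts, right_counts):
    """Pearson chi-square statistic for the 2 x k table formed by the
    per-class case counts in the left and right child nodes."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    if n_left == 0 or n_right == 0:
        return 0.0  # a split that sends every case one way carries no evidence
    stat = 0.0
    for left, right in zip(left_counts, right_counts):
        class_total = left + right
        if class_total == 0:
            continue  # class absent from this node; contributes nothing
        for observed, side_total in ((left, n_left), (right, n_right)):
            expected = class_total * side_total / n
            stat += (observed - expected) ** 2 / expected
    return stat


def should_stop(left_counts, right_counts, critical_value=3.841):
    """Return True when the evidence for the split is too weak to justify
    growing the branch further."""
    return chi_square_statistic(left_counts, right_counts) < critical_value
```

For example, child nodes with class counts `[40, 10]` and `[10, 40]` give a statistic of 36 and the branch continues to grow, whereas counts of `[25, 25]` and `[26, 24]` give a statistic near 0.04 and the branch is terminated.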
Another class of prior art methods to avoid overfitting relies on pruning. Trees are first grown without a stopping rule and thus grown intentionally large and almost certainly overfit. The pruning process is applied after the growing is complete and is intended to prune away the overfit parts of the tree. When optimal trees are selected via training sample pruning, each branch of the tree is pruned to a length that has been determined to be best based on an evaluation criterion calculated using the training data only.
A number of variations on this theme have been proposed in the literature. In one example, the popular C4.5 decision tree uses the upper bound of the confidence interval based on the binomial distribution to derive a pessimistic estimate of the unseen data error rate in each node.
This prior art pruning approach to optimal tree selection proceeds as follows. First, a large, probably overfit decision tree is grown. The pruning process then works backwards from the bottom of the tree. For each parent of a terminal node or nodes, a comparison is made between the pessimistic estimate of the error rate when the parent is allowed to be split and the estimate when the parent node is not split and instead made terminal. A split is retained if it results in a smaller estimated pessimistic error rate than would be obtained if the split were removed. As splits are pruned away, the tree becomes progressively smaller. This process of testing a node to see if it should be allowed to be split is repeated moving upwards through the tree.
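The bottom-up comparison described above can be illustrated with a short sketch. It is an approximation and not a reproduction of C4.5 itself: the upper confidence bound is computed with a normal approximation to the binomial distribution, and the 25% confidence level, the function names, and the `(errors, cases)` pair representation of a node are all assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist


def pessimistic_error_rate(errors, n, confidence=0.25):
    """Upper confidence bound on a node's error rate, using a normal
    approximation to the binomial (an assumption of this sketch)."""
    z = NormalDist().inv_cdf(1 - confidence)  # one-sided normal quantile
    f = errors / n                            # observed training error rate
    numerator = f + z * z / (2 * n) + z * sqrt(f * (1 - f) / n + z * z / (4 * n * n))
    return numerator / (1 + z * z / n)


def keep_split(parent, children, confidence=0.25):
    """Compare the parent treated as a terminal node against the
    case-weighted pessimistic error of its children.

    `parent` and each entry of `children` are (errors, n_cases) pairs
    computed from training data only."""
    n_parent = parent[1]
    as_leaf = pessimistic_error_rate(parent[0], parent[1], confidence)
    as_split = sum(
        (n / n_parent) * pessimistic_error_rate(e, n, confidence)
        for e, n in children
    )
    return as_split < as_leaf  # retain the split only if it lowers the estimate
```

Under these assumptions, a parent with 5 errors in 14 cases whose three children have 2 of 6, 1 of 2, and 2 of 6 errors is pruned back to a terminal node, because the small children receive heavily inflated pessimistic estimates.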
This method does not make use of test data or cross validation, relying instead on internal calculations based on training data. Although the method resembles a node-by-node comparison method, the pruning procedure is intended to yield a tree with the best estimated pessimistic error rate overall. Thus, the method is focused on the reduction of classification error and not on a notion of agreement of specific predictions made within a decision tree node. The same prior art method can be adapted to the case where plentiful test data are available so that reliable estimates of the error in any portion of the tree can be based on test data.
In a 1987 article entitled “Simplifying Decision Trees,” Quinlan briefly described such a method without elaboration, observing simply that test data could be used for node classification accuracy measurement and therefore tree selection. He advised against this approach for two reasons. First, a test data-based tree selection procedure requires more data, which may be a burden. Second, pruning on the basis of test data performance may prune away valuable parts of the tree that happen to receive little test data in a specific test sample. Instead, the method described by Quinlan focused on reducing classification error and not on achieving a match or agreement between train and test data.
In the classical prior art decision tree algorithm of Breiman, Friedman, Olshen, and Stone, known as CART (Classification and Regression Trees), the pruning process is used not to identify an optimal tree but to identify a set of candidate trees of different sizes. As in other pruning procedures, smaller trees are obtained from larger trees by removing selected nodes. For a tree of a given size, the CART cost-complexity pruning process identifies which node or nodes are to be pruned to arrive at a smaller tree.
In many cases a pruning step removes just one split from the bottom of the tree but in some circumstances more than one split may be removed by the specific cost-complexity pruning formula. The process of cost-complexity pruning is intended to effect the smallest useful reduction in tree size in any one pruning step, and the process is repeated until all splits have been pruned away. The cost-complexity pruning process thus defines a collection or sequence of nested trees, each obtained from its predecessor by pruning away some nodes. The optimal tree is then identified as the one tree in this sequence with the best performance on either explicit test data or testing via cross validation.
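The construction of the nested sequence can be sketched as follows, using the standard textbook formulation of cost-complexity pruning in which each internal node t is scored by g(t) = (R(t) - R(T_t)) / (|T_t| - 1), where R(t) is the training cost of the node if made terminal, R(T_t) is the total cost of the terminal nodes below it, and |T_t| is the number of those terminal nodes. The `Node` class, the helper names, and the numeric tolerance for ties are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    cost: float                      # R(t): training cost if this node were terminal
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self):
        return self.left is None


def leaves(node):
    """Number of terminal nodes in the subtree rooted at `node`."""
    return 1 if node.is_leaf() else leaves(node.left) + leaves(node.right)


def subtree_cost(node):
    """Sum of the costs of the terminal nodes below `node`."""
    if node.is_leaf():
        return node.cost
    return subtree_cost(node.left) + subtree_cost(node.right)


def internal_nodes(node):
    """Internal nodes in pre-order, so ancestors precede descendants."""
    if node.is_leaf():
        return []
    return [node] + internal_nodes(node.left) + internal_nodes(node.right)


def link_strength(node):
    """Cost increase per terminal node removed if `node` were made terminal."""
    return (node.cost - subtree_cost(node)) / (leaves(node) - 1)


def pruning_sequence(root):
    """Prune `root` in place, weakest link first, and return the number of
    terminal nodes in each tree of the nested sequence."""
    sizes = [leaves(root)]
    while not root.is_leaf():
        candidates = internal_nodes(root)
        weakest = min(link_strength(t) for t in candidates)
        for t in candidates:
            # Tied weakest links are collapsed in the same step.
            if not t.is_leaf() and link_strength(t) <= weakest + 1e-12:
                t.left = t.right = None
        sizes.append(leaves(root))
    return sizes


# Hypothetical four-leaf tree with made-up costs.
tree = Node(0.40,
            Node(0.15, Node(0.05), Node(0.08)),
            Node(0.20, Node(0.05), Node(0.05)))
sizes = pruning_sequence(tree)
```

With these made-up costs the sequence has 4, 3, and 1 terminal nodes: the second step collapses the root and therefore removes two splits at once. Each tree in such a sequence would then be scored on test data or by cross validation to select the optimal tree.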
Consequently, in the CART decision tree, a test sample or testing via cross validation is required to complete the tree selection process once the pruning has been accomplished. This is in contrast to the C4.5 methodology that uses a pruning criterion that does not refer to test data. The details of the pruning process are not relevant because the methodology described herein is applicable no matter how the tree is pruned or indeed whether the tree is or is not pruned. Other pruning methods based on a measure of model complexity have also been proposed.
Several prior art methods, including the previously described methods, produce decision trees that often perform roughly as expected on unseen data. For example, a tree expected to have a classification accuracy of 75% based on training data might exhibit a roughly similar accuracy of, say, 72% on unseen data. However, decision makers may consider the performance of such apparently successful prior art optimal models unsatisfactory when the individual terminal nodes of the tree are examined.
That is, while the tree as a whole may yield the expected number of correctly classified cases, the number of correctly classified cases in a specific node or nodes of interest may be far from the number expected. Because the model as a whole has a performance that is a weighted average of the performances within the terminal nodes, underperformance in some nodes may be compensated for by overperformance in other nodes of the tree, yielding a performance that is considered satisfactory overall.
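The compensation effect can be made concrete with a small worked example. The node counts below are invented for illustration, and the function names are hypothetical.

```python
def overall_accuracy(nodes):
    """Accuracy of the whole tree from (n_cases, n_correct) pairs, one per
    terminal node. Equivalent to a case-weighted average of node accuracies."""
    return sum(correct for _, correct in nodes) / sum(n for n, _ in nodes)


def node_gaps(train_nodes, test_nodes):
    """Test accuracy minus train accuracy for each terminal node."""
    return [
        test_correct / test_n - train_correct / train_n
        for (train_n, train_correct), (test_n, test_correct)
        in zip(train_nodes, test_nodes)
    ]


# Two terminal nodes, 100 cases each in both samples (made-up counts).
train_nodes = [(100, 90), (100, 60)]
test_nodes = [(100, 70), (100, 78)]

train_overall = overall_accuracy(train_nodes)   # 150 / 200
test_overall = overall_accuracy(test_nodes)     # 148 / 200
gaps = node_gaps(train_nodes, test_nodes)
```

Here the tree as a whole moves only from 75% accuracy on the training data to 74% on the test data, yet the first node falls by 20 percentage points and the second rises by 18.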
Decision makers, however, may insist that every node in the tree achieve certain minimum performance standards and that deficiencies in one node may not be compensated for by excellent performance in other nodes. Alternatively, decision makers often deploy only selected portions of a decision tree. Thus, for practical purposes the performance of selected portions of the tree may be what is relevant rather than the performance of the tree overall. The methods and systems described herein make use of detailed examination of performance in specific nodes of a decision tree. Thus, a tree that is deemed optimal under prior art methods may be deemed far from optimal under the methods and systems described herein.
A specific example can be taken from the field of marketing. Marketing decision makers often search for customer segments that are above average in their propensity to respond to certain product and price offers, commonly known as responder segments. The terminal nodes of a decision tree will yield two vital pieces of information to the decision maker: whether the segment in question is above average, in other words whether the lift in the segment exceeds 1.0, and the expected response rate. A segment that appears to be a responder segment on the training data, having a lift greater than 1.0, but which is shown to be a non-responder segment on test data, having a lift less than 1.0, would be rejected by a decision maker as suspect. A disagreement between the train and test results regarding where a node sits relative to this threshold is of central concern to certain decision makers.
Decision makers may well demand that an acceptable tree cannot exhibit a train/test disagreement regarding the responder or non-responder status of any node. Further, when looking at a market segment defined by a decision tree, marketers would not regard a segment as satisfactory if the discrepancy between the train and test responsiveness in that node is substantial, or if the rank order of that segment differed substantially between train and test data. Prior art methods of optimal tree selection cannot guarantee trees that will meet these train/test consistency requirements.
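The three consistency requirements named above (agreement on responder status, a small discrepancy in response rate, and a stable rank order) can be expressed as a simple per-node check. The sketch below only illustrates the requirements and is not presented as any particular tree selection method. The tolerances `max_rate_gap` and `max_rank_shift`, the function names, and the example counts are assumptions.

```python
def node_lifts(nodes):
    """Lift of each terminal node: node response rate divided by the overall
    response rate. `nodes` holds (n_cases, n_responders) pairs."""
    overall = sum(r for _, r in nodes) / sum(n for n, _ in nodes)
    return [(r / n) / overall for n, r in nodes]


def ranks(values):
    """Rank position of each value, 0 being the largest."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    rank = [0] * len(values)
    for position, index in enumerate(order):
        rank[index] = position
    return rank


def consistency_report(train, test, max_rate_gap=0.05, max_rank_shift=1):
    """Per-node train/test agreement checks under assumed tolerances."""
    train_lift, test_lift = node_lifts(train), node_lifts(test)
    train_rank, test_rank = ranks(train_lift), ranks(test_lift)
    report = []
    for i, ((tr_n, tr_r), (te_n, te_r)) in enumerate(zip(train, test)):
        report.append({
            "node": i,
            # Same side of the lift = 1.0 threshold in both samples?
            "direction_agrees": (train_lift[i] > 1.0) == (test_lift[i] > 1.0),
            # Response rates within the assumed tolerance of each other?
            "rate_gap_ok": abs(tr_r / tr_n - te_r / te_n) <= max_rate_gap,
            # Rank order of the segment roughly preserved?
            "rank_ok": abs(train_rank[i] - test_rank[i]) <= max_rank_shift,
        })
    return report


# Three terminal nodes with made-up counts of (cases, responders).
train = [(100, 30), (100, 12), (100, 3)]
test = [(100, 28), (100, 18), (100, 2)]
report = consistency_report(train, test)
```

In this example the middle node has a lift below 1.0 on the training data but above 1.0 on the test data, so it fails the direction check even though the rank order of the three segments is unchanged.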
Consequently, there is a need for methods and systems that address the aforementioned shortcomings of prior art automatic tree selection methods.