1. Field of the Invention
The present invention generally relates to methods for building predictive models from a given population of data points, and more particularly to a family of methods known as boosting or adaptive sampling, wherein multiple models are constructed and combined in an attempt to improve upon the performance obtained by building a single model. The invention resolves the interpretability problems of previous boosting methods while mitigating the fragmentation problem when applied to decision trees.
2. Background Description
Predictive modeling refers to generating a model from a given set of data points (also called "examples" or "records"), where each point comprises fields (also called "attributes" or "features" or "variables"), some of which are designated as target fields (also called "dependent variables") whose values are to be predicted from the values of the others (also called "independent variables").
There are many known methods of building predictive models from data for many different families of models, such as decision trees, decision lists, linear equations, and neural networks. The points used to build a model are known as training points. The performance of a model is judged by comparing the predicted values with the actual values of target fields in various populations of points. It is important to perform well on "new" or "unseen" points that were not used to train the model. In practical applications, the values of target fields become known at later times than the values of other fields. Actions based on accurate predictions can have many benefits. For example, a retailer can mail promotional material to those customers who are likely to respond favorably, while avoiding the costs of mailing the material to all customers. In this example, customer response is a target field. Other fields might include past responses to similar promotions, other past purchases, or demographic information.
Given some method of building predictive models, one could simply apply the method once to the available training points, or to as large a sample from that population as the underlying method can work with. Boosting refers to a family of general methods that seek to improve the performance obtained from any given underlying method of building predictive models by applying the underlying method more than once and then combining the resulting "weak" models into a single overall model that, although admittedly more complex than any of the "weak" models obtained from the underlying method, may make more accurate predictions. The term "weak", as used in connection with boosting, is a technical term used in the art; a "weak" model is simply a model with imperfect performance that one hopes to improve by somehow combining the "weak" model with other "weak" models built by the same underlying method, but from different samples of the available training points. A model with good performance may still be considered "weak" in this context. Boosting is a process of adaptive resampling that builds a weak model, determines how to choose another training sample by observing the performance of the weak model(s) already built, builds another weak model, and so on.
FIG. 1 illustrates the end result of the common method of boosting. A list of admittedly weak models (1001, 1002, . . . , 1099) is available, and every one of those models is applied to any given point for which a prediction is wanted in 1000. The actual number of weak models could be more or less than the 99 indicated in FIG. 1. Tens or hundreds of weak models are commonly used. Such an ensemble of models comprising the final weighted model is not humanly understandable. To return a single prediction, a weighted average of all the weak predictions is computed in block 1100. As commonly practiced (see, for example, Y. Freund and R. Schapire, "Experiments with a New Boosting Algorithm", Proceedings of the International Machine Learning Conference, pp. 148-156 (1996), or R. Schapire and Y. Singer, "Improved Boosting Algorithms Using Confidence-Rated Predictions", Proceedings of the 11th Annual Conference on Computational Learning Theory (1998)), boosting builds both a list of weak models (1001, 1002, . . . , 1099) and a corresponding list of weights for averaging the predictions of those models, as in step 1100.
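The combination step of block 1100 can be sketched as follows. This is a minimal illustration only, assuming each weak model is a callable that returns a numeric prediction; the names additive_predict, models, and weights are hypothetical and do not appear in the cited references.

```python
from typing import Callable, Sequence

# A weak model is assumed (for illustration) to map a data point to a number.
Model = Callable[[Sequence[float]], float]


def additive_predict(models: Sequence[Model],
                     weights: Sequence[float],
                     point: Sequence[float]) -> float:
    """Return the weighted average of every weak model's prediction."""
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per weak model")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    # Every model in the list is consulted for every point.
    return sum(w * m(point) for m, w in zip(models, weights)) / total


# Three toy weak models that ignore their input.
toy_models = [lambda p: 1.0, lambda p: 0.0, lambda p: 1.0]
toy_weights = [2.0, 1.0, 1.0]
prediction = additive_predict(toy_models, toy_weights, [0.0])
```

Because every model contributes to every prediction, explaining a single output requires inspecting the whole list of models and weights.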
The method of building the lists is a form of adaptive resampling, as illustrated in FIG. 2. Given a population of training points 2001 and a trivial initial probability distribution D_1 2011 where all points are equally probable, any given model-building method yields a model M_1 2012 to serve as the first weak model. If the given model-building method cannot deal directly with a large population of training points together with a probability distribution on those points, then a sample of any convenient size may be drawn from the population 2001 according to the distribution 2011. To determine the weight of the predictions of M_1 and to build the next weak model, the next probability distribution D_2 2021 is computed by observing the weighted average performance of M_1 on the entire population 2001, with the performance for each point weighted by the probability of that point according to the current distribution 2011. The function that computes the weight of M_1 rewards better performance with a higher weight for M_1, while the function that computes the next probability of each point ensures that points where M_1 performs poorly will be more likely to be chosen than points with the same current probability where M_1 performs well.
Given the same population 2001 and the new probability distribution 2021, the same given model-building method yields a model M_2 2022 to serve as the second weak model. The process of observing performance and determining both a weight and a new distribution continues for as long as desired, leading eventually to a final distribution D_99 2991 and then a final weak model M_99 2992. The last step in boosting as commonly practiced is a truncation of the repeated process: observed performance determines the weight of M_99, but the computation of a new distribution is omitted. Boosting as commonly practiced will be called "additive boosting" hereafter.
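The adaptive resampling loop of FIG. 2 can be sketched as follows. The text above does not fix the weight and reweighting functions, so this sketch substitutes the well-known AdaBoost-style update for two-class labels in {-1, +1} as one possible choice. The names boost, build_stump, and vote are hypothetical, and the one-dimensional threshold model stands in for "any given model-building method". For brevity the loop also recomputes the distribution after the final round, which the text notes is normally omitted.

```python
import math


def build_stump(points, labels, dist):
    """Toy underlying method: best weighted threshold test on 1-D points."""
    best = None
    for t in sorted(set(points)):
        for sign in (1, -1):
            err = sum(d for x, y, d in zip(points, labels, dist)
                      if (sign if x >= t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x, t=t, sign=sign: sign if x >= t else -sign


def boost(points, labels, build_model, rounds):
    """Return (weak models, weights) built by adaptive reweighting."""
    n = len(points)
    dist = [1.0 / n] * n                      # D_1: all points equally probable
    models, weights = [], []
    for _ in range(rounds):
        model = build_model(points, labels, dist)
        wrong = [model(x) != y for x, y in zip(points, labels)]
        err = sum(d for d, w in zip(dist, wrong) if w)   # weighted error
        if err >= 0.5:                        # no better than chance: stop
            break
        err = max(err, 1e-10)                 # avoid division by zero
        alpha = 0.5 * math.log((1.0 - err) / err)        # better model, higher weight
        models.append(model)
        weights.append(alpha)
        # Next distribution: raise the probability of poorly predicted points.
        dist = [d * math.exp(alpha if w else -alpha) for d, w in zip(dist, wrong)]
        z = sum(dist)
        dist = [d / z for d in dist]
    return models, weights


def vote(models, weights, x):
    """Weighted vote of all weak models (the combination of FIG. 1)."""
    score = sum(w * m(x) for m, w in zip(models, weights))
    return 1 if score >= 0 else -1


toy_points = [1, 2, 3, 4]
toy_labels = [-1, -1, 1, 1]
toy_models, toy_weights = boost(toy_points, toy_labels, build_stump, rounds=3)
```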
After additive boosting, the final model is of a form unlike that of the models in the list of admittedly weak models, and the list of weights is difficult to interpret. Despite considerable experimental success, additive boosting is, from the viewpoint of the end user, disturbingly like the old fable about a committee of blind men who independently examined various parts of an elephant by touch and could not pool their admittedly limited observations into a consensus about the whole animal, as illustrated in FIG. 3. Man 31 feels the elephant's leg 32 and assumes he has encountered a tree. Man 33 feels the elephant's leg 34 and assumes he has encountered a snake. The two men are unable to conclude that they have both encountered an elephant 35.
This interpretability problem is well known. A large complex model, such as a typical boosted model, with a whole ensemble of base models and their weights, is difficult to understand and explain. This limits the scope of practical applications. There have been attempts to mitigate the interpretability problem with visualization tools applied after models have been built (J. S. Rao and W. J. E. Potts, "Visualizing Bagged Decision Trees", Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97), pp. 243-246 (August 1997)). The interpretability problem has been addressed for the special case of Naive Bayes classification as the underlying method of building models and "weight of evidence" as the desired interpretation (G. Ridgeway, D. Madigan, T. Richardson, and J. O'Kane, "Interpretable Boosted Naive Bayes Classification", Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), pp. 101-104 (August 1998)).
It is therefore an object of the present invention to provide a new method of boosting of predictive models, called cascade boosting, for resolving the interpretability problem of previous boosting methods while mitigating the fragmentation problem when applied to decision trees.
According to the invention, there is provided a method of cascade boosting, a form of adaptive resampling, which always applies a single weak model to any given data point. A significant improvement to the common method of boosting lies in how the weak models are organized into a decision list. The decision list is typically smaller than the lists of models and weights generated by the prior art, thus making it easier to interpret the correlations among data.
Each list item before the last item specifies a (possibly complex) condition that a data point might satisfy, along with a unique (possibly weak) model to be applied to any point that satisfies the condition but does not satisfy any conditions from earlier in the list. The list is terminated by a last item that has no condition and merely specifies the model to be applied if none of the conditions in earlier items are satisfied. Various methods of building decision lists are known. Cascade boosting is a new method for building a decision list when given any method for building (possibly weak) models.
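How such a decision list is applied to a data point can be sketched as follows. This is an illustrative sketch only: the names ListItem and decision_list_predict, the dictionary representation of a point, and the field names in the toy example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Point = Dict[str, float]


@dataclass
class ListItem:
    # condition is None only for the last, unconditional item.
    condition: Optional[Callable[[Point], bool]]
    model: Callable[[Point], float]


def decision_list_predict(items: List[ListItem], point: Point) -> float:
    """Apply the model of the first item whose condition the point satisfies."""
    for item in items:
        if item.condition is None or item.condition(point):
            return item.model(point)      # exactly one model is ever applied
    raise ValueError("decision list must end with an unconditional item")


# Toy list for the mailing example: predicted response probability.
toy_list = [
    ListItem(lambda p: p["age"] < 30, lambda p: 0.1),
    ListItem(lambda p: p["income"] > 50000, lambda p: 0.7),
    ListItem(None, lambda p: 0.3),
]
young_rich = decision_list_predict(toy_list, {"age": 25, "income": 90000})
older_rich = decision_list_predict(toy_list, {"age": 45, "income": 90000})
older_poor = decision_list_predict(toy_list, {"age": 45, "income": 20000})
```

The first point satisfies both conditions, but only the earlier item applies, so a prediction can always be traced to a single condition and a single model.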
Cascade boosting is simplest when applied to segmented predictive models but may also be applied to predictive models that do not explicitly segment the space of possible data points, for instance neural nets. The word xe2x80x9cpredictivexe2x80x9d is omitted hereafter because all models considered here are predictive.
Cascade boosting of segmented models may be applied to decision trees or any other kind of model that segments the space of possible data points into (possibly intersecting) regions (also known as segments) and utilizes a distinct subordinate model for each region, or segment. The regions are often chosen by various means intended to optimize the performance of the overall model.
Decision trees are the most common kind of segmented model. In the common case of decision trees, the tests performed along paths to regions with models that perform well may fragment the space of possible data points. Fragmentation separates points outside the regions with good models, assigning the fragmented points to disjoint small regions that cannot be modeled well in the presence of noise.
In the common case of decision trees, cascade boosting preserves the relatively successful leaves of the tree while reuniting fragments formed by relatively unsuccessful leaves. In the more general case of a segmented model with possibly intersecting segments, a boosted model may be simpler and/or more predictive than the original model generated by the underlying segmentation process on which cascade boosting has been superimposed.
In the most general case of a model that is treated like a black box, cascade boosting replaces a single black box by a cascade of black boxes, each with a corresponding test to decide whether to apply the model gated by the test or continue to the next stage in the cascade. The boosting process itself applies any of the available methods for segmented classification, so as to identify regions where the last stage in the current state of the evolving cascade performs well.
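One plausible reading of this process is sketched below; it is not a statement of the claimed procedure, whose details are not given in this passage. The sketch assumes one-dimensional points, a toy black-box builder that predicts the median target, and a toy region finder that picks the largest threshold region in which the current last stage has a mean absolute error within a tolerance. All function names, the tolerance, and the minimum region size are hypothetical.

```python
import statistics


def build_median_model(xs, ys):
    """Toy black-box model: always predicts the median training target."""
    m = statistics.median(ys)
    return lambda x: m


def find_good_region(xs, ys, model, tol=0.5, min_size=2):
    """Return a test for the largest threshold region where model does well."""
    best = None                                   # (size, threshold, below)
    for t in sorted(set(xs)):
        for below in (True, False):
            idx = [i for i, x in enumerate(xs) if (x < t) == below]
            if len(idx) < min_size:
                continue
            err = sum(abs(model(xs[i]) - ys[i]) for i in idx) / len(idx)
            if err <= tol and (best is None or len(idx) > best[0]):
                best = (len(idx), t, below)
    if best is None:
        return None
    _, t, below = best
    return lambda x: (x < t) == below


def build_cascade(xs, ys, build_model, find_region, max_stages=5):
    """Gate each stage by a region where it performs well; pass the rest on."""
    stages = []
    for _ in range(max_stages - 1):
        model = build_model(xs, ys)               # current last stage
        gate = find_region(xs, ys, model)
        if gate is None:
            break
        rest = [i for i, x in enumerate(xs) if not gate(x)]
        if not rest:                              # gate covers everything left
            break
        stages.append((gate, model))
        xs = [xs[i] for i in rest]
        ys = [ys[i] for i in rest]
    stages.append((None, build_model(xs, ys)))    # unconditional final stage
    return stages


def cascade_predict(stages, x):
    """Apply the first stage whose test passes; the last stage has no test."""
    for gate, model in stages:
        if gate is None or gate(x):
            return model(x)
    raise ValueError("cascade must end with an unconditional stage")


toy_stages = build_cascade(list(range(8)), [2, 2, 2, 2, 2, 8, 8, 8],
                           build_median_model, find_good_region)
```

On this toy data the first model predicts well only for small inputs, so it is gated to that region and a second model is built from the remaining points alone.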