This invention relates generally to data mining software.
Data mining software extracts knowledge that may be suggested by a set of data. For example, data mining software can be used to maximize a return on investment in collecting marketing data, as well as in other applications such as credit risk assessment, fraud detection, process control, medical diagnosis and so forth. Typically, data mining software uses one or a plurality of different types of modeling algorithms in combination with a set of test data to determine what types of characteristics are most useful in achieving a desired response rate, behavioral response or other output from a targeted group of individuals represented by the data. Generally, data mining software executes complex data modeling algorithms, such as linear regression, logistic regression, back propagation neural networks, Classification and Regression Trees (CART) and Chi-squared Automatic Interaction Detection (CHAID) decision trees, as well as other types of algorithms, on a set of data.
Results obtained by executing these algorithms can be expressed in a variety of ways, for example, as an RMS error, an R-squared (R2) value, a confusion matrix, a gains table, multiple lift charts, or a single lift chart with multiple lift curves. Based on these results, the decision maker can decide which model (i.e., type of modeling algorithm and learning parameters) might be best for a particular use.
In many real-world modeling problems, a single variable or set of input variables can have a significantly strong influence on predicting behavioral outcomes. The data mining software allows for execution of multiple models based on selective segmentation of data, using models designed for and trained with the particular data segments. When the models operate on each of the data segments, they can produce a simple lift chart to show the performance of the model for that segment of data.
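By way of illustration only, the following Python sketch shows one way a lift chart for a single data segment might be derived from scored records. The function names `gains_table` and `cumulative_lift`, the equal-sized binning, and the sample records are assumptions made for this example and are not taken from the software described above.

```python
from typing import Dict, List, Tuple


def gains_table(scored: List[Tuple[float, int]], n_bins: int) -> List[Dict[str, float]]:
    """Rank (score, actual_response) records by descending score and
    split them into n_bins roughly equal bins (hypothetical helper)."""
    ranked = sorted(scored, key=lambda rec: rec[0], reverse=True)
    n = len(ranked)
    bins = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # more bins than records: skip empty bins
        responders = sum(actual for _, actual in chunk)
        bins.append({
            "records": len(chunk),
            "responders": responders,
            "rate": responders / len(chunk),
        })
    return bins


def cumulative_lift(bins: List[Dict[str, float]]) -> List[Tuple[float, float]]:
    """Return (fraction of records contacted, cumulative lift) points."""
    total_records = sum(b["records"] for b in bins)
    total_responders = sum(b["responders"] for b in bins)
    if total_responders == 0:
        raise ValueError("lift is undefined when there are no responders")
    base_rate = total_responders / total_records
    points = []
    seen_records = 0
    seen_responders = 0
    for b in bins:
        seen_records += b["records"]
        seen_responders += b["responders"]
        lift = (seen_responders / seen_records) / base_rate
        points.append((seen_records / total_records, lift))
    return points


# Illustrative records for one segment: (model score, 1 = responded).
segment_records = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1),
                   (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 0)]
segment_bins = gains_table(segment_records, n_bins=4)
segment_lift = cumulative_lift(segment_bins)
```

A lift greater than 1 at a given depth indicates that the model concentrates responders in the top-ranked records of that segment relative to random selection.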
While a single lift chart may provide useful results, the single lift chart does not indicate the usefulness of the multiple model approach. A single lift chart does not indicate how the multiple models should optimally be combined and used. In addition, the performance of individual models based on data segmentation cannot be directly compared to that of a single, non-segmented model to determine whether the improvement, if any, exhibited with the multiple data segment modeling approach justifies the additional modeling expenses associated therewith.
The scores generated for these models cannot be simply sorted across different models when a priori data distributions have been modified. This is typical in problems such as response modeling, when a class or behavior of interest represents a small sample of the overall population (e.g., response rates are typically 1-2%). Records from a less represented class (e.g., responders to a mailing campaign) are typically oversampled relative to the other class (e.g., non-responders). While this sampling technique provides improved prediction accuracy, the model scores for many data-driven algorithms no longer map directly to probabilities of the predicted outcomes, and therefore the scores cannot be simply combined and sorted from multiple models.
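The effect of oversampling on score comparability can be illustrated with a short Python sketch. It assumes an idealized model that exactly learns the class probabilities of its training sample, so that Bayes' rule gives the score after resampling. The function name `score_after_oversampling`, the retention fractions, and the numeric values are hypothetical and chosen only for illustration.

```python
def score_after_oversampling(p_true: float, keep_pos: float, keep_neg: float) -> float:
    """Score an idealized model would emit for a record whose true
    response probability is p_true, when responders are retained in the
    training sample with fraction keep_pos and non-responders with
    fraction keep_neg (assumed, idealized behavior)."""
    weighted_pos = p_true * keep_pos
    weighted_neg = (1.0 - p_true) * keep_neg
    return weighted_pos / (weighted_pos + weighted_neg)


# Model A: trained on a segment with a 1% response rate, rebalanced to
# roughly 50/50 by keeping all responders and 1 in 99 non-responders.
score_a = score_after_oversampling(p_true=0.02, keep_pos=1.0, keep_neg=1.0 / 99.0)

# Model B: trained on a different segment without any resampling.
score_b = score_after_oversampling(p_true=0.10, keep_pos=1.0, keep_neg=1.0)

# The record scored by Model A is five times LESS likely to respond than
# the record scored by Model B, yet it receives the higher raw score.
a_outranks_b = score_a > score_b
```

Under these assumptions, sorting the raw scores of the two models together would rank the less promising record first, which is the difficulty described above.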
According to an aspect of the present invention, a method executed on a computer for modeling expected behavior includes scoring records of a dataset that is segmented into a plurality of data segments using a plurality of models.
According to a further aspect of the present invention, a computer program product residing on a computer readable medium for modeling expected behavior includes instructions for causing a computer to score with a plurality of models records of a dataset that is segmented into a like plurality of data segments.
According to a further aspect of the present invention, a method executed on a computer for modeling expected behavior includes scoring records of a dataset that is segmented into a plurality of data segments using a like plurality of models and combining results obtained from scoring the records into a single representation of the expected behavior.
According to a further aspect of the present invention, a computer program product residing on a computer readable medium for modeling expected behavior includes instructions for causing a computer to score with a plurality of models records of a dataset that is segmented into a like plurality of data segments and combine results obtained from scoring the multiple models into a single representation of the expected behavior.
One or more of the following advantages are provided by one or more aspects of the invention. Multiple model executions on segmented data provide a technique to avoid the significantly strong influence on predicting behavioral outcomes that a single variable or set of input variables may have on a modeling problem. A summary lift chart that combines results from multiple model executions on segmented data provides a technique to allow a decision maker to see the expected performance of modeling from all combined models and determine whether the improvement justifies the additional modeling expense. In addition, this technique applies to all algorithms that do not generate scores representing probabilities.
In addition, the approach set out above allows for modeling real-world problems where a single variable or set of input variables has a significantly strong influence on predicting behavioral outcomes. The approach allows for execution of multiple models based on selective segmentation of data, using models designed for and trained with the particular data segments. With the results-combining approach, the results from these multiple segmented-model executions are combined into a single, summary representation of the results. The summary representation maintains an order of results within a model execution while arranging results in descending order among different model executions.
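One possible reading of the combining step is sketched below in Python, for illustration only. The sketch assumes that each model execution yields a gains table of (records, responders) bins already ordered by descending score within that model, that every bin contains at least one record, and that bins from different models are interleaved by their observed response rate rather than by raw score. The names `merge_segment_bins` and `summary_lift`, the segment labels, and the counts are hypothetical.

```python
from typing import Dict, List, Tuple

Bin = Tuple[int, int]  # (records, responders), records assumed > 0


def merge_segment_bins(per_model: Dict[str, List[Bin]]) -> List[Tuple[str, int, int]]:
    """Interleave bins from several model executions. Bins of one model
    are consumed strictly in their original order; at each step the
    pending bin with the highest observed response rate is taken."""
    position = {name: 0 for name in per_model}
    merged = []
    while True:
        pending = [(name, per_model[name][idx])
                   for name, idx in position.items()
                   if idx < len(per_model[name])]
        if not pending:
            break
        name, (records, responders) = max(
            pending, key=lambda item: item[1][1] / item[1][0])
        merged.append((name, records, responders))
        position[name] += 1
    return merged


def summary_lift(merged: List[Tuple[str, int, int]]) -> List[Tuple[float, float]]:
    """Return (fraction of all records, cumulative lift) points for the
    combined ordering."""
    total_records = sum(records for _, records, _ in merged)
    total_responders = sum(responders for _, _, responders in merged)
    if total_responders == 0:
        raise ValueError("lift is undefined when there are no responders")
    base_rate = total_responders / total_records
    points = []
    cum_records = 0
    cum_responders = 0
    for _, records, responders in merged:
        cum_records += records
        cum_responders += responders
        points.append((cum_records / total_records,
                       (cum_responders / cum_records) / base_rate))
    return points


# Illustrative gains tables for two segment-specific models.
per_model_bins = {
    "segment_a": [(100, 20), (100, 5)],
    "segment_b": [(50, 15), (50, 2)],
}
merged_bins = merge_segment_bins(per_model_bins)
summary_points = summary_lift(merged_bins)
```

Because the merge relies on observed counts rather than raw scores, it does not require the scores of the different models to represent probabilities. The resulting points could be plotted as a single summary lift curve and compared against the curve of a single, non-segmented model.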