Data mining is the exploration and analysis of large quantities of data, in order to discover correlations, patterns, and trends in the data. Data mining may also be used to create models that can be used to predict future data or classify existing data.
For example, a business may amass a large collection of information about its customers. This information may include purchasing information and any other information available to the business about the customer. The predictions of a model associated with customer data may be used, for example, to control customer attrition, to perform credit-risk management, to detect fraud, or to make decisions on marketing.
To create and test a data mining model, available data may be divided into two parts. One part, the training data set, may be used to create models. The rest of the data, the testing data set, may be used to test the model, and thereby determine the accuracy of the model in making predictions.
Data within data sets is grouped into cases. For example, with customer data, each case corresponds to a different customer, and the data in the case describes or is otherwise associated with that customer. One type of data that may be associated with a case (for example, with a given customer) is a categorical variable. A categorical variable categorizes the case into one of several pre-defined states. For example, one such variable may correspond to the educational level of the customer. The possible values of such a variable are known as states. The states of the educational level variable may be “high school degree,” “bachelor's degree,” or “graduate degree” and may correspond to the highest degree earned by the customer.
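A case with a categorical variable can be sketched in Python as follows; the field names (`customer_id`, `education`) and the dictionary representation are illustrative assumptions, not part of the original description:

```python
# Pre-defined states of the categorical "educational level" variable.
EDUCATION_STATES = ("high school degree", "bachelor's degree", "graduate degree")

# A case: one customer and the data associated with that customer.
# Field names here are hypothetical.
case = {
    "customer_id": 1,
    "education": "bachelor's degree",  # must be one of the pre-defined states
}

# A categorical variable categorizes the case into exactly one state.
assert case["education"] in EDUCATION_STATES
```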
The available data is partitioned into two groups—a training data set and a testing data set. Often 70% of the data is used for training and 30% for testing. A model may be trained using only the training data set, which includes the state information. Once the model is trained, it may be run on the testing data set for evaluation. During this testing, the model is given all of the testing data except the educational level data, and asked to predict the probability that the educational level variable for each customer is a particular state, such as “bachelor's degree.”
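The 70%/30% partition described above can be sketched as follows; the function name `split_cases` and the representation of cases are illustrative assumptions:

```python
import random

def split_cases(cases, train_fraction=0.7, seed=0):
    """Shuffle the available cases and partition them into a training
    data set and a testing data set (70%/30% by default)."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

cases = list(range(100))            # stand-ins for 100 customer cases
train_set, test_set = split_cases(cases)
```

With 100 cases, this yields 70 training cases and 30 testing cases, and every case lands in exactly one of the two sets.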
After the model is run on the testing data set, its predictions are compared to the actual testing data to determine whether the model correctly assigned a high probability to the “bachelor's degree” state for cases that actually have “bachelor's degree” as the state of the educational level variable.
One method of displaying the success of a model graphically is by means of a lift chart, also known as a cumulative gains chart. To create a lift chart, the cases from the testing data set are sorted according to the probability assigned by the model that the variable (e.g. educational level) has the state (e.g. bachelor's degree) that was tested, from highest probability to lowest probability. Once this is done, a lift chart can be created from data points (X, Y) showing for each point what number Y of the total number of true positives (those cases where the variable does have the state being tested for) are included in the X% of the testing data set cases with the highest probability for that state, as assigned by the model.
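The construction just described—ranking cases by predicted probability and counting cumulative true positives—can be sketched as follows. The function name and the `(probability, is_true_positive)` pair representation are illustrative assumptions:

```python
def lift_points(cases):
    """Compute the (X, Y) points of a lift (cumulative gains) chart.

    `cases` is a list of (probability, is_true_positive) pairs:
    the model's predicted probability that the variable has the
    tested-for state, and whether the case actually has that state.
    Returns points (x_percent, y_count): the number of true positives
    captured within the top x_percent of cases ranked by probability.
    """
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    n = len(ranked)
    points = [(0.0, 0)]
    captured = 0
    for i, (_, is_tp) in enumerate(ranked, start=1):
        captured += int(is_tp)
        points.append((100.0 * i / n, captured))
    return points

# Tiny example: 5 cases, of which 2 are true positives.
example = [(0.9, True), (0.8, False), (0.7, True), (0.4, False), (0.1, False)]
pts = lift_points(example)
```

Here the top 20% of cases (the single highest-probability case) captures 1 of the 2 true positives, and the final point is always (100, total true positives).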
As shown in FIG. 1, the conventional lift chart shows that there are 1000 total true positives in the testing set. This is not necessarily the number of cases in the testing data set, because some cases may have a different state for the variable than the one being tested. The number of true positives in the testing data set is the highest number shown on Y axis 10. The X axis 20 correlates with the percentage of cases with the highest probabilities. Lift line 30 depicts the success of the model. For example, it can be seen that lift line 30 includes a point with (X, Y) coordinates of approximately (20, 500). This indicates that the 20% of cases ranked by the model as most probably having the tested-for state of the variable include approximately 500 of the cases that are truly positive for that state. This is equivalent to capturing 50% of the true positives within only the top 20% of the cases tested.
A model that randomly assigns probabilities would be likely to have a chart close to the random lift line 40. In the top 10% of cases, such a model would find 10% of the true positives. Note that the X axis may also be expressed in the number of high-probability cases, and the Y axis in percentages. A perfect model may also be considered. In a situation where there are N% true positives among the entire testing data set, the lift line would stretch straight from the origin to the point (N, YMAX) (where YMAX is the maximum Y value). This is because all of the true positives would be identified before any false positives are identified. The lift line for the perfect model would then continue horizontally from that point to the right. For example, if 20% of the cases had the tested-for state, as shown in FIG. 2, a perfect model would have the perfect lift line 50, extending from (0, 0) to (20, 1000) and then from (20, 1000) to (100, 1000). Similarly, the worst-case model would identify no true positives until the last N% of the testing population is included, and, as shown in FIG. 3 for the case where there are 20% true positives, the worst-case lift line 60 for such a model would extend from (0, 0) to (80, 0) and then straight from (80, 0) to (100, 1000).
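The three reference lines can be sketched as piecewise-linear functions, using the values from FIGS. 2 and 3 (20% true positives, 1000 total true positives); the function names are illustrative assumptions:

```python
def random_lift(x_percent, total_tp):
    """Expected true positives captured by a random model at x_percent:
    a straight line from the origin to (100, total_tp)."""
    return total_tp * x_percent / 100.0

def perfect_lift(x_percent, total_tp, positive_rate_percent):
    """A perfect model: rises linearly until all positives are found
    at x = positive_rate_percent, then continues horizontally."""
    if x_percent >= positive_rate_percent:
        return total_tp
    return total_tp * x_percent / positive_rate_percent

def worst_lift(x_percent, total_tp, positive_rate_percent):
    """The worst-case model: finds no positives until only the last
    positive_rate_percent of the population remains."""
    start = 100.0 - positive_rate_percent
    if x_percent <= start:
        return 0.0
    return total_tp * (x_percent - start) / positive_rate_percent
```

With 20% true positives and 1000 total, the perfect lift line passes through (20, 1000), while the worst-case line stays at 0 until x = 80 and reaches 1000 only at x = 100.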
As described above, in the prior art, a lift chart can be used to display and measure the prediction accuracy of a model for a given state of a categorical variable. However, existing lift charts do not have any capability for measuring the effectiveness of a model in predicting an association. Additionally, while the prior art lift chart can be used to display the prediction accuracy of a model in terms of the percentage of true positives captured in different-size groups of the cases with the highest associated probabilities, it provides no capability for conveying the number of true positives in the testing data set.
Thus, there is a need for a method and system for generating improved charts for displaying the accuracy of models.