The invention is related to data mining and, in particular, to a method for determining a time for retraining a data mining model.
Data mining is the analytic process of extracting meaningful information from new data, often referred to as ‘application data’, wherein the ultimate goals are prediction and detection of non-obvious knowledge. Data mining is often associated with data warehousing as the data mining process may need to access large amounts of data. Data mining is designed to explore data in search of systematic relationships and consistent patterns between variables. When such relationships or patterns are uncovered, they can be validated by applying detected patterns to new subsets of data.
Predictive data mining typically comprises the stages of initial exploration, model building and validation, and deployment. As understood in the relevant art, deployment is the process of applying a data mining model to new data in order to generate predictions. Once model building has been accomplished, the data mining model may be verified and deployed. The data mining model is then continuously applied to new data, different from the original training data on which it was built. Thus, the statistical properties of the data to which the data mining model is applied may differ from those of the original training data, or may start to diverge from them over time.
Even if the new data to which the data mining model is applied initially has the same statistical properties as the training data, the application data may change over time as underlying trends take effect. A data mining model built on data that represents only a snapshot in time ignores these underlying trends, and so becomes outdated. Another factor is that the data to which the data mining model is deployed may represent only a subset of the data on which the data mining model was built. For example, a data mining model that was trained on data representative of an entire population may be deployed at an institution where only a certain subset of that population exists, or may be deployed in a certain geographical region. Of course, such factors can also occur in combination.
Accordingly, a data mining model has to be retrained when the difference between the application data and the training data becomes great enough to affect the accuracy of the predictions. It is not obvious, however, when to retrain, as retraining and establishing a new data mining model consume resources and should not be done unnecessarily. Furthermore, the cost of rolling out a new data mining model can be substantial.
It is reasonable to determine, before or at deployment, whether the data mining model is suitable for use with the current set of application data. If the data mining model is outdated and no longer reliable, this determination can be used to trigger the training of a new data mining model. In one conventional approach, the data mining model is periodically retrained, i.e., retrained whenever a certain time frame has elapsed. With this approach, however, retraining may occur either too often, thus increasing costs, or too infrequently, thus yielding severe mispredictions.
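The periodic approach described above reduces to a simple elapsed-time test. The following is a minimal sketch of such a check, assuming a hypothetical fixed interval (`RETRAIN_INTERVAL`) and bookkeeping of the last training time; note that the interval is an arbitrary policy choice, which is precisely the weakness of this approach.

```python
from datetime import datetime, timedelta

# Hypothetical fixed retraining interval; the value is an assumption,
# not derived from any property of the data.
RETRAIN_INTERVAL = timedelta(days=30)

def should_retrain(last_trained: datetime, now: datetime) -> bool:
    """Return True once the fixed retraining interval has elapsed,
    regardless of whether the application data has actually changed."""
    return now - last_trained >= RETRAIN_INTERVAL
```

Because the decision ignores the data entirely, the interval will be too short for stable data sources (wasted retraining cost) and too long for rapidly drifting ones (severe mispredictions).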
Another conventional approach is to monitor distribution parameters of the application data exclusively. The distribution parameters are initialized with the first set of data to which the model is applied, rather than with parameters of the training data. For example, every time the data mining model is applied to a new portion of data, statistics are computed. These may include averages and, optionally, distributions for each of the data mining model's predictor variables. The statistics for the new portion of data should not diverge significantly from previous statistics, and especially not from the statistics computed on the original dataset used to validate the data mining model.
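The per-variable monitoring described above can be sketched as follows. This is an illustrative implementation under assumed conventions (records as dictionaries, divergence measured as the shift of each variable's mean in units of the baseline standard deviation, and a hypothetical threshold of 2.0); the source does not prescribe a specific statistic or threshold.

```python
import statistics

def column_stats(rows, variables):
    """Mean and population standard deviation for each predictor
    variable in a batch of records (each record is a dict)."""
    stats = {}
    for v in variables:
        col = [row[v] for row in rows]
        stats[v] = (statistics.mean(col), statistics.pstdev(col))
    return stats

def diverged(baseline, current, threshold=2.0):
    """Return the variables whose current mean lies more than
    `threshold` baseline standard deviations from the baseline mean.
    The threshold value is an assumption for illustration."""
    flagged = []
    for v, (base_mean, base_std) in baseline.items():
        cur_mean, _ = current[v]
        if base_std > 0 and abs(cur_mean - base_mean) / base_std > threshold:
            flagged.append(v)
    return flagged
```

In use, `column_stats` would be run on each new portion of application data and compared against the statistics retained from the validation dataset; any flagged variable indicates divergence that may warrant retraining.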
The extent of the divergence indicates the extent of model deterioration that is likely to be encountered. Sudden, dramatic divergence generally results from a change in the structure of the source data, while gradual divergence is often symptomatic of a change in the dynamics of the data source. However, if the data to which the data mining model is deployed already differ sufficiently from the training data set, monitoring distribution parameters of the application data exclusively will not immediately trigger retraining of the model. Furthermore, this approach depends on observing a sufficiently large change in the data, which can significantly delay detection of the change and lead to unnecessary mispredictions.
Yet another conventional approach is to train multiple data mining models, i.e., one model for each possible subset of the data population, instead of a single model for the whole population. One drawback of this approach is that the increased number of data mining models can complicate model management. Moreover, it is often desirable to build a data mining model on the entire data set, since such a model behaves more robustly against local extremes in the data.
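The multiple-model approach amounts to maintaining a registry that maps each population subset to its own trained model. The following is a minimal sketch, assuming a hypothetical `SegmentedModels` registry and models exposing a `predict` method; it illustrates the management burden: every segment must have its own model registered, kept current, and retrained independently.

```python
class SegmentedModels:
    """Hypothetical registry holding one trained model per
    population segment, as in the multiple-model approach."""

    def __init__(self):
        self.models = {}  # segment key -> trained model

    def register(self, segment, model):
        """Store the model trained for a given segment."""
        self.models[segment] = model

    def predict(self, segment, record):
        """Route a record to the model for its segment; a missing
        segment means no model was trained for that subset."""
        if segment not in self.models:
            raise KeyError(f"no model trained for segment {segment!r}")
        return self.models[segment].predict(record)
```

Each segment's model must be trained, validated, deployed, and eventually retrained on its own schedule, which is the management overhead noted above.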
From the above, it is clear that there is a need for a method for data mining operations that more reliably determines when to retrain a data mining model or train a new data mining model.