The tremendous growth of data amassed by organizations and enterprises has spurred the development of analytics systems to extract insights and actionable intelligence from that data. Machine learning is a field of analytics that gives computers the ability to get better at a task as they perform more of it, in other words, to learn from performing the task repeatedly. In the past, because computing resources were limited, statisticians developed techniques such as sampling and summarization using parameters like the mean, median, and standard deviation to deal with very large datasets. However, the availability of large computing resources at relatively low cost has enabled practitioners of machine intelligence to analyze all of the data to extract useful insights.
Machine learning involves configuring a computer system to learn from experience with respect to some task and some performance measure. As a result of machine learning, a computer system's performance improves with experience in the course of its operation. For example, consider the task of predicting which customers are likely to drop out of a service offered by a company, a task known as churn prediction. Here the task is the prediction of those customers who are likely to drop out. Performance is the prediction accuracy, i.e., what percentage of the customers who were predicted to drop out actually did. And experience is the data that is provided to the prediction software on how well its predictions worked. So, a system exhibiting machine learning should improve its prediction accuracy over time as it is used in operation. The target operation can be predicting a continuous-valued variable, such as forecasting the inventory required for a store, or predicting one of a few discrete outcomes, such as whether a customer might drop out or not. The first case is known as a regression problem, whereas the second is called a classification problem.
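The performance measure described above for churn prediction can be illustrated with a short sketch. The function name and the sample data below are hypothetical and chosen only for illustration; the calculation follows the definition given above, namely the percentage of customers predicted to drop out who actually did.

```python
def churn_prediction_accuracy(predicted, actual):
    """Percentage of customers predicted to drop out who actually did.

    `predicted` and `actual` are equal-length lists of booleans,
    where True means "drops out". Both names are illustrative.
    """
    # Keep the actual outcome of every customer who was flagged as a dropout.
    flagged = [a for p, a in zip(predicted, actual) if p]
    if not flagged:
        return None  # no customer was flagged, so the measure is undefined
    return 100.0 * sum(flagged) / len(flagged)


# Hypothetical data for six customers.
predicted = [True, True, False, True, False, True]
actual = [True, False, False, True, True, True]

accuracy = churn_prediction_accuracy(predicted, actual)
print(accuracy)  # 75.0: three of the four flagged customers dropped out
```

Because the outcome here is one of two discrete values, this is a classification example; a regression task such as inventory forecasting would instead be scored with an error measure on the continuous prediction.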
There are a number of typical technical challenges in developing a machine learning system to work effectively for a given application. Choosing the appropriate datasets, cleansing the data, and loading the data into appropriate repositories for analysis is ordinarily an important first step. The datasets made available are often dictated by the task. Another important step is selecting a feature set that adequately represents the data for the prediction task. In many applications, the data may consist of hundreds or even a thousand fields representing, for example, customer transaction history, credit scores, and agent profiles. There is very often redundancy in such high-dimensional data. Thus, it is often necessary to reduce the dimensionality of the data before applying classification or regression analysis.
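As one illustration of removing redundancy from high-dimensional data, the sketch below drops any field that is strongly correlated with a field already kept. This correlation filter is only one simple technique, chosen here for brevity; it is not presented as the dimensionality reduction method of any particular system. The field names, the data values, and the 0.95 threshold are assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant field is treated as uncorrelated here
    return cov / math.sqrt(var_x * var_y)


def drop_redundant_fields(columns, threshold=0.95):
    """Return the names of fields to keep, in their original order.

    A field is kept only if its absolute correlation with every
    previously kept field is below `threshold` (an assumed cutoff).
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical customer fields; annual_spend is exactly 12 x monthly_spend.
columns = {
    "monthly_spend": [20.0, 35.0, 50.0, 65.0, 80.0],
    "annual_spend": [240.0, 420.0, 600.0, 780.0, 960.0],
    "credit_score": [700.0, 640.0, 720.0, 610.0, 690.0],
}

kept = drop_redundant_fields(columns)
print(kept)  # ['monthly_spend', 'credit_score']
```

The filter only detects pairwise linear redundancy. Projection methods such as principal component analysis are a common alternative when redundancy is spread across many fields.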
Once an appropriate feature vector is developed, another task is to develop a suitable classifier or regression model for the ultimate prediction task. A number of classifiers are available in the literature, including logistic regression, neural networks, and support vector machines (SVM). Models are initially trained using the available training data. Once a system is deployed in the field, newly generated data from the results of prediction can be fed back to the system to train the models further. This is ordinarily where the machine learning comes in.
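The train-then-retrain cycle described above can be sketched with a small logistic regression classifier, one of the classifier types named above. This is a minimal illustration under stated assumptions: the class name, the two-element feature vectors, the learning rate, and the epoch counts are all hypothetical, and the training uses plain stochastic gradient descent rather than any specific production method.

```python
import math


class LogisticModel:
    """Tiny logistic regression classifier trained by stochastic gradient descent."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Estimated probability that feature vector x belongs to class 1."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, samples, epochs=200):
        """Update the model on (features, label) pairs.

        Training continues from the current weights, so calling this
        again with field data trains the existing model further.
        """
        for _ in range(epochs):
            for x, y in samples:
                error = self.predict_proba(x) - y  # gradient of log loss w.r.t. z
                self.weights = [
                    w - self.learning_rate * error * xi
                    for w, xi in zip(self.weights, x)
                ]
                self.bias -= self.learning_rate * error


# Hypothetical, already-scaled features; label 1 means the customer dropped out.
initial_data = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.8, 0.9], 1), ([0.9, 0.7], 1)]
feedback_data = [([0.3, 0.2], 0), ([0.7, 0.8], 1)]

model = LogisticModel(n_features=2)
model.train(initial_data)              # initial training before deployment
model.train(feedback_data, epochs=50)  # further training on fed-back results

high_risk = model.predict_proba([0.9, 0.9])
low_risk = model.predict_proba([0.1, 0.1])
```

The second call to `train` does not reset the weights, which is what lets feedback from the field refine the model instead of replacing it.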
Over time, the nature and statistics of the data may change. This may make the prediction models less effective, necessitating an update of the models. This phenomenon has been referred to as model decay. Model decay is addressed by updating the model from time to time, for example, every three months or annually, using more recent data. The decision to update the model is often made a priori, for example by a business owner, without quantitatively analyzing the effectiveness of the model with respect to the changing statistics of the run-time data. The collection of more recent data, the updating of the model, and the deployment of the new model are manual processes and may take weeks to months to complete. As a result, manual updating of the model is expensive, inefficient, and suboptimal.
So, how often and when should the models be updated? This is often a daunting question for many analytics deployments and companies. When and how to update the model is the primary problem we are addressing. The model needs to be updated when the prediction accuracy falls below an acceptable level. To compute the prediction accuracy, one needs to have the actual outcomes, compare them with the predicted outcomes, and determine what percentage was correct; see FIG. 2 for a block-diagram-level representation of a system that relies upon actual results to update models. The storage buffer in FIG. 2 collects enough actual data points on the accuracy of the model before a model update is triggered. However, it may take weeks or months for the actual outcomes to be known, as in the case of loan payment default or customer retention applications. In the meantime, the model may be performing at a much lower accuracy level due to model decay. The challenge then is to determine how well the model is doing even before the actual outcomes, and hence the accuracy, are known. The method we disclose works equally well for classification and regression problems.
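The conventional approach described above, in which actual outcomes are buffered and a model update is triggered when accuracy falls below an acceptable level, can be sketched as follows. The class name, the buffer size, and the accuracy threshold are assumptions made for illustration; the sketch covers only the baseline that waits for actual outcomes, not the method that estimates model quality before those outcomes are known.

```python
class AccuracyMonitor:
    """Buffers resolved predictions and signals when a model update is needed."""

    def __init__(self, buffer_size, min_accuracy):
        self.buffer_size = buffer_size    # outcomes to collect before judging
        self.min_accuracy = min_accuracy  # acceptable accuracy, as a fraction
        self.buffer = []

    def record(self, predicted, actual):
        """Store one resolved prediction; return True if an update is triggered."""
        self.buffer.append(predicted == actual)
        if len(self.buffer) < self.buffer_size:
            return False  # not enough actual outcomes collected yet
        accuracy = sum(self.buffer) / len(self.buffer)
        self.buffer.clear()  # start collecting the next batch
        return accuracy < self.min_accuracy


# Hypothetical stream of (predicted, actual) outcomes arriving over time.
monitor = AccuracyMonitor(buffer_size=4, min_accuracy=0.75)
outcomes = [(1, 1), (0, 0), (1, 0), (0, 1)]

triggers = [monitor.record(p, a) for p, a in outcomes]
print(triggers)  # [False, False, False, True]: 2 of 4 correct is below 0.75
```

No decision can be made until the buffer fills, and each entry arrives only once its actual outcome is known, which is the source of the weeks-to-months delay noted above.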