1. Field of the Invention
The present invention generally relates to a method for building statistical predictive models from massive data sets in a computationally tractable way and, more particularly, to an efficient heuristic approach that uses a simple probability model obtained from the massive training data set to generate simulated pseudo data for most aspects of the overall predictive modeling procedure. This heuristic approach reduces the number of expensive data scans over the massive training data set and, consequently, leads to a significant reduction in the computational cost without greatly affecting the overall accuracy of the statistical modeling procedure.
2. Background Description
Predictive modeling techniques are of central importance in applications such as database marketing, targeted advertising, fraud detection, credit/risk modeling, and information retrieval. In all these applications, the training data sets can be potentially enormous. For example, certain marketing databases that we have encountered contain millions of customer records with thousands of attributes per record. The development of statistical predictive modeling algorithms is of central importance in these applications. However, the computational cost of existing statistical procedures is prohibitively expensive for the massive data sets that are encountered therein.
Therefore, typically in these applications, the computational cost is reduced by using preliminary data reduction prior to the statistical modeling procedure. For example, consider a data set with N records and J features per record, where N, M are large. Sub-sampling can be used to select some Nxe2x80x2 less than  less than N records, and a preliminary feature selection or transformation can be used to select some Jxe2x80x2 less than  less than J features. This reduced data set with Nxe2x80x2 records and Jxe2x80x2 features is then used to perform the required statistical modeling in a computationally tractable way. However, sub-sampling is unsatisfactory for heterogenous and high-dimensional data sets, and the resulting model estimates will have a high degree of variability with respect to the sampling. Similarly, the preliminary feature selection or transformation is unsatisfactory since it may inadvertently exclude or de-emphasize the original features that are crucial for the accuracy of the eventual statistical models.
Other techniques have recently emerged for scaling up statistical modeling algorithms to massive data sets. For example, G. Graefe, U. Fayyad and S. Chaudhuri, xe2x80x9cOn the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databasesxe2x80x9d, Proc. of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 204-208, (1998), and A. W. Moore and M. S. Lee, xe2x80x9cCached Sufficient Statistics for Efficient Machine Learning with Large Datasetsxe2x80x9d, Journal of Artificial Intelligence Research, p. 8 (1998), describe efficient methods for evaluating the sufficient statistics for specific statistical modeling algorithms, which allows the modeling computations to be carried without having to refer back to the original data. These approaches work well if the statistical modeling algorithm is fixed in advance, and if it has a a finite set of sufficient statistics that can be easily obtained from the data. However, one or the other of these assumptions may not be true in many predictive modeling situations.
Other approaches to this problem include P. S. Bradley, U. Payyad and C. Reina, xe2x80x9cScaling Clustering Algorithms to Large Databasesxe2x80x9d, Proc. of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 9-15, (1998), and W. DuMouchel, C. Volinsky, T.Johnson, C. Cortes and D. Pregibon, xe2x80x9cSquashing Flat Files Flatterxe2x80x9d, Proc. of the Fifth International Conference on Knowledge Discovery and Data Mining, pp. 6-15, (1999) who describe methods for producing compressed data sets consisting of a small number of pseudo data points and associated weights. These compressed data sets are used to evaluate statistical models that are more accurate and less variable than that from an equivalent-sized, randomly sub-sampled subset of the original data. However, these approaches require that existing predictive modeling algorithms and computer codes be capable of handling weights for the data points, which is a limitation.
According to our invention, we provide a heuristic approach that leads to a computationally efficient way for statistical predictive modeling from massive data sets.
This approach is based on constructing a simple probability model from the data, and using this model to generate simulated pseudo data that can be used for some aspects of the statistical modeling procedure. For example, this could be used for preliminary model selection or model parameter estimation. The preliminary model can then be refined in a final step by reverting to the original training data. This reduces the number of data scans over the massive training data set, and thereby reduces the computational cost of the overall statistical modeling procedure.
In order to illustrate this approach, we consider the specific problem of a Naive Bayes model with feature selection. This is a widely used predictive modeling technique in the machine learning community for multi-variate categorical response variables. A straightforward feature selection algorithm would require O(J2) data scans over the training data in order to find the Naive Bayes model with the optimum feature subset. However, our heuristic approach uses just two data scans to obtain models of comparable accuracy.