This invention relates to a method for constructing predictive models that can be used to make predictions in situations where the inputs to those models can have values that are missing or are otherwise unknown.
The invention considers a widely applicable method of constructing predictive models that are capable of generating reliable predictions even when the values of some model inputs are missing or are otherwise unknown. In this regard, it has been discerned that constructing such models is an important problem in many industries that employ predictive modeling in their operations. For example, predictive models are often used for direct-mail targeted-marketing purposes in industries that sell directly to consumers. In this application, predictive models are used to optimize return on marketing investment by ranking consumers according to their predicted responses to promotions, and then mailing promotional materials only to those consumers who are most likely to respond and generate revenue. Such predictive models typically employ demographic, credit, and other data as inputs, and these data often contain many missing values. Generating predictions with greater reliability despite the presence of missing values can lead to better returns on marketing investments for this application. Similar economic benefits can likewise be expected in other commercial applications of predictive modeling.
It has also been discerned that numerous deficiencies exist in the prior art on how to handle missing values. With regard to constructing predictive models on the basis of training data, the prior art on handling missing values can be roughly divided into six categories (not mutually exclusive):
1) METHODS THAT IGNORE TRAINING CASES THAT CONTAIN MISSING VALUES. This approach is simple and straightforward to mechanize, but it can produce models that generate unreliable predictions when the proportion of missing values is high (see, for example, L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Chapman and Hall, 1993; R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data, John Wiley and Sons, 1987; J. R. Quinlan, "Unknown attribute values in induction," Proceedings of the Sixth International Machine Learning Workshop, pp. 164-168, Morgan Kaufmann, 1989; and M. Singh, "Learning Bayesian networks from incomplete data," Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp. 534-539, American Association for Artificial Intelligence, 1997).
2) METHODS THAT IGNORE DATA FIELDS THAT CONTAIN MISSING VALUES. Although rarely discussed in the literature, this approach is often employed in practice by data analysts, particularly in combination with the first approach. When the two approaches are combined, combinations of cases and data fields are removed from the training data until all remaining data fields and training cases contain known data values. The problem with ignoring data fields, however, is that it throws away potentially useful information that might have yielded more accurate models had alternative methods of handling missing values been employed.
3) METHODS THAT INTRODUCE "MISSING" AS A LEGITIMATE DATA VALUE. This approach is valid only when missing values convey information. For example, if the date of last pregnancy is missing from a patient's medical record, then it is likely that the patient either is male and is unable to become pregnant, or the patient is female and has never been pregnant. However, when values are missing for random reasons, the fact that they are missing conveys no information about the true data values. In such instances, treating "missing" as a legitimate data value can produce inferior models compared to other approaches to handling missing values (see, for example, J. R. Quinlan, "Unknown attribute values in induction," Proceedings of the Sixth International Machine Learning Workshop, pp. 164-168, Morgan Kaufmann, 1989). The reason for the inferior performance seems to stem from the fact that treating missing as a legitimate value in this case does not adequately take into account the fact that there actually should be a value but that value is not known (see, for example, M. Singh, "Learning Bayesian networks from incomplete data," Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp. 534-539, American Association for Artificial Intelligence, 1997). In summary, when missing values convey information, it is reasonable to introduce "missing" as a legitimate value. When missing values convey no information, some other approach to handling these missing values should be employed.
4) METHODS THAT FILL IN MISSING VALUES VIA IMPUTATION PROCEDURES. This approach involves replacing missing values by estimated values and then employing model-construction methods that assume that all data values are known (see, for example, L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Chapman and Hall, 1993; R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data, John Wiley and Sons, 1987; J. R. Quinlan, "Unknown attribute values in induction," Proceedings of the Sixth International Machine Learning Workshop, pp. 164-168, Morgan Kaufmann, 1989; and M. Singh, "Learning Bayesian networks from incomplete data," Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp. 534-539, American Association for Artificial Intelligence, 1997). The replacement can be performed once (i.e., single imputation) or several times (i.e., multiple imputation). Multiple imputation generally produces better results than single imputation (see, for example, R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data, John Wiley and Sons, 1987). However, in order to estimate missing values in the first place, one must construct models for those missing values. Because some models can be more accurate than others, the quality of the predictive model constructed from filled-in values is ultimately dependent on the quality of the missing-value models used to calculate those filled-in values. Moreover, constructing accurate missing-value models can itself be problematic, requiring invention to solve.
5) METHODS THAT EMPLOY WEIGHTING SCHEMES IN THE CALCULATION OF MODEL PARAMETERS IN AN ATTEMPT TO COMPENSATE FOR THE PRESENCE OF MISSING DATA. This approach is common in the analysis of survey data wherein people who are surveyed can choose not to respond to some or all of the survey questions. Adjustments are therefore made in the analysis of the results to compensate for nonresponses by introducing weighting factors in the calculations performed on the known responses (see, for example, R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data, John Wiley and Sons, 1987). The calculation of weights is based on assumed models for the occurrences of nonresponses. Inaccuracies in these models therefore produce inaccuracies in the analysis of the data.
Weighting schemes are also employed in some classification and regression tree algorithms (see, for example, J. R. Quinlan, "Unknown attribute values in induction," Proceedings of the Sixth International Machine Learning Workshop, pp. 164-168, Morgan Kaufmann, 1989). However, these weighting schemes are actually mathematically equivalent to performing multiple imputations with extremely large numbers of replacements. In essence, the weights correspond to probabilities in statistical models that are constructed for the missing values as part of the tree-building process. Instead of actually performing imputations and constructing trees from filled-in data, it is computationally more efficient to modify the tree-construction algorithms to employ weights that are calculated from the missing-value models. Because these weighting schemes can be derived from imputation procedures, they suffer the same drawbacks as do imputation procedures.
6) METHODS THAT INTRODUCE FREE PARAMETERS INTO THE MODEL THAT REPRESENT THE MISSING DATA AND THAT THEN ESTIMATE THESE PARAMETERS BASED ON THE DATA VALUES THAT ARE KNOWN. The Expectation Maximization (EM) algorithm (see, for example, A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, Vol. 39, pp. 1-38, 1977; and R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data, John Wiley and Sons, 1987) is the principal method that is employed for estimating the missing-data parameters as an integral part of the model-construction process. However, the so-called EM algorithm is actually a system of generalized mathematical equations that must be given specific forms for specific applications. In addition, considerable skill and ingenuity are often required to reduce the equations for specific applications into sequences of calculations that can then be mechanically realized. Thus, while the EM algorithm is very general and has many advantages, it often requires invention to apply the EM algorithm in practice.
In summary, the first three approaches to handling missing values that are listed above are straightforward and can be readily applied. However, each has its individual disadvantages. The last three approaches are more sophisticated and avoid some of the deficiencies of the first three approaches; however, they often require skill, ingenuity, and/or invention to be applied in practice.
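The first three approaches listed above, together with the simplest form of single imputation, can be illustrated with a short Python sketch. The toy records, the field names, the use of an indicator field to represent "missing" as a value, and the choice of the mean as the imputed value are all assumptions made for illustration and are not taken from the cited references.

```python
from statistics import mean

# Hypothetical toy training data; None marks a missing value.
rows = [
    {"age": 34, "income": 52.0, "responded": 1},
    {"age": None, "income": 61.0, "responded": 0},
    {"age": 45, "income": None, "responded": 1},
    {"age": 29, "income": 38.0, "responded": 0},
]
inputs = ["age", "income"]

# Approach 1: ignore training cases that contain missing values.
complete_cases = [r for r in rows if all(r[f] is not None for f in inputs)]

# Approach 2: ignore data fields that contain missing values.
complete_fields = [f for f in inputs if all(r[f] is not None for r in rows)]

# Approach 3: treat "missing" as a legitimate value, represented here
# (an assumption) by an extra boolean indicator field per input.
with_indicators = [
    {**r, **{f + "_missing": r[f] is None for f in inputs}} for r in rows
]

# Approach 4, single imputation: fill each gap with the mean of the
# known values of that field (one simple missing-value model).
def impute_mean(data, fields):
    means = {f: mean(r[f] for r in data if r[f] is not None) for f in fields}
    return [
        {**r, **{f: means[f] if r[f] is None else r[f] for f in fields}}
        for r in data
    ]

imputed = impute_mean(rows, inputs)

print(len(complete_cases), complete_fields, imputed[1]["age"])
```

On this toy data the first approach keeps only two of the four cases and the second discards both input fields, which shows how quickly information is lost when the two are combined.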
As indicated above, we have discerned that the prior-art methods of handling missing values in predictive models have deficiencies that either result in the construction of models that generate poor predictions relative to other approaches, or prevent the methods from being readily mechanized and applied in practice.
In sharp contrast, we have now discovered a methodology for handling missing values that can be readily mechanized in a widely applicable fashion, and that yields models that produce reliable predictions relative to other model-construction methods. In its generalized expression, the invention comprises a program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform method steps for constructing predictive models that can be used to make predictions even when the values of some or all inputs are missing or are otherwise unknown, the method steps comprising:
1) presenting a collection of training data comprising examples of input values that are available to the model together with the corresponding desired output value(s) that the model is intended to predict; and
2) generating a plurality of subordinate models that together comprise an overall model, in such a way that:
a) each subordinate model has an associated set of application conditions that must be satisfied in order to apply the subordinate model when making predictions, the application conditions comprising
i) tests for missing values for all, some, or none of the inputs, and
ii) tests on the values of all, some, or none of the inputs, such tests being applicable when the inputs mentioned in the tests have known values; and
b) for at least one subordinate model, the training cases used in the construction of that subordinate model include some cases that indirectly satisfy the application conditions in the sense that the application conditions are satisfied only after replacing one or more known data values in these training cases with missing values.
In its generalized expression, the novel method can realize significant advantages because it can be readily applied in conjunction with any method for constructing models, including ones that require all input values to be known, thereby yielding combined methods for constructing models that tolerate missing values.
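One possible reading of the method steps above can be sketched in Python. The sketch is only illustrative and is not the detailed embodiment: it assumes one subordinate model per pattern of known inputs, uses application conditions that consist solely of missing-value tests, and uses a hypothetical base learner (fit_base) that requires all of its inputs to be known. All function names, field names, and data are invented for the example.

```python
from itertools import combinations
from statistics import mean

MISSING = None  # marker for a missing input value

def fit_base(cases, fields):
    """Hypothetical base learner that needs every listed field to be known:
    it predicts the mean output for each observed combination of values and
    falls back to the overall mean for unseen combinations."""
    groups = {}
    for x, y in cases:
        groups.setdefault(tuple(x[f] for f in fields), []).append(y)
    table = {key: mean(ys) for key, ys in groups.items()}
    default = mean(y for _, y in cases)
    return lambda x: table.get(tuple(x[f] for f in fields), default)

def build_overall_model(training, fields):
    """Build one subordinate model per pattern of known inputs."""
    submodels = {}
    for k in range(len(fields) + 1):
        for known in combinations(fields, k):
            # A case is usable if every field in `known` has a value.
            # Cases whose other fields are already missing satisfy the
            # pattern directly; cases whose other fields are known satisfy
            # it indirectly, as if those values were replaced by missing.
            usable = [(x, y) for x, y in training
                      if all(x[f] is not MISSING for f in known)]
            if usable:
                submodels[frozenset(known)] = fit_base(usable, known)
    return submodels

def predict(submodels, x, fields):
    """Apply the subordinate model whose missing-value tests match x,
    falling back to the no-inputs model if that pattern was never built."""
    known = frozenset(f for f in fields if x[f] is not MISSING)
    return submodels.get(known, submodels[frozenset()])(x)

fields = ["region", "owner"]
training = [
    ({"region": "N", "owner": "y"}, 1.0),
    ({"region": "N", "owner": "n"}, 0.0),
    ({"region": "S", "owner": "y"}, 1.0),
    ({"region": "S", "owner": MISSING}, 0.0),
    ({"region": MISSING, "owner": "n"}, 0.0),
]
model = build_overall_model(training, fields)
print(predict(model, {"region": MISSING, "owner": "y"}, fields))
```

In this sketch the subordinate model that applies when "region" is missing is trained on four cases: one in which "region" really is missing and three in which it is known but ignored. The three ignored-value cases are the ones that satisfy the application conditions only indirectly.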
In a particularized expression, for example, the novel method can be used in combination with classification and regression trees, classification and regression rules, or stepwise regression. The novel method thus has great general utility and can be used to solve prediction problems in numerous applications involving data with missing values.