Predictive modeling is the process by which a model is created or chosen to best predict the probability of an outcome. Generally, the model is chosen on the basis of detection theory to estimate the probability of an outcome given a certain amount of input data (for example: given an e-mail, determining how likely it is that the e-mail is spam). Thus, given a predefined set of features (indicators) X, predictive modeling aims at predicting the probability P(Y|X) of a specific outcome Y. This task can be seen as a search for a “true” probability distribution P(Y|X), which, however, is not directly observable. Rather, one has to generate an approximating distribution, which should be chosen in such a way that the risk of false prediction over an underlying distribution P(X) of features X is minimized. As a consequence, achieving good predictions for combinations of features X that appear frequently in the application area should be given high priority, while combinations that are expected to occur very rarely can be ignored.
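The estimation of P(Y|X) from examples can be illustrated with a minimal sketch. The data set, the binary indicators, and the helper function below are all invented for illustration; here the conditional probability is simply approximated by empirical relative frequencies over the training examples.

```python
from collections import Counter

# Toy training set: each example is (features X, label Y).
# X is a tuple of binary indicators (e.g. "contains the word 'free'",
# "sender unknown"); Y is 1 for spam, 0 for ham. All data are invented.
training = [
    ((1, 1), 1), ((1, 0), 1), ((1, 1), 1),
    ((0, 0), 0), ((0, 1), 0), ((0, 0), 0), ((1, 0), 0),
]

# Estimate P(Y=1 | X=x) directly from relative frequencies: for every
# observed feature combination, count how often it is labelled spam.
totals = Counter(x for x, _ in training)
spam = Counter(x for x, y in training if y == 1)

def p_spam_given(x):
    """Empirical estimate of P(Y=1 | X=x); None if x was never observed."""
    if totals[x] == 0:
        return None
    return spam[x] / totals[x]

print(p_spam_given((1, 1)))  # both indicators present -> 1.0 on this toy data
print(p_spam_given((1, 0)))  # ambiguous combination -> 0.5
```

The last case shows the practical limitation discussed above: combinations of X that never occur in the training set yield no estimate at all, which is why the model must concentrate on combinations that are frequent under the underlying distribution P(X).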
In reality, neither the “true” probability distribution P(Y|X) nor the “true” distribution of features P(X) is completely known. Rather, they are approximated based on a training set of examples. The underlying assumption is that the “true” distributions P(X), P(Y|X) will behave just like those of the training examples. This is, however, often not the case. Moreover, the training set of examples may be noisy; in this case, adapting the model perfectly to the training data would lead to “over-fitting” and would yield a model that does not accurately reflect the “true” distributions P(X), P(Y|X). On the other hand, it may be known beforehand that the “true” distribution of features P(X) in an actual application domain differs from the distribution of the training data. If, for example, a model predicting the spread of a given disease is to be generated, the training data may be biased, since only a small fraction of the people testing positive may have been identified while it is known that the actual percentage is much higher. In this case, the distribution of the positive samples in the training data does not reflect the “true” distribution of the infected people.
If it is known that the training data exhibit a different distribution than the actual real-world data, this knowledge can be used to adapt the process of finding an optimal prediction model. Specifically, algorithms have been developed which are able to take a given distribution P(X) or P(Y) into account and combine it with the training data. The result is a model that obeys the constraints imposed by P(X) or by P(Y) and still approximates the training data as well as possible.
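One simple way such a constraint on a given label distribution P(Y) can be combined with a trained model is prior correction via Bayes' rule: the posteriors produced under the training prior are re-weighted by the ratio of the “true” prior to the training prior and renormalized. The function name and the numbers below are purely illustrative; this is a generic sketch, not the specific algorithm of any of the referenced products.

```python
def adjust_posterior(p_y_given_x, p_train, p_true):
    """Re-weight posteriors P(Y|X) trained under prior p_train(Y) so that
    they are consistent with a known deployment prior p_true(Y):
    P'(Y|X) is proportional to P(Y|X) * p_true(Y) / p_train(Y)."""
    unnorm = {y: p_y_given_x[y] * p_true[y] / p_train[y] for y in p_y_given_x}
    z = sum(unnorm.values())
    return {y: v / z for y, v in unnorm.items()}

# The training data contained 50% positives, but in the application
# domain the "true" positive rate is known to be only 10%.
post = adjust_posterior(
    {1: 0.8, 0: 0.2},          # posteriors from the trained model
    {1: 0.5, 0: 0.5},          # prior of the training data
    {1: 0.1, 0: 0.9},          # known "true" prior P(Y)
)
print(post[1])  # the corrected positive posterior drops well below 0.8
```

The corrected model still approximates the training data as well as possible while obeying the constraint imposed by P(Y).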
In the case of a given distribution P(Y) of labels Y, this may be achieved using a cost sensitive classifier, as described in US 2008/0065572 A1. Such classifiers are supported by most state-of-the-art predictive analytics tools, such as IBM® SPSS® software or IBM® INFOSPHERE® WAREHOUSE (IBM, INFOSPHERE, and SPSS are trademarks of International Business Machines Corporation in the United States, other countries, or both).
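The generic decision rule behind cost-sensitive classification can be sketched as follows: instead of predicting the most probable label, predict the label that minimizes the expected misclassification cost under the model's posteriors. The function and cost matrix below are an invented illustration of this general principle, not the method of US 2008/0065572 A1 or of the cited products.

```python
def cost_sensitive_predict(posteriors, cost):
    """Pick the label whose expected misclassification cost is lowest.
    posteriors[y] is P(Y=y | X); cost[pred][true] is the cost incurred
    when predicting `pred` while `true` is the actual label."""
    def expected_cost(pred):
        return sum(posteriors[true] * cost[pred][true] for true in posteriors)
    return min(posteriors, key=expected_cost)

# Illustrative cost matrix: a false negative (cost 10) is ten times
# worse than a false positive (cost 1); correct predictions cost 0.
cost = {1: {1: 0, 0: 1}, 0: {1: 10, 0: 0}}

# Even with only a 20% posterior for the positive class, the expected
# cost of predicting 1 (0.8) is lower than that of predicting 0 (2.0).
print(cost_sensitive_predict({1: 0.2, 0: 0.8}, cost))  # prints 1
```

Varying the cost matrix effectively shifts the decision boundary, which is how such classifiers accommodate a given label distribution P(Y).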
In the case of a given distribution P(X) of indicators X, this may be achieved by rejection sampling or by using example weights, as described in “Cost-Sensitive Learning by Cost-Proportionate Example Weighting”, by B. Zadrozny et al., Proceedings of the Third IEEE International Conference on Data Mining (2003), p. 435 ff. Such methods are only supported for some algorithms on a product level; however, most algorithms can be extended accordingly.
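Both techniques can be sketched briefly; the toy data and the weight function are invented, and the code below is a generic illustration of the two ideas, not the algorithm of the cited paper. With example weighting, each training example contributes in proportion to w(x) = p_true(x)/p_train(x); with rejection sampling, each example is kept with probability w(x)/max w, which yields an unweighted sample approximately drawn from the target distribution P(X).

```python
import random

# Toy examples: (indicator x, label y); invented data.
examples = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]

# Importance weights w(x) = p_true(x) / p_train(x): suppose x=1 is
# twice as common in the application domain as in the training data.
def weight(x):
    return 2.0 if x == 1 else 1.0

def weighted_positive_rate(examples, weight):
    """Empirical P(Y=1) when every example is weighted by w(x)."""
    total = sum(weight(x) for x, _ in examples)
    pos = sum(weight(x) for x, y in examples if y == 1)
    return pos / total

def rejection_sample(examples, weight, max_w, rng):
    """Keep each example with probability w(x) / max_w."""
    return [e for e in examples if rng.random() < weight(e[0]) / max_w]

print(weighted_positive_rate(examples, weight))  # 5/9, vs. 3/6 unweighted
sample = rejection_sample(examples, weight, 2.0, random.Random(42))
```

Example weighting uses all of the training data but requires the learning algorithm to accept weights; rejection sampling works with any unmodified algorithm at the price of discarding part of the data.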
All of these methods assume that the information about the actual “true” distribution P(X) or P(Y) is static and that it is known before the model training process starts. However, it is often desirable to be able to apply a single model to a variety of situations with different underlying “true” distributions P(X). Furthermore, an analyst would often like to interactively explore the consequences of different assumptions about a distribution P(X) of indicators X in terms of a what-if analysis. Using a single, global model in all of these situations would have severe disadvantages:
- For one thing, the model would probably not be optimal in the sense of structural risk, as some cases may occur much more often in reality than in the training set and should therefore be given higher attention than others.
- Moreover, the model would probably be quite complex, even though only a small portion of the model may be relevant in the application area.
These problems could be solved by building a new model for each application area. This approach, however, requires significant computational effort and, in the majority of cases, involves prohibitively long response times, which renders the task non-interactive and does not allow the user to interactively try out different assumptions about the “true” distribution P(X). It also poses an organizational security risk, since everybody who employs the model and adapts it to a new application would need to obtain access to the actual source data.
Thus, there is a need for a predictive modeling method which circumvents these problems.