Increasingly, businesses are put in the unique position of having to effectively deal with dynamic customer data use rights in the development and maintenance of predictive or decision models. Whether as a result of the loss of data license rights, withdrawal of customer consent, or other events that affect the composition or availability of customer data used in the development of these models, there is a need to account for such events in a way that is efficient and statistically valid, without resorting to automatic model retraining software. A rush to automatic model retraining software in such situations would only erode an ability to build good models and would have a direct economic impact on companies that depend on models derived on customer data to provide these services.
Two models can be defined as equivalent if they are trained on the same developmental data and their model architecture and model parameters are identical. A model is said to be resilient if minor changes in the modeling data membership does not lead to changes in the model architecture and parameters. If a model remains resilient after prescribed changes in the developmental sample, the original model is said to remain valid. If a data point is removed from the developmental sample data of the model, the resultant model would almost always have changes in the model parameters, however small. In such cases, the original model may cease to be valid. When a customer's data is used to build a model, and the customer's data is subsequently removed from the developmental sample, the only way to ensure the validity of the model after removing a data point is by ensuring that the developmental data remains unchanged.
While building models, usually a subset of available data is used for modeling through a sampling process. This presents an opportunity to remove the data-point of a withdrawn customer data record from the developmental dataset, and replace it with the data point of another customer's data record which has identical distribution of values but was not sampled earlier for the developmental dataset. We call such customers “surrogates” of the original customer. This approach allows the developmental data to remain unchanged, thus ensuring the model remains valid. All that changes is the membership for the coverage of the data records in the modeling data set. Coverage of a data record is the number of surrogates it has. It should be noted that a given customer might have more than one data records.
The challenge faced often in identifying a surrogate arises from the fact that the data has many real valued features. Due to the real valued nature, such features together have an infinite number of unique combinations. This implies that for a given customer, it would be impossible to find an identical customer who has the exact same values for each of the features. This necessitates developing a discretization of the feature space to ensure proper coverage.