Customer behavior modeling is the creation of a mathematical model to represent the common behaviors observed among particular groups of customers in order to predict how similar customers will behave under similar circumstances. Models are typically based on data mining of customer data, and each model can be designed to answer one or more questions at one or more particular periods in time. For example, a customer model can be used to predict what a particular group of customers will do in response to a particular marketing action. If the model is sound and the marketer follows the recommendations it generates, then the marketer will observe that a majority of the customers in the group respond as predicted by the model.
While behavior modeling is a beneficial tool, access to data can present a significant hurdle in training the model. In particular, models need large datasets in order to be properly trained, and only after a model is properly trained can it be applied. Previously, models were trained on datasets that included information regarding actual people. These datasets, generally referred to as original datasets, include real biographical, demographic, and even financial information about real people. Much of this information can be sensitive, and even though the data in the original dataset can be anonymized, the use of original datasets has significant privacy implications.
In addition to raising privacy issues, original datasets can lack sufficient samples of data to train a model. Problems associated with a small dataset are numerous, and can include (i) overfitting, which can be more difficult to avoid and which can extend to the validation set as well, (ii) outliers, which can become much more dangerous, and (iii) noise, which can have a greater influence on the trained model.
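The effect of a single outlier in a small dataset can be illustrated with a minimal sketch. The data values, the one-feature layout, and the helper names `nearest_neighbor_predict` and `accuracy` are hypothetical and chosen only for illustration; a 1-nearest-neighbor classifier is used here simply because it memorizes its training data, not because any particular system described herein uses it.

```python
def nearest_neighbor_predict(train, x):
    """Return the label of the training point closest to x (1-nearest-neighbor)."""
    _, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label


def accuracy(train, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(
        1 for x, label in samples if nearest_neighbor_predict(train, x) == label
    )
    return correct / len(samples)


# Hypothetical small dataset: one feature per sample, label 1 for large values.
# The point (3.0, 1) is an outlier: its neighbors all carry label 0.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (6.0, 1), (7.0, 1), (8.0, 1)]
validation = [(1.4, 0), (2.8, 0), (3.2, 0), (6.6, 1), (7.4, 1)]

train_accuracy = accuracy(train, train)      # the model memorizes its own data
val_accuracy = accuracy(train, validation)   # the outlier misleads nearby points

print("training accuracy:", train_accuracy)
print("validation accuracy:", val_accuracy)
```

In this sketch the training accuracy is 1.0 while the validation accuracy is 0.6, because the two validation samples nearest the outlier inherit its label. With only six training samples, one mislabeled or atypical point controls a sizable share of the feature space.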
In contrast to original datasets, synthetic datasets can be generated and used to train a model. Synthetic datasets can be based on the original datasets, and/or can include information that is similar to the original datasets. While it is beneficial to use synthetic datasets to train models, a model trained with a synthetic dataset can still produce misclassifications. Some systems attempt to address these misclassifications by feeding the same synthetic dataset back into the model being trained (e.g., along with the original dataset), weighting the synthetic dataset differently than the original dataset. However, such techniques can be laborious, manual processes that still suffer from misclassifications of the data by the model.
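The weighting approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular system: the toy data, the `SYNTHETIC_WEIGHT` value of 0.5, the choice of logistic regression trained by gradient descent, and the function names `train_weighted` and `predict` are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_weighted(features, labels, weights, lr=0.5, epochs=3000):
    """Fit logistic regression by gradient descent on a per-sample weighted loss."""
    n_features = len(features[0])
    coef = [0.0] * n_features
    bias = 0.0
    total_weight = sum(weights)
    for _ in range(epochs):
        grad_coef = [0.0] * n_features
        grad_bias = 0.0
        for x, y, w in zip(features, labels, weights):
            error = sigmoid(sum(c * xi for c, xi in zip(coef, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_coef[j] += w * error * xi  # each sample scaled by its weight
            grad_bias += w * error
        coef = [c - lr * g / total_weight for c, g in zip(coef, grad_coef)]
        bias -= lr * grad_bias / total_weight
    return coef, bias


def predict(coef, bias, x):
    return 1 if sigmoid(sum(c * xi for c, xi in zip(coef, x)) + bias) >= 0.5 else 0


# Hypothetical toy data: one feature per sample.
original_x = [[0.0], [1.0], [3.0], [4.0]]
original_y = [0, 0, 1, 1]
synthetic_x = [[0.5], [1.5], [2.5], [3.5]]
synthetic_y = [0, 0, 1, 1]

SYNTHETIC_WEIGHT = 0.5  # assumed value: synthetic samples count half as much

features = original_x + synthetic_x
labels = original_y + synthetic_y
weights = [1.0] * len(original_x) + [SYNTHETIC_WEIGHT] * len(synthetic_x)

coef, bias = train_weighted(features, labels, weights)

# Collect the samples the trained model still gets wrong.
misclassified = [x for x, y in zip(features, labels) if predict(coef, bias, x) != y]
print("misclassified samples:", misclassified)
```

Note that the weight is a hand-picked constant in this sketch. Choosing it, and re-running training whenever misclassifications persist, is the kind of manual tuning the paragraph above identifies as laborious.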
Thus, it may be beneficial to provide an exemplary system, method, and computer-accessible medium for determining misclassifications in models and generating target data for improving model performance which can overcome at least some of the deficiencies described herein above.