Artificial intelligence systems used to determine consumer behaviors or generate purchase recommendations typically require efficient data and model pipelines, so that recommendations can be provided and the corresponding models retrained quickly. Training artificial intelligence systems can require substantial amounts of training data. Preparing training data for such systems is time-consuming, especially for systems designed to operate on sensitive data, such as customer financial records or patient healthcare data. Potentially sensitive data must be anonymized. Furthermore, regulations governing the storage, transmission, and distribution of such data can inhibit application development by forcing the development environment to comply with these burdensome regulations.
Synthetic data can be generally useful for testing and training artificial intelligence systems. However, existing methods of creating synthetic data are slow and error-prone. For example, attempts to automatically desensitize data using regular expressions or similar methods require substantial expertise and can fail when sensitive data is present in unanticipated formats or locations. Manual attempts to desensitize data can fall victim to human error. Neither approach will create synthetic data having statistical characteristics similar to those of the original data, limiting the utility of such data for training and testing purposes.
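As a minimal sketch of the failure mode described above, consider a regular-expression rule that anticipates one sensitive-data format and silently misses the same value written differently. The pattern, function name, and records here are hypothetical illustrations, not part of any disclosed system:

```python
import re

# Hypothetical desensitization rule: redact U.S. Social Security numbers
# written in the common XXX-XX-XXXX format.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def desensitize(record: str) -> str:
    """Replace anything matching the anticipated SSN format."""
    return SSN_PATTERN.sub("[REDACTED]", record)

# The rule works when the data matches the anticipated format...
print(desensitize("Customer SSN: 123-45-6789"))
# ...but the same value in an unanticipated format passes through unredacted.
print(desensitize("Customer SSN: 123 45 6789"))
```

Even when every anticipated pattern is covered, the redacted output carries none of the statistical structure of the original records, which is the further limitation noted above.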
Accordingly, a need exists for improved systems and methods of creating synthetic data for testing or training artificial intelligence systems.