Training artificial intelligence systems can require substantial amounts of training data. Furthermore, when used with data dissimilar from the training data, artificial intelligence systems may perform poorly. These characteristics can create problems for developers of artificial intelligence applications designed to operate on sensitive data, such as customer financial records or patient healthcare data. Regulations governing the storage, transmission, and distribution of such data can inhibit application development by forcing the development environment itself to comply with burdensome requirements.
Furthermore, synthetic data can be generally useful for testing applications and systems. However, existing methods of creating synthetic data can be extremely slow and error-prone. For example, attempts to automatically desensitize data using regular expressions or similar methods require substantial expertise and can fail when sensitive data is present in unanticipated formats or locations. Manual attempts to desensitize data can fall victim to human error. Neither approach creates synthetic data having statistical characteristics similar to those of the original data, which limits the utility of such data for training and testing purposes.
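As an illustration of the regular-expression failure mode described above, the following is a minimal sketch and not a representation of any particular existing tool. The pattern, the placeholder string, the `desensitize` helper, and the sample records are all hypothetical and chosen only for demonstration. The sketch assumes the author of the pattern anticipated identifiers written with hyphens, and shows that the same identifier written in other formats passes through unredacted.

```python
import re

# Hypothetical pattern: matches identifiers written as 123-45-6789 only.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def desensitize(text: str) -> str:
    """Replace every match of SSN_PATTERN with a fixed placeholder."""
    return SSN_PATTERN.sub("[REDACTED]", text)


# Made-up sample records; none of these values come from real data.
records = [
    "SSN: 123-45-6789",         # anticipated format
    "SSN: 123 45 6789",         # spaces instead of hyphens
    "notes: ssn is 123456789",  # no separators, in a free-text field
]

cleaned = [desensitize(record) for record in records]
for line in cleaned:
    print(line)
```

Only the first record is redacted; the second and third are returned unchanged because their formats were not anticipated by the pattern. Even where redaction succeeds, the placeholder carries none of the statistical properties of the original values.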
Accordingly, a need exists for systems and methods of creating synthetic data similar to existing datasets.