Training artificial intelligence systems can require substantial amounts of training data. Furthermore, when used with data dissimilar from the training data, artificial intelligence systems may perform poorly. These characteristics can create problems for developers of artificial intelligence applications designed to operate on sensitive data, such as customer financial records or patient healthcare data. Regulations governing the storage, transmission, and distribution of such data can inhibit application development by forcing the development environment to comply with these burdensome regulations.
Furthermore, synthetic data can be generally useful for testing applications and systems. Such applications and systems may implement one or more models. However, such models perform better when they are applied to data similar to the data used to train them. Sensitive data cannot be widely distributed for use in training models, forcing application developers to choose between accuracy and training data security. Existing methods of creating synthetic data can be extremely slow and error-prone. For example, attempts to automatically desensitize data using regular expressions or similar methods require substantial expertise and can fail when sensitive data is present in unanticipated formats or locations. Manual attempts to desensitize data can fall victim to human error. Neither approach will create synthetic data having statistical characteristics similar to those of the original data, limiting the utility of such data for training and testing purposes.
Moreover, it is known that neural networks are more accurate when performing specific tasks than when performing general tasks. However, training neural networks for specific tasks, such as parsing specific log files, requires a large data set for each specific task. This is not always practical.
Moreover, training neural networks to parse specific log files in turn requires selecting an appropriately trained parser for each incoming log file. Furthermore, new types of log files must be recognized in order to trigger training of a new specific neural network.
Accordingly, a need exists for systems and methods of creating synthetic data similar to existing datasets. Additionally, a need exists for systems and methods of training parsers specific to particular data sets and for appropriate selection and training of the specific parsers.