The present invention relates generally to the field of machine learning, and more particularly to generating training data for classifiers and other machine learning modules.
In recent years, machine learning has come to dominate numerous fields of computer problem solving. In particular, machine learning modules have been successful in gleaning knowledge from voluminous unstructured natural language data, such as rich corpora of scientific, technical, and medical texts. Such rich data, however, is not always available. For example, in such fields as medical diagnosis and financial fraud detection, patient records and customer financial records may be highly restricted, both in terms of access to the data and permissible uses of the data. That is, data science providers may be unable to access sufficient data to train classifiers or other machine learning models, or, if they can access the data, the authorized uses of the data may be too limited to train classifiers well. Accordingly, data scientists continue to face challenges in obtaining sufficient training corpora for machine learning products of all kinds.