Classification systems may sort or categorize many types of documents, files, messages, and other portions of data. For example, a classifier may distinguish between malicious and non-malicious files, archive documents within a database, or sort emails received by a messaging service. Many classification systems analyze documents using a semi-autonomous model or algorithm (e.g., a machine learning algorithm) that compares features of the documents with typical and/or representative traits of various document classes. Such algorithms may identify the distinctive traits of each class by examining a large set of example documents (e.g., a corpus).
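As a rough illustration of the trait-comparison idea described above, the following Python sketch learns per-class word counts from a small labeled corpus and assigns a new document to the class whose learned words it overlaps most. The function names, the whitespace tokenizer, the frequency-overlap score, and the sample documents are all assumptions made for illustration; they are not drawn from the disclosure and do not represent any particular classification system.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase whitespace-separated tokens (a deliberately naive choice)."""
    return text.lower().split()


def learn_class_traits(corpus):
    """Build per-class token counts from an iterable of (text, label) pairs."""
    traits = defaultdict(Counter)
    for text, label in corpus:
        traits[label].update(tokenize(text))
    return dict(traits)


def classify(text, traits):
    """Return the label whose learned token counts best overlap the document's tokens."""
    tokens = tokenize(text)

    def score(label):
        counts = traits[label]
        total = sum(counts.values())
        # Normalize by class size so a class with more training text is not favored.
        return sum(counts[t] for t in tokens) / total if total else 0.0

    # Iterate labels in sorted order so ties resolve deterministically.
    return max(sorted(traits), key=score)


# Hypothetical four-document corpus; a real corpus would be far larger.
EXAMPLE_CORPUS = [
    ("win a free prize now", "spam"),
    ("free money claim prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status and meeting notes", "ham"),
]

if __name__ == "__main__":
    learned = learn_class_traits(EXAMPLE_CORPUS)
    print(classify("claim your free prize", learned))
    print(classify("monday meeting notes", learned))
```

With only four training documents, the sketch can separate the two sample classes, but any document containing words outside the corpus contributes nothing to either score, which hints at why practical classifiers depend on much larger corpora.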
Classification systems generally require large numbers (e.g., thousands, if not millions) of training documents to learn how to classify documents accurately and efficiently. Unfortunately, traditional methods for gathering the corpora used by classification algorithms may require extensive time and/or resources. For example, a conventional classifier may depend on human experts to hand-select each document used to train a machine learning algorithm. This process may require thousands of hours of work and is cost-prohibitive for many classification services. The instant disclosure, therefore, identifies and addresses a need for systems and methods for generating training documents used by classification algorithms.