Today, there is increasing interest in the use of machine learning for analyzing data. Machine learning refers to the design and development of computer algorithms that allow computers to recognize complex patterns and make intelligent decisions based on empirical data.
Typically, a machine learning system that performs text classification on documents includes a classifier. The classifier is provided training data in which each document is already labeled (e.g., identified) with a correct label or class. The labeled document data is used to train a learning algorithm of the classifier which is then used to label/classify similar documents. The accuracy of the classifier is inextricably dependent upon the quality and quantity of correctly labeled documents included in the training data.
Typically, training data for the classifier is derived from experts that manually assign class labels to documents. Manual assignment, however, inherently exhibits a certain level of inconsistency because experts with varying levels of domain knowledge and experience may interpret the same class differently. In addition, the tedious nature of manual assignment can further aggravate the requirement that large amounts of correctly labeled documents be provided to classifiers in order to generalize well. Furthermore, manual assignment of class labels by experts can be an expensive process.
Accordingly, there is a need for improved systems and techniques for generating training data for classifiers.