Many algorithms in the field of document transformation are based on learning the statistical regularities of a training set and applying those regularities to unseen documents, for instance to determine the document type. These learning algorithms typically require a significant amount of training data labeled with the correct decisions. For instance, in the case of automatic document classification, a number of documents for each category would be prepared, so that the algorithm can learn to associate aspects of the documents with their category.
In many situations, verification of the correctness of the training data may be performed to ensure the high quality (and thus success) of the application of learning algorithms. Currently, this verification is performed manually. For example, an experienced user who has knowledge of all possible categories may inspect one document at a time and may correct its label if a mistake is present.
All current processes used to create and verify training data are very time- and cost-intensive, usually requiring experts in the subject matter to label examples. Additionally, if a hierarchy of document types is large, correcting the label of an example requires significant cognitive effort, since the details of sometimes several hundred categories need to be recalled. Furthermore, manual labeling and verification usually produce many more training examples than are strictly necessary, since it cannot be determined when the training data is of sufficient quality and quantity for a statistical classifier to operate with sufficient performance.
There is thus a need for addressing these and/or other issues associated with the prior art.