Training sets are used in automatic categorization of documents, to establish precision and recall curves and to train automatic categorization engines to categorize documents correctly. Precision and recall curves are standard measures of effective categorization and information retrieval. Precision is a measure of the proportion of documents retrieved that are relevant to the intended result. Recall is a measure of the coverage of a query, for instance the number of documents retrieved that match an intended result, compared to the number of documents available that match the intended result. Precision and recall curves generated from a training set are used to set criteria for automatically assigning documents to categories. The training set typically includes documents with categories that have been editorially established or verified by a human.
Errors in categorization include failure to assign a document to the category in which it belongs and assignment of the document to a category in which it does not belong. One cause of failure to assign a document is so-called inadequate corroborative evidence of the correct categorization by sufficiently similar documents. In other words, the training set does not include enough sufficiently similar documents to train a system to produce the desired match. An approach to overcoming inadequate corroborative evidence is to add documents to the training set. One cause of assignment of a document to the wrong category is erroneous categorization examples. Human editors, however careful, can make mistakes when preparing training documents for a categorization model. Sometimes, a document is marked with an incorrect code. Experience shows that many human errors are errors of omission, in which one or more topic codes relevant to the document have been omitted or forgotten. Examples of errors and/or omissions in training set formulation include: (a) the incorrect association between a specific sample document and a category or a controlled vocabulary code, (b) the absence of a correct association between a specific sample document and a category, and (c) the absence of a sufficient number of similar sample documents whose coding would corroborate the coding of a document to be coded.
Categorization-by-example systems tend to propagate, and in some cases amplify, errors in the training set. Accordingly, an opportunity arises to introduce methods and systems to audit training sets and automatically identify sample documents whose coding appears to be inconsistent with other coding of the training set.