Training sets are used in automatic categorization of documents, to establish precision and recall curves and to train automatic categorization engines to categorize documents correctly. Precision and recall curves are standard measures of effective categorization and information retrieval. Precision is a measure of the proportion of documents retrieved that are relevant to the intended result. Recall is a measure of the coverage of a query, for instance the number of documents retrieved that match an intended result, compared to the number of documents available that match the intended result. To construct a training set for automatic categorization, trained professionals exercise nearest neighbor and similarity measure procedures, then use precision and recall curves, generated from the training set, to set criteria for automatically assigning documents to categories. The training set typically includes documents with categories that have been editorially established or verified by a human.
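The definitions of precision and recall given above can be illustrated with a minimal sketch. The function name `precision_recall` and the document identifiers are hypothetical and chosen only for illustration; the sketch assumes that both the retrieved documents and the documents that truly match the intended result are available as collections of identifiers.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query or one category.

    retrieved: identifiers of documents returned for the intended result
    relevant:  identifiers of documents that actually match the intended result
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved documents that are relevant
    # Precision: proportion of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: proportion of available relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents retrieved, three truly relevant,
# two of the retrieved documents are among the relevant ones.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```

In this hypothetical case two of the four retrieved documents are relevant, and two of the three relevant documents were retrieved.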
Errors in categorization include failure to assign a document to the category in which it belongs and assignment of the document to a category in which it does not belong. One cause of this type of error is so-called inadequate corroborative evidence of the correct categorization of similar documents. In other words, the training set does not include similar enough documents to produce the desired match. An approach to overcoming inadequate corroborative evidence is to add documents to the training set.
Adding documents to or deleting documents from a training set implies generating new precision and recall curves, which are used to retune automatic categorization criteria. One way of updating a training set is to generate category scores for each member of the training set using the same categorization algorithm that is used for automatic assignment of documents that have not been editorially categorized. These scores are stored with an editorial category assignment indicator in persistent storage. Data associated with a score entry includes the document identifier, the category identifier, the category score, and a Boolean value indicating whether the same category was editorially assigned to the document. This data is then used to generate precision and recall curves for each category. The curves are analyzed and thresholds adjusted as appropriate. Once the training set has been retuned, it can be used for categorization of documents.
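One possible way to derive per-category precision and recall curves from the stored score entries is sketched below. The names `ScoreEntry`, `precision_recall_curves`, and `pick_threshold` are hypothetical, as is the sample data. The sketch assumes that every editorially assigned (document, category) pair has a score entry, so that recall can be computed from the entries alone, and it uses one simple threshold policy (highest recall subject to a minimum precision) purely as an illustration of threshold adjustment.

```python
from collections import defaultdict
from typing import NamedTuple


class ScoreEntry(NamedTuple):
    """One stored score record, with the four fields described in the text."""
    doc_id: str
    category_id: str
    score: float
    editorially_assigned: bool


def precision_recall_curves(entries):
    """Map each category to a list of (threshold, precision, recall) points."""
    by_category = defaultdict(list)
    for entry in entries:
        by_category[entry.category_id].append(entry)

    curves = {}
    for category, rows in by_category.items():
        total_positive = sum(r.editorially_assigned for r in rows)
        ranked = sorted(rows, key=lambda r: r.score, reverse=True)
        curve, true_pos = [], 0
        for i, row in enumerate(ranked, start=1):
            true_pos += row.editorially_assigned
            # For tied scores, emit a single point after the last tied row.
            if i < len(ranked) and ranked[i].score == row.score:
                continue
            precision = true_pos / i  # i documents score at or above threshold
            recall = true_pos / total_positive if total_positive else 0.0
            curve.append((row.score, precision, recall))
        curves[category] = curve
    return curves


def pick_threshold(curve, min_precision):
    """Illustrative policy: best recall among points meeting min_precision."""
    candidates = [point for point in curve if point[1] >= min_precision]
    if not candidates:
        return None
    return max(candidates, key=lambda point: point[2])[0]


# Hypothetical score entries for a single category.
entries = [
    ScoreEntry("d1", "sports", 0.9, True),
    ScoreEntry("d2", "sports", 0.7, False),
    ScoreEntry("d3", "sports", 0.6, True),
    ScoreEntry("d4", "sports", 0.2, False),
]
curves = precision_recall_curves(entries)
threshold = pick_threshold(curves["sports"], min_precision=0.6)
```

Because each curve is rebuilt from all score entries of a category, this style of computation touches every stored entry for that category whenever the training set changes.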
Updating a large training set to add a few documents, for instance to provide additional evidence supporting a particular categorization, can be time-consuming and computationally taxing when the nearest neighbors and similarity scores are recomputed and category thresholds are adjusted for the entire training set. Therefore, there is an opportunity to improve on training set updating by incremental updating.