1. Technical Field
The disclosed subject matter relates generally to document classification systems and, more particularly, to maintaining a representative data set in a document classification system.
2. Discussion of the Related Art
A document classification system comprises a knowledge base (KB) that can be trained to classify documents into categories, based on information included in a representative data set (RDS). When a document is to be classified, a statistical analysis of the document is performed and, based on the information in the KB, the category that best matches the target document is selected as its classification. The RDS may not contain enough information, or its data may become outdated over time. In either case, the data in the RDS is not truly representative of the different document classes, and the classification system may not be as accurate as desired.
A common practice in example-based classification is to train a KB from scratch at initialization and to periodically retrain the KB thereafter. This practice gives high accuracy but requires periodic maintenance by a human operator, as well as keeping a large set of training documents available. A second common practice is to add incremental feedback to an existing KB. This second approach is convenient from the maintenance perspective, but requires great care to avoid introducing bias that degrades the KB.
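
The two practices described above can be illustrated with a minimal sketch. It is not taken from the disclosed subject matter: the class KnowledgeBase, the functions add_feedback, classify and train_from_scratch, and the use of a smoothed naive Bayes word-count model as the statistical analysis are all assumptions chosen for illustration only.

```python
import math
from collections import Counter, defaultdict


class KnowledgeBase:
    """Hypothetical toy KB: per-category word counts scored with naive Bayes."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of documents

    def add_feedback(self, text, category):
        """Incremental practice: fold one labeled document into the existing KB."""
        self.word_counts[category].update(text.lower().split())
        self.doc_counts[category] += 1

    def classify(self, text):
        """Return the category with the highest smoothed log-probability."""
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category in sorted(self.doc_counts):
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(vocab)  # Laplace smoothing
            score = math.log(self.doc_counts[category] / total_docs)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best, best_score = category, score
        return best


def train_from_scratch(rds):
    """Retraining practice: rebuild the KB from the whole RDS."""
    kb = KnowledgeBase()
    for text, category in rds:
        kb.add_feedback(text, category)
    return kb


# Illustrative RDS; the documents and category names are made up.
rds = [
    ("invoice payment due amount", "finance"),
    ("payment received invoice total", "finance"),
    ("meeting agenda schedule room", "admin"),
    ("schedule meeting room booking", "admin"),
]

kb = train_from_scratch(rds)
before = kb.classify("payment")

# Twenty one-sided feedback items skew the KB toward "admin".
for _ in range(20):
    kb.add_feedback("payment reminder", "admin")
after = kb.classify("payment")
```

In this sketch, before is "finance" and after is "admin": a run of unbalanced feedback is enough to change the best category match for the same text, which is the kind of bias the incremental practice has to guard against. Calling train_from_scratch on the original RDS restores the earlier behavior, at the cost of keeping the full training set available.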