1. Field of the Invention
The field of the invention relates to organizing content. More specifically, the field of the invention relates to identifying training documents for training a content classifier.
2. Description of the Related Art
Content classifiers classify documents into categories to facilitate locating information. Statistics-based content classifiers need to evolve continuously to organize unstructured content into categories such that knowledge is easily found. To organize unstructured data, classification engines apply language analytics in conjunction with taxonomies to build customized knowledge bases that are fine-tuned for a particular group of users. Such knowledge bases store data representing statistical information, which associates unstructured content (e.g., a collection of documents, web pages, email messages, etc.) with categories in a logical and consistent manner. To attain high levels of accuracy and adapt to new concepts, the knowledge bases need to be periodically trained and updated so that, over time, accurate and reliable categorization of unstructured data can be achieved.
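By way of illustration only, the following minimal sketch shows one way a knowledge base of this general kind could store per-category term statistics and be updated with additional training documents. It assumes a simple naive Bayes style scoring with add-one smoothing, which is a stand-in chosen for brevity and is not stated in the description above. The names KnowledgeBase, train, classify, and tokenize, and the sample documents, are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class KnowledgeBase:
    """Hypothetical store of per-category term statistics (illustrative only)."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)  # category -> term -> count
        self.doc_counts = Counter()              # category -> training docs seen

    def train(self, text, category):
        """Add one already-classified document to the statistics."""
        self.term_counts[category].update(tokenize(text))
        self.doc_counts[category] += 1

    def classify(self, text):
        """Return the best-scoring category, or None if nothing was trained."""
        vocabulary = {t for counts in self.term_counts.values() for t in counts}
        total_docs = sum(self.doc_counts.values())
        terms = tokenize(text)
        best_category, best_score = None, -math.inf
        for category, counts in self.term_counts.items():
            # Log prior plus smoothed log likelihood of each term (assumed model).
            score = math.log(self.doc_counts[category] / total_docs)
            total_terms = sum(counts.values())
            for term in terms:
                score += math.log((counts[term] + 1) / (total_terms + len(vocabulary)))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Initial training with already-classified documents (made-up examples).
kb = KnowledgeBase()
kb.train("invoice payment due balance", "finance")
kb.train("server outage network restart", "it")
first_result = kb.classify("payment balance overdue")

# Periodic feedback: a further training document adjusts the statistics.
kb.train("phishing email password reset", "it")
second_result = kb.classify("password reset request")
```

In this sketch, the periodic training described above corresponds to further calls to train, which adjust the stored counts without rebuilding the knowledge base.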
Accurate and reliable categorization requires building and maintaining knowledge bases that correspond to specific fields of endeavor. Typically, to achieve this goal, analysts are hired to create initial knowledge bases with data that is already classified, and to periodically provide feedback to the knowledge bases to add new concepts or to adjust existing statistics to provide better classification. However, this leaves analysts with the daunting task of reviewing large quantities of documents to identify the categories required to build the initial knowledge base. Further, the accuracy of deployed knowledge bases is limited by the domain knowledge possessed by the hired analysts, and hence the results may be error-prone or might not correspond to new concepts.