Automated categorization of documents is a powerful tool for developing large document databases for businesses, organizations, and so forth. Typically, automated categorization involves selecting suitable categories and then training the categorizer on an initial set of training documents that are pre-annotated with suitable category labels. The training involves analyzing the pre-annotated documents to identify a vocabulary of words that are indicative of the various categories. Once trained, the categorizer can receive a new document, identify the vocabulary words in that new document, and annotate the document with one or more appropriate category labels based on the identified vocabulary words.
For example, one class of probabilistic categorizers consists of the Naïve Bayes-type categorizers, which employ Bayesian conditional vocabulary word probabilities and assume that those conditional probabilities are statistically independent. Another class consists of the probabilistic latent categorizers (PLC), which are described, for example, in “Method for Multi-Class, Multi-Label Categorization Using Probabilistic Hierarchical Modeling” (Ser. No. 10/774,966 filed Feb. 9, 2004), and in Eric Gaussier et al., “A hierarchical model for clustering and categorising documents”, in Fabio Crestani, Mark Girolami, and Cornelis Joost van Rijsbergen, editors, “Advances in Information Retrieval—Proceedings of the 24th BCS-IRSG European Colloquium on IR Research”, vol. 2291 of Lecture Notes in Computer Science, pages 229-47 (Springer, 2002). The PLC approach employs a co-occurrence categorization model.
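To make the train-then-annotate workflow concrete, the following is a minimal sketch of a Naïve Bayes-type categorizer. It is an illustration only, not the implementation of any cited reference: the function names `train` and `categorize`, the add-one smoothing, and the choice to skip words outside the training vocabulary are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Build a model from (list_of_words, category_label) pairs.

    Hypothetical helper: counts words per category and documents per category.
    """
    word_counts = defaultdict(Counter)  # category -> word -> occurrence count
    doc_counts = Counter()              # category -> number of training documents
    for words, label in labeled_docs:
        word_counts[label].update(words)
        doc_counts[label] += 1
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def categorize(words, model):
    """Return the single most probable category label for a new document."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        # Log prior: fraction of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        for w in words:
            if w not in vocabulary:
                continue  # assumption: words unseen in training carry no evidence
            # Independence assumption: per-word log probabilities simply add.
            # Add-one smoothing (an assumption) avoids log(0) for unseen pairs.
            score += math.log(
                (word_counts[label][w] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Toy pre-annotated training set (invented for illustration).
TRAINING = [
    (["toner", "cartridge", "printer"], "hardware"),
    (["paper", "tray", "printer"], "hardware"),
    (["driver", "install", "update"], "software"),
    (["license", "update", "install"], "software"),
]
model = train(TRAINING)
print(categorize(["toner", "tray"], model))
```

Because the model is built once from the training counts, a word that never appeared in training (such as a newly coined term) is simply ignored by this sketch, however indicative of a category it may be.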
The ability of such probabilistic categorizers to accurately assign category labels to new documents is determined in part by the quality of the training, which in turn depends upon the quantity of the initial training documents as well as upon how representative those documents are of the categories of the categorization system. If, for example, the training documents do not adequately represent certain categories, then the resulting trained categorizer will be less reliable in assigning documents to those categories. Since the initial training documents are pre-annotated, reliability can also be compromised if some of the training documents are improperly categorized.
Moreover, even if the initial collection of training documents is large and adequately representative of the categories of the categorization system, inaccuracies can still arise over time due to drift. For example, as a business develops over time, it may shift its focus from one line of products to another. Similarly, a field of knowledge evolves over time as researchers substantially solve certain problems and move on to new challenges. In contrast, the categorizer is static, being based entirely upon the initial training using the initial collection of documents, and hence does not evolve to track drift in document characteristics over time. Drift can involve changes in the relative frequencies of occurrences of certain vocabulary words in documents of certain categories. Drift can also involve the introduction of entirely new words into the language of documents of certain categories. These new words may be highly indicative of the category, but are not part of the categorization vocabulary since the new words did not exist, or were very infrequently used, at the time of the initial training.
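The two forms of drift described above can be illustrated with a short sketch that compares word usage in the initial training documents against word usage in more recent documents of the same category. This is a hypothetical diagnostic written for this example, not a technique taken from the text: the names `relative_frequencies` and `drift_report` and the toy word lists are assumptions.

```python
from collections import Counter


def relative_frequencies(docs):
    """Map each word to its share of all word occurrences in docs."""
    counts = Counter(w for words in docs for w in words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def drift_report(training_docs, recent_docs):
    """Return (new_words, shifts) for one category.

    new_words: words in recent documents that never occurred in training.
    shifts: change in relative frequency for words present in both sets.
    """
    old = relative_frequencies(training_docs)
    new = relative_frequencies(recent_docs)
    new_words = sorted(set(new) - set(old))
    shifts = {w: new[w] - old[w] for w in old.keys() & new.keys()}
    return new_words, shifts


# Toy documents for a single category (invented for illustration).
training_docs = [["film", "camera", "film"], ["camera", "lens"]]
recent_docs = [["camera", "sensor", "camera"], ["sensor", "camera", "lens", "film"]]

new_words, shifts = drift_report(training_docs, recent_docs)
print(new_words)
print(sorted(shifts.items()))
```

In this toy data, "sensor" is a new word that a static vocabulary would miss, while "film" loses and "camera" gains relative frequency.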
Various approaches have been used to address inadequacies in the initial training and to address drift over time. In some categorization systems, human review of the automated categorization is incorporated, so as to allow manual correction where the trained categorizer erroneously categorizes a document. Because it relies upon human intervention, such an approach is unsatisfactory for maintaining large document databases. Alternatively, the categorizer can be retrained occasionally to account for drift in the documents over time. This approach is computationally intensive, and it also requires substantial human intervention since a new collection of training documents must be gathered and pre-annotated.