The following relates to the classification arts, document organization arts, document processing arts, document storage arts, and so forth.
Document classification is useful to facilitate the organization, processing, and/or indexed storage of documents. In a conventional approach, a small “training” set of documents is annotated with classifications assigned manually, and the training set is used to train an automated classifier. Typically, the automated classifier classifies documents based on characteristics of the categories, represented by category profiles suitable for the type of document (e.g., weights associated with terms, phrases, or other features). In the case of classification based on textual content of documents, category profiles such as language models (e.g., a representation of statistical frequency of class-indicative words in documents) are suitably used.
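By way of illustrative example only, the conventional training-set approach described above can be sketched as follows. The sketch assumes unigram language models with add-one smoothing as the category profiles; the function names (tokenize, train_profiles, log_likelihood, classify), the toy training documents, and the category labels are hypothetical and are chosen solely for illustration, not drawn from any particular classifier.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase whitespace-delimited terms (a simplifying assumption)."""
    return text.lower().split()


def train_profiles(labeled_docs):
    """Build one term-count profile per category from (text, category) pairs."""
    counts = defaultdict(Counter)
    for text, category in labeled_docs:
        counts[category].update(tokenize(text))
    vocabulary = {term for profile in counts.values() for term in profile}
    return dict(counts), vocabulary


def log_likelihood(tokens, profile, vocabulary):
    """Log-probability of the tokens under a category's smoothed unigram model."""
    total = sum(profile.values())
    # Add-one smoothing; the extra +1 reserves probability mass for unseen terms.
    denom = total + len(vocabulary) + 1
    return sum(math.log((profile[t] + 1) / denom) for t in tokens)


def classify(text, profiles, vocabulary):
    """Return the category whose profile assigns the text the highest likelihood."""
    tokens = tokenize(text)
    return max(profiles, key=lambda c: log_likelihood(tokens, profiles[c], vocabulary))


# Hypothetical manually annotated training set (two documents per category).
training_set = [
    ("the team won the football match", "sport"),
    ("the striker scored a late goal", "sport"),
    ("the central bank raised interest rates", "economy"),
    ("markets fell as inflation rose", "economy"),
]

profiles, vocabulary = train_profiles(training_set)
predicted = classify("the bank cut interest rates", profiles, vocabulary)
print(predicted)
```

In this sketch every category requires its own manually labeled documents before a profile can be estimated, which is the property of the conventional approach that the remainder of this background discussion addresses.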
This approach employing a training set can become unwieldy when the set of classes becomes large. The training set should include a representative number of documents for each category. By way of illustrative example, the IPTC taxonomy promulgated by the International Press Telecommunications Council (see, e.g., http://www.iptc.org/, last accessed Jan. 7, 2011) employs 1131 categories. Thus, the training set should include a representative number of training documents manually assigned to each category of this set of 1131 categories. If each category is represented by only ten documents, this entails over 11,000 manual annotations (1131 × 10 = 11,310). (Note that for multi-class categorization a single document may be annotated with more than one class.) Performing such a large number of manual annotations is time-consuming and can lead to human error that compromises the performance of the automated classifier. The risk is greatest when the hierarchy is large, complex, and/or encompasses a diverse knowledge base, so that the human annotator must have broad knowledge of all aspects of the hierarchy. In hierarchical classification (e.g., the IPTC taxonomy is organized hierarchically in five levels), the human annotator is potentially called upon to distinguish fine “shades” of content in order to decide at which hierarchical level a given document should be located. Moreover, in some cases, a labeled training set may be unavailable for a given taxonomy.
The following sets forth improved methods and apparatuses.