The following relates generally to methods, and apparatus therefor, for categorizing documents, and more particularly to a method and apparatus for explaining categorization decisions.
Categorizers, such as statistical categorizers, may be deployed with relative ease if an appropriate amount of pre-processed data exists. For example, learning model parameters of a statistical categorizer on a new collection of several thousand documents may take on the order of minutes to hours depending on the amount of pre-processing required. Such pre-processing may, for example, include any required document transformations and/or linguistic processing and/or annotation associated therewith.
Examples of statistical categorizers include Naïve Bayes, Probabilistic Latent Categorization (PLC), and Support Vector Machines (SVM). The use of Naïve Bayes, PLC, and SVM for categorization is disclosed in the following publications respectively (each of which is incorporated herein by reference): Sahami et al., in a publication entitled “A Bayesian approach to filtering junk e-mail”, published in Learning for Text Categorization: Papers from the 1998 AAAI Workshop; Gaussier et al., in a publication entitled “A Hierarchical Model For Clustering And Categorizing Documents”, published in F. Crestani, M. Girolami and C. J. van Rijsbergen (eds), Advances in Information Retrieval: Proceedings of the 24th BCS-IRSG European Colloquium on IR Research, Lecture Notes in Computer Science 2291, Springer, pp. 229-247, 2002; and Drucker et al., in a publication entitled “Support Vector Machines for Spam Categorization”, IEEE Trans. on Neural Networks, 10:5(1048-1054), 1999.
A problem that exists when using statistical and other categorizers is that it is often difficult to understand which features (e.g., words) influenced a categorization decision (i.e., to assess why the categorizer selected one particular class instead of another in categorizing a document). More specifically, before a categorization decision is made, a document is initially decomposed into a set of features (or terms). The set of features is input into the categorizer and a score is output for each category (or class), with the highest scored category being the categorization decision. Exactly which features in the set influenced the computation of each score is difficult to assess, as each score may be influenced by thousands of different features.
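For illustration only, the decision flow just described can be sketched as follows. The sketch assumes a multinomial Naïve Bayes style scorer with made-up parameter values; the names `featurize` and `class_scores`, the toy vocabulary, and the floor value for unseen terms are hypothetical and are not taken from the cited publications.

```python
import math
from collections import Counter

def featurize(text):
    """Decompose a document into term counts (a simple bag of words)."""
    return Counter(text.lower().split())

def class_scores(features, log_prior, log_likelihood, unseen=math.log(1e-6)):
    """Return one log-score per class: log P(c) + sum_w count(w) * log P(w|c).

    `unseen` is an assumed floor for terms missing from a class's table.
    """
    scores = {}
    for c in log_prior:
        s = log_prior[c]
        for term, count in features.items():
            s += count * log_likelihood[c].get(term, unseen)
        scores[c] = s
    return scores

# Toy, hand-picked parameters (assumptions for illustration only).
log_prior = {"spam": math.log(0.5), "ham": math.log(0.5)}
log_likelihood = {
    "spam": {"free": math.log(0.30), "offer": math.log(0.20), "meeting": math.log(0.01)},
    "ham": {"free": math.log(0.05), "offer": math.log(0.05), "meeting": math.log(0.30)},
}

features = featurize("Free offer free meeting")
scores = class_scores(features, log_prior, log_likelihood)

# The categorization decision is the highest scored class.
decision = max(scores, key=scores.get)
print(decision, scores)
```

Because each score is a sum over every feature present in the document, a real model with thousands of terms gives no direct indication of which terms drove the decision.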
Known methods for solving this problem rely on either non-parametric or model-based approaches to identify the most relevant features. Examples of such approaches are described by Goutte et al., in “Corpus-Based vs. Model-Based Selection of Relevant Features”, Proceedings of CORIA04, Toulouse, France, March 2004, pp. 75-88, and Dobrokhotov et al., in “Combining NLP and Probabilistic Categorisation for Document and Term Selection for Swiss-Prot Medical Annotation”, Proceedings of ISMB-03, Bioinformatics, vol. 19, Suppl. 1, pp. 191-194, 2003, both of which are incorporated herein by reference. However, such known methods tend to be unable to explain a particular categorization decision (i.e., identify those features used to make a categorization decision), and tend to lead to the paradoxical conclusion that the importance of a feature in categorizing a document is independent of its frequency. Accordingly, it would be advantageous to provide an improved method for more accurately assessing which features a categorizer relies on when computing the score for each category and arriving at the categorization decision.
In accordance with the various embodiments, there is provided a method, and apparatus, for explaining a categorization decision. In the method, a score is computed for each of a plurality of classes using model parameter values of a selected categorization model and feature values of a document to be categorized. In addition, a contribution is computed for each of a plurality of features in the document, where the contribution of a selected feature is computed using model-independent, model-dependent, or document-dependent feature selection. A categorization decision is specified using at least the class with the highest score, and an importance is assigned to each of a plurality of the features using the computed contributions to identify, for different ones of the plurality of features, their influence on the categorization decision.
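As a minimal sketch of how such a method might be organized, the example below assumes a generic linear scoring model and one possible document-dependent contribution: the feature value multiplied by the difference between its weight for the winning class and its weight for the runner-up class. This particular formula, the function name `explain`, and the toy weights are assumptions chosen for illustration and should not be read as the contribution measure of any specific embodiment.

```python
def explain(features, weights, bias):
    """Score each class, pick the best, and rank features by contribution.

    Assumed model: score(c) = bias[c] + sum_f value(f) * weights[c][f].
    Assumed contribution of f: value(f) * (w[best][f] - w[runner_up][f]).
    Requires at least two classes.
    """
    scores = {
        c: bias[c] + sum(x * weights[c].get(f, 0.0) for f, x in features.items())
        for c in bias
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]

    contributions = {
        f: x * (weights[best].get(f, 0.0) - weights[runner_up].get(f, 0.0))
        for f, x in features.items()
    }
    # Importance: features ordered by the magnitude of their contribution.
    importance = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return best, scores, importance

# Toy feature values and model parameters (hypothetical).
features = {"invoice": 2, "meeting": 1, "the": 5}
weights = {
    "finance": {"invoice": 1.5, "meeting": 0.1, "the": 0.0},
    "hr": {"invoice": 0.2, "meeting": 0.8, "the": 0.0},
}
bias = {"finance": 0.0, "hr": 0.0}

decision, scores, importance = explain(features, weights, bias)
for feature, contribution in importance:
    print(f"{feature}: {contribution:+.2f}")
```

Under these assumptions the contributions depend on how often a feature occurs in the document, and with equal biases they sum to the score gap between the two top classes, so a frequent but uninformative term such as "the" receives no importance.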