Document classification typically aims to classify textual documents automatically, based on the words, phrases, and word combinations therein (hereafter, “words”). Business applications of document classification have seen increasing interest, especially with the introduction of low-cost micro-outsourcing systems for annotating training corpora. Prevalent applications include, for example, sentiment analysis (e.g., Pang, B., L. Lee, 2008, “Opinion mining and sentiment analysis”, Foundations and Trends in Information Retrieval 2(1-2) 1-135), patent classification, spam identification (e.g., Attenberg, J., K. Q. Weinberger, A. Smola, A. Dasgupta, M. Zinkevich, 2009, “Collaborative email-spam filtering with the hashing-trick”, Sixth Conference on Email and Anti-Spam (CEAS)), news article annotation (e.g., Paaß, G., H. de Vries, 2005, “Evaluating the performance of text mining systems on real-world press archives”), email classification for legal discovery, and web page classification (e.g., Qi, X., B. D. Davison, 2009, “Web page classification: Features and algorithms”, ACM Computing Surveys (CSUR) 41(2) 1-31). Classification models can be built from labeled data sets that encode the frequencies of the words in the documents.
Data-driven text document classification has widespread applications, such as the categorization of web pages and emails, sentiment analysis, and more. Document data are characterized by high dimensionality, with as many variables as there are words and phrases in the vocabulary, often tens of thousands to millions. Many business applications benefit when managers, client-facing employees, and the technical team understand the reasons for classification decisions. Unfortunately, because of the high dimensionality, understanding the decisions made by document classifiers can be difficult. Previous approaches to gaining insight into black-box models typically have difficulty dealing with high-dimensional data.
Further, organizations often desire to understand the exact reasons why classification models make particular decisions. The desire comes from various perspectives, including those of managers, customer-facing employees, and the technical team. Customer-facing employees often deal with customer queries regarding the decisions that are made; it often is insufficient to answer that the magic box said so. Managers may need to “sign off” on models being placed into production, and may prefer to understand how the model makes its decisions, rather than just to trust the technical team or data science team. Different applications have different degrees of need for explanations to customers, with denying credit or blocking advertisements being at one extreme. However, even in applications for which black-box systems are deployed routinely, such as fraud detection (Fawcett, T., F. Provost, 1997, “Adaptive fraud detection”, Data Mining and Knowledge Discovery 1(3) 291-316.), managers still typically need to have confidence in the operation of the system and may need to understand the reasons for particular classifications when errors are made. Managers may also need to understand specific decisions when they are called into question by customers or business-side employees. Additionally, the technical/data science personnel themselves should understand the reasons for decisions in order to be able to debug and improve the models. Holistic views of a model and aggregate statistics across a “test set” may not give sufficient guidance as to how the model can be improved. Despite the stated goals of early research on data mining and knowledge discovery (Fayyad, U. M., G. Piatetsky-Shapiro, P. Smyth, 1996, “From data mining to knowledge discovery: An overview”, Advances in knowledge discovery and data mining. 
American Association for Artificial Intelligence, 1-34), very little work has addressed support for the process of building acceptable models, especially in business situations where various parties must be satisfied with the results.
Popular techniques to build document classification models include, for example, naive Bayes, linear and non-linear support vector machines (SVMs), classification-tree-based methods (often used in ensembles, such as with boosting (Schapire, Robert E., Yoram Singer, 2000, “Boostexter: A boosting-based system for text categorization”, Machine Learning 39(2/3) 135-168)), and many others (e.g., Hotho, A., A. Nürnberger, G. Paass, 2005, “A brief survey of text mining”, LDV Forum 20(1) 19-62). Because of the massive dimensionality, even for linear and tree-based models it can be very difficult to understand exactly how a given model classifies documents. For a non-linear SVM or an ensemble of trees, it is essentially impossible.
Several existing methods for explaining individual classifications have been described, along with the reasons why they are not ideal or suitable for explaining document classifications. An approach to explain classifications of individual instances that is applicable to any classification model was presented by Robnik-Sikonja, M., I. Kononenko, 2008, “Explaining classifications for individual instances”, IEEE Transactions on Knowledge and Data Engineering 20 589-600. This publication describes a methodology that assigns a score to each of the variables, indicating to what extent it influences the data instance's classification. As such, they define an explanation as a real-valued vector e that denotes the contribution of each variable to the classification of the considered data instance x0 by classification model M (see Definition 2 herein). The effect of each attribute of a test instance x0 is measured by comparing the predicted output f(x0) with f(x0\Ai), where x0\Ai stands for the instance without any knowledge about attribute Ai. This is implemented by replacing the actual value of Ai with each possible value for Ai and weighting each prediction by the prior probability of that value. For continuous variables, a discretization method is applied. The larger the change in predicted output, the larger the contribution of the attribute. This change in output can be measured in various ways: simply as the difference in probabilities, as the information difference, or as the weight of evidence. The contributions provided by this technique are very similar to the weights in a linear model, which also denote the relative importance of each variable.
Definition 2. Robnik-Sikonja, M., I. Kononenko, 2008, “Explaining classifications for individual instances”, IEEE Transactions on Knowledge and Data Engineering 20 589-600 define an explanation of the classification of model M for data instance x0 as an m-dimensional real-valued vector:
E_RS(M, x0) = e ∈ R^m, with e_i = f(x0) − f(x0\A_i), i = 1, 2, . . . , m.
The explanation of each attribute can be visualized, graphically showing the magnitude and direction of the contribution of each variable. A simple example is given for the Titanic data set, where the aim is to predict whether a Titanic passenger survived. The instance with a female, adult, third-class passenger that is classified as surviving is explained by the contributions below. The fact that the passenger is female is the main contributor to the prediction, as the contributions for age and class are much smaller and even in the opposite direction.
    class=third, contribution=−0.344
    age=adult, contribution=−0.034
    gender=female, contribution=1.194
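The mechanics of Definition 2 can be sketched as follows. This is a minimal illustration, not code from the cited work: the logistic model, the encoding of the Titanic-style attributes, and the prior probabilities are all hypothetical placeholders.

```python
import numpy as np

def contribution(f, x0, i, values, priors):
    """Contribution e_i = f(x0) - f(x0 \\ A_i): the prediction with
    attribute i known, minus the prediction with attribute i replaced
    by each of its possible values, weighted by the value's prior."""
    f_known = f(x0)
    f_marginal = 0.0
    for v, p in zip(values, priors):
        x = x0.copy()
        x[i] = v
        f_marginal += p * f(x)          # approximates f(x0 \ A_i)
    return f_known - f_marginal

# Hypothetical toy model: survival probability rises with x[2] (gender=female).
def f(x):
    return 1.0 / (1.0 + np.exp(-(2.0 * x[2] - 0.5 * x[0] - 0.1 * x[1])))

x0 = np.array([1.0, 1.0, 1.0])          # class=third, age=adult, gender=female
# Assumed priors for gender values 0/1 (hypothetical numbers):
e2 = contribution(f, x0, 2, values=[0.0, 1.0], priors=[0.65, 0.35])
print(e2)   # positive: 'female' pushes the prediction toward 'survived'
```

Each attribute's contribution is computed independently, which is exactly why the approach misses cases where several variables must change together.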
This basic approach is not able to detect the case where a change in more than one variable is needed in order to obtain a change in predicted value. Strumbelj, E., I. Kononenko, M. Robnik-Sikonja, 2009, “Explaining instance classifications with interactions of subsets of feature values”, Data & Knowledge Engineering 68(10) 886-904 build further on this and propose an Interactions-based Method for Explanation (IME) that can detect the contribution of combinations of feature values. The explanation once again is defined as a real-valued m-dimensional vector denoting variable contributions. First, a real-valued number is assigned to each subset of the power set of feature values. These changes are subsequently combined to form a contribution for each of the individual feature values. In order to assess the output of the model with a subset of variables, instead of weighting over all permutations of the feature values, a model is built using only the variables in the subset. Although the results are interesting, the data sets used have at most 13 dimensions.
There are several drawbacks to this method. First, the time complexity scales exponentially with the number of variables. They report that 241 seconds are needed to explain the classifications of 100 test instances by the random forests model for the highest-dimensional data set (breast cancer Ljubljana, which has 13 features). The authors recognize the need for an approximation method. Second, the explanation is typically not very understandable (by humans), as the explanation is once again a real-valued number for each feature, which denotes to what extent it contributes to the class. They verify their explanations with an expert, who needs to assess whether he or she agrees with the magnitude and direction of the contribution of each feature value.
A game-theoretical perspective on their method is provided by Strumbelj, E., I. Kononenko, 2010, “An efficient explanation of individual classifications using game theory”, Journal of Machine Learning Research 11 1-18, as well as a sampling-based approximation that does not require retraining the model. On low-dimensional data sets they provide results very quickly (on the order of seconds). For the data set with the most features, arrhythmia (279 features), they report that it takes more than an hour to generate an explanation for a prediction of the linear naive Bayes model. They state: “The explanation method is therefore less appropriate for explaining models which are built on several hundred features or more. Arguably, providing a comprehensible explanation involving a hundred or more features is a problem in its own right and even inherently transparent models become less comprehensible with such a large number of features.” Stated within a safe-advertising application: a vector of thousands of values does not provide an answer to the question “Why is this web page classified as containing adult content?” This approach therefore is not suitable for document classification.
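The sampling-based approximation can be sketched as a Monte-Carlo estimate over random feature orderings, in the spirit of the game-theoretic (Shapley-value) formulation. Everything below is a hypothetical illustration, not the cited implementation: the toy model, the baseline vector used to mask “unknown” features, and the sample count are all assumptions.

```python
import random
import numpy as np

def sampled_contribution(f, x0, i, baseline, n_samples=2000, seed=0):
    """Monte-Carlo estimate of feature i's contribution: for random
    feature orderings, measure the marginal effect of revealing x0[i]
    on top of the features that precede it; 'unknown' features are
    masked with a baseline value, so no model retraining is needed."""
    rng = random.Random(seed)
    m = len(x0)
    total = 0.0
    for _ in range(n_samples):
        order = list(range(m))
        rng.shuffle(order)
        known = set(order[:order.index(i)])  # features revealed before i
        x_with = baseline.copy()
        for j in known | {i}:
            x_with[j] = x0[j]
        x_without = baseline.copy()
        for j in known:
            x_without[j] = x0[j]
        total += f(x_with) - f(x_without)    # marginal effect of feature i
    return total / n_samples

# Hypothetical two-feature model and instance:
f = lambda x: 1.0 / (1.0 + np.exp(-(x[0] + 2.0 * x[1] - 1.0)))
x0 = np.array([1.0, 1.0])
e1 = sampled_contribution(f, x0, 1, baseline=np.zeros(2))
```

The cost per feature is linear in the number of samples rather than exponential in the number of features, which is the point of the approximation; the arrhythmia timing quoted above shows it still does not scale to document-sized vocabularies.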
Baehrens, David, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, Klaus-Robert Müller, 2010, “How to explain individual classification decisions”, Journal of Machine Learning Research 11 1803-1831 also define an instance-level explanation as a real-valued vector. In this case, however, the vector denotes the gradient of the classification probability output at the test instance to be explained, and as such defines a vector field indicating in which direction the other classification can be found.
Definition 3. Baehrens, David, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, Klaus-Robert Müller, 2010, “How to explain individual classification decisions”, Journal of Machine Learning Research 11 1803-1831 define an explanation of the classification of model M for data instance x0 as an m-dimensional real-valued vector, obtained as the gradient of the class probability at the instance:
E_B(M, x0) = e ∈ R^m, with e_i = ∂p(x)/∂x_i |_{x=x0}, i = 1, 2, . . . , m.
For SVMs, an approximation function (through Parzen windowing) is used in order to calculate the gradient. In our document classification setup, this methodology in itself does not provide an explanation in the form that is wanted, as it simply gives the direction of steepest descent toward the other class. It could, however, serve as a basis for a heuristic explanation algorithm to guide the search toward those regions where the change in class output is the largest. The exact step size and the minimal set of explaining dimensions (words) would still need to be determined within such an approach.
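Definition 3 can be illustrated with a numerical (finite-difference) gradient of the class-probability output. The smooth two-variable classifier below is a hypothetical stand-in for the models discussed, used only to show what the explanation vector looks like.

```python
import numpy as np

def gradient_explanation(p, x0, eps=1e-5):
    """Central finite-difference estimate of e_i = dp/dx_i at x0,
    i.e. the gradient of the class-probability output (Definition 3)."""
    e = np.zeros_like(x0)
    for i in range(len(x0)):
        up, down = x0.copy(), x0.copy()
        up[i] += eps
        down[i] -= eps
        e[i] = (p(up) - p(down)) / (2 * eps)
    return e

# Hypothetical smooth classifier: probability rises with x[0], falls with x[1].
p = lambda x: 1.0 / (1.0 + np.exp(-(3.0 * x[0] - 2.0 * x[1])))
e = gradient_explanation(p, np.array([0.2, 0.4]))
# e points in the direction of steepest increase of p; here e[0] > 0 > e[1].
```

Note that e is dense: every dimension gets a nonzero value, which is exactly why such a vector over a document vocabulary of thousands of words is not, by itself, a usable explanation.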
Inverse Classification. Sensitivity analysis is the study of how changes in the inputs influence the change in the output, and can be summarized by Eq. (4):
f(x + Δx) = f(x) + Δf  (4)
Inverse classification is related to sensitivity analysis and involves “determining the minimum required change to a data point in order to reclassify it as a member of a (different) preferred class” (Mannino, M., M. Koushik, 2000, “The cost-minimizing inverse classification problem: A genetic algorithm approach”, Decision Support Systems 29 283-300). This problem is called the inverse classification problem, since the usual mapping is from a data point to a class, while here it is the other way around. Such information can be very helpful in a variety of domains: companies, and even countries, can determine which macro-economic variables should change so as to obtain a better bond, competitiveness, or terrorism rating. Similarly, a financial institution can provide (more) specific reasons why a customer's application was rejected, by simply stating how the customer can change to the good class, e.g., by increasing income by a certain amount. A heuristic, genetic-algorithm-based approach that uses a nearest-neighbor model is presented in Mannino, M., M. Koushik, 2000, “The cost-minimizing inverse classification problem: A genetic algorithm approach”, Decision Support Systems 29 283-300.
Classifications made by an SVM model are explained in Barbella, D., S. Benzaid, J. M. Christensen, B. Jackson, X. V. Qin, D. R. Musicant, 2009, “Understanding support vector machine classifications via a recommender system-like approach”, by determining the minimal change in the variables needed in order to reach a point on the decision boundary. Their approach solves an optimization problem with SVM-specific constraints. A slightly different definition of inverse classification is given in Aggarwal, C. C., C. Chen, J. W. Han, 2010, “The inverse classification problem”, Journal of Computer Science and Technology 25(3) 458-468, which provides values for the undefined variables of a test instance that result in a desired class. Barbella, D., S. Benzaid, J. M. Christensen, B. Jackson, X. V. Qin, D. R. Musicant, 2009, “Understanding support vector machine classifications via a recommender system-like approach”, search for explanations by determining the point on the decision boundary (hence named border classification) for which the Euclidean distance to the data instance to be explained is minimal.
Definition 4. Barbella, D., S. Benzaid, J. M. Christensen, B. Jackson, X. V. Qin, D. R. Musicant, 2009, “Understanding support vector machine classifications via a recommender system-like approach”, implicitly define an explanation of the classification of model M for data instance x0 as the m-dimensional real-valued input vector closest to x0 for which the predicted class is different from the predicted class of x0:
E_IC(M, x0) = e ∈ R^m = argmin_e Σ_{j=1..m} (e_j − x0_j)^2, subject to f(e) = 0.
Since finding the globally optimal solution is not feasible, a locally optimal solution is sought. The approach is applied to a medical data set with eight variables. The explanation provided shows a change in all variables; applying this to document classification is therefore again not useful. The authors describe the appropriateness for low-dimensional data only as follows: “our approach in the current form is most usable when the number of features of the data set is of a size that the user can eyeball all at once (perhaps 25-30 or so)” (Barbella, D., S. Benzaid, J. M. Christensen, B. Jackson, X. V. Qin, D. R. Musicant, 2009, “Understanding support vector machine classifications via a recommender system-like approach”).
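For a linear scorer, the border-classification idea of Definition 4 has a closed form: the input closest to x0 with f(e) = 0 is the orthogonal projection of x0 onto the decision hyperplane. The weight vector and instance below are hypothetical; for non-linear models such as SVMs, no such closed form exists, which is why the cited work resorts to constrained local optimization.

```python
import numpy as np

def closest_boundary_point(w, b, x0):
    """For a linear scorer f(x) = w.x + b, the point closest to x0
    (in Euclidean distance) satisfying f(e) = 0 is the orthogonal
    projection of x0 onto the hyperplane w.x + b = 0."""
    return x0 - ((w @ x0 + b) / (w @ w)) * w

# Hypothetical linear model and instance on the positive side:
w = np.array([1.0, 2.0])
b = -1.0
x0 = np.array([2.0, 2.0])                # f(x0) = 5 > 0
e = closest_boundary_point(w, b, x0)
print(e, w @ e + b)                       # e lies on the boundary, f(e) = 0
```

As in the quoted criticism, the resulting explanation e − x0 changes every variable, which is readable with two features but not with a document vocabulary of thousands of words.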
Exemplary Explanations and Statistical Classification Models
Explaining the decisions made by intelligent decision systems has received both practical and research attention for years. There are certain results from prior work that help to frame, motivate, and explain the specific gap in the current state of the art that this paper addresses. Before delving into the theoretical work, it may be beneficial to clarify the types of systems and explanations that are the focus of this paper.
Exemplary Model-based decision systems and instance-specific explanations
Starting as early as the celebrated MYCIN project of the 1970s, studying intelligent systems for infectious disease diagnosis (Buchanan and Shortliffe 1984), the ability of intelligent systems to explain their decisions was understood to be necessary for their effective use, and therefore was studied explicitly. Document classification systems are an instance of decision systems (DSs): systems that either (i) support and improve human decision making (as with the characterization of decision-support systems by Arnott, David. 2006. Cognitive biases and decision support systems development: a design science approach. Information Systems Journal 16(1) 55-78), or (ii) make decisions automatically, as with certain systems for credit scoring, fraud detection, targeted marketing, on-line advertising, web search, legal and medical document triage, and a host of other applications. An exemplary application of the exemplary embodiments of the present disclosure falls into the second category: a multitude of attempts to place advertisements are made each day, and the decision system needs to make each decision within a couple dozen milliseconds.
Such model-based decision systems have seen a steep increase in development and use over the past two decades (Rajiv D. Banker, Robert J. Kauffman. 2004. The evolution of research on information systems: A fiftieth-year survey of the literature in Management Science. Management Science 50(3) 281-298). Of particular interest are models produced by large-scale automated statistical predictive modeling systems, which Shmueli and Koppius argue should receive more attention in the IS literature, and for which generating explanations can be particularly problematic, as such data mining systems can build models using huge vocabularies. See Shmueli, G., O. R. Koppius. 2011. Predictive analytics in information systems research. MIS Quarterly 35(3) 553-572.
Different applications can impose different requirements for understanding. Consider three different application scenarios, both to add clarity in what follows and so that one of them can be ruled out. First, in some applications it can be important to understand every decision that the DS may possibly make. For example, for many applications of credit scoring (Martens, D., B. Baesens, T. Van Gestel, J. Vanthienen. 2007. Comprehensible credit scoring models using rule extraction from support vector machines. European Journal of Operational Research 183(3) 1466-1476), regulatory requirements stipulate that every decision be justifiable, and often this is required in advance of the official “acceptance” and implementation of the system. Similarly, a medical decision system may need to be completely transparent in this respect. The current prevailing interpretation of this requirement for complete transparency argues for a globally comprehensible predictive model. Indeed, in credit scoring generally the only models that are accepted are linear models with a small number of well-understood, intuitive variables. Such models are chosen even when non-linear alternatives are shown to give better predictive performance (Martens et al. 2007).
In contrast, consider applications where one should explain the specific reasons for some subset of the individual decisions (cf. the theoretical reasons for explanations summarized by Gregor, S., I. Benbasat. 1999. Explanations from intelligent systems: Theoretical foundations and implications for practice. MIS Quarterly 23(4) 497-530, discussed below). Often, this need for individual case explanations arises because particular decisions need to be justified after the fact, for example because a customer questions the decision or a developer is examining model performance on historical cases. Alternatively, a developer may be exploring decision-making performance by giving the system a set of theoretical test cases. In both scenarios, it is necessary for the system to provide explanations for specific individual cases. Individual case-specific explanations may also be sufficient in many applications; according to an exemplary embodiment of the present disclosure, it is of interest here only that they be necessary. Other examples in this second scenario include fraud detection (Fawcett and Provost 1997), many cases of targeted marketing, and all of the document classification applications listed in the first paragraph of this paper.
In a third exemplary application scenario, every decision that the system actually makes should be understood. This often is the case with a classical decision-support system, where the system is aiding a human decision maker, for example for forecasting (Gonul, M. Sinan, Dilek Onkal, Michael Lawrence. 2006. The effects of structural characteristics of explanations on use of a dss. Decision Support Systems 42 1481-1493) or auditing (Ye, L. R., P. E. Johnson. 1995. The impact of explanation facilities on user acceptance of expert systems advice. MIS Quarterly 19 157-172). For such systems, again, it is necessary to have individual case-specific explanations.
Exemplary Cognitive Perspectives on Model Explanations
Gregor and Benbasat (1999) provide a survey of empirical work on explanations from intelligent systems, presenting a unified theory drawing upon a cognitive-effort perspective, cognitive learning, and Toulmin's model of argumentation. They find that explanations are important to users when there is some specific reason and anticipated benefit, when an anomaly is perceived, or when there is an aim of learning. From the same perspective, an explanation can be given automatically (without any effort from the user to make it appear) and tailored to the specific context of the user, requiring even less cognitive effort as less extraneous information has to be read. According to this publication, explanations complying with these requirements lead to better performance, better user perceptions of the system, and possibly improved learning. Our design provides explanations for particular document classifications that can be useful precisely for these purposes.
Gregor and Benbasat's theoretical analysis brings to the fore three ideas that can be important. First, they introduce the reasons for explanations: to resolve perceived anomalies, a need to better grasp the inner workings of the intelligent system, or the desire for long-term learning. Second, they describe the type of explanations that should be provided: they emphasize the need not just for general explanations of the model, but for explanations that are context specific. Third, Gregor and Benbasat emphasize the need for “justification”-type explanations, which provide a justification for moving from the grounds to the claims. This is in contrast to rule-trace explanations—traditionally, the presentation of chains of rules, each with a data premise (grounds), certainty factor (qualifier) and conclusion (claim). In statistical predictive modeling, reasoning generally is shallow such that the prediction itself essentially is the rule-trace explanation. Specifically, the “trace” often entails simply the application of a mathematical function to the case data, with the result being a score representing the likelihood of the case belonging to the class of interest—with no justification of why.
There is little existing work on methods for explaining modern statistical models extracted from data that satisfy these latter two criteria, and possibly none that provide such explanations for the very high-dimensional models that are the focus of this paper.
An important subtlety that is not brought out explicitly by Gregor and Benbasat, but which is quite important in our contemporary context, is the difference between (i) an explanation intended to help the user understand how the world works, and thereby help with acceptance of the system, and (ii) an explanation of how the model works. In the latter case, which is our focus, the explanation can either help with acceptance or focus attention on the need for improving the model.
Kayande et al.'s Exemplary 3-Gap Framework
In order to examine more carefully why explanations are needed, and their impact on decision-model understanding, long-term learning, and improved decision making, it is possible to review a publication by Kayande, U., A. De Bruyn, G. L. Lilien, A. Rangaswamy, G. H. van Bruggen. 2009. How incorporating feedback mechanisms in a DSS affects DSS evaluations. Information Systems Research 20 527-546. This work focuses on the same context as our case study, specifically where data are voluminous, the link between decisions and outcomes is probabilistic, and the decisions are repetitive. They presume that it is highly unlikely that decision makers can consistently outperform model-based DSs in such contexts.
Prior work has suggested that when users do not understand the workings of the DS model, they will be very skeptical and reluctant to use the model, even if the model is known to improve decision performance; see, e.g., Umanath, N. S., I. Vessey. 1994. Multiattribute data presentation and human judgment: A cognitive fit perspective. Decision Sciences 25(5/6) 795-824; Limayem, M., G. De Sanctis. 2000. Providing decisional guidance for multicriteria decision making in groups. Information Systems Research 11(4) 386-401; Lilien, G. L., A. Rangaswamy, G. H. Van Bruggen, K. Starke. 2004. DSS effectiveness in marketing resource allocation decisions: Reality vs. perception. Information Systems Research 15 216-235; Arnold, V., N. Clark, P. A. Collier, S. A. Leech, S. G. Sutton. 2006. The differential use and effect of knowledge-based system explanations in novice and expert judgement decisions. MIS Quarterly 30(1) 79-97; and Kayande et al. (2009).
Further, decision makers likely need impetus to change their decision strategies (Todd, P. A., I. Benbasat. 1999. Evaluating the impact of DSS, cognitive effort, and incentives on strategy selection. Information Systems Research 10(4) 356-374), as well as guidance in making decisions (Mark S. Silver. 1991. Decisional guidance for computer-based decision support. MIS Quarterly 15(1) 105-122). Kayande et al. introduce a “3-gap” framework (see FIG. 1A) for understanding the use of explanations to improve decision making by aligning three different “models”: the user's model 120, the system's model 130, and reality 110. Their results show that guidance toward improved understanding of decisions, combined with feedback on the potential improvement achievable by the model, induces decision makers to align their mental models more closely with the decision model, leading to deep learning. This alignment reduces the corresponding gap (Gap 1), which in turn improves user evaluations of the DS. It is intuitive to argue that this then improves acceptance and increases use of the system. Under the authors' assumption that the DS's model is objectively better than the decision maker's (large Gap 3 compared to Gap 2), this would then lead to improved decision-making performance, cf. Todd and Benbasat (1999). Expectancy theory suggests that this will lead to higher usage and acceptance of the DS model, as users will be more motivated to actually use the DS if they believe that greater usage will lead to better performance (De Sanctis 1983).
Accordingly, there may be a need to address and/or overcome at least some of the deficiencies described herein above.