Computer models have been widely applied in various industries to make predictions that improve business performance or mitigate risk. Such computer models usually produce a score (a numeric value) to predict the probability that a certain event will happen. Even though a model score by itself is sometimes enough for decision making, reason codes that explain why a certain case is assigned a high score by the model are desirable in certain business practices, including but not limited to credit risk scoring and credit card fraud detection. In the non-limiting example of credit risk scoring, the agency that provides such a score to customers is also required to provide the reasons why the score is not higher. In the non-limiting example of fraud detection, when reviewing the cases referred by the model as high risk, analysts need to understand why a transaction was referred so that their reviews can be more targeted and efficient.
Reason codes can be considered as the input variables to a model that contribute the highest fraction to the model score being high, or as a more descriptive form of such input variables. Methods to generate reason codes have been put forward previously for logistic regression and neural networks. For logistic regression, reason codes are typically generated by ranking the products of the input variables multiplied by their respective weights. A model score is produced by summing such products and then feeding the sum into a sigmoid function. The top-ranking input variables make the biggest contributions to the model score and hence become the reason codes. Another method was proposed to generate reason codes for credit risk scores, which calculates the maximum improvement of the score achievable by changing the value of one variable, which was called the “area of improvement.” The variables were then ranked by the “area of improvement,” and the top-ranked input variables were the reason codes. Although such a method may be applied to logistic regression and neural networks, it did not propose clearly how to find the change of input variables that obtains the maximum improvement.
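As a non-limiting illustration of the logistic regression approach described above, the following minimal sketch ranks input variables by the product of each variable and its weight and computes the score from the summed products. The function name, variable names, weights, and input values are all hypothetical and chosen only for illustration; they are not taken from any particular model.

```python
import math

def logistic_reason_codes(weights, bias, inputs, top_n=2):
    """Score one case and rank its input variables by contribution.

    weights, bias: hypothetical parameters of an already-trained model.
    inputs: mapping of variable name to value for a single case.
    """
    # Contribution of each variable is its weight times its value.
    contributions = {name: weights[name] * value for name, value in inputs.items()}
    # The score is the sigmoid of the bias plus the summed contributions.
    linear_sum = bias + sum(contributions.values())
    score = 1.0 / (1.0 + math.exp(-linear_sum))
    # Variables with the largest contributions become the reason codes.
    ranked = sorted(contributions, key=contributions.get, reverse=True)
    return score, ranked[:top_n]

# Hypothetical weights and one hypothetical transaction (assumed values).
weights = {"amount_zscore": 0.8, "foreign_merchant": 1.5, "account_age_years": -0.3}
bias = -2.0
case = {"amount_zscore": 2.5, "foreign_merchant": 1.0, "account_age_years": 4.0}

score, reasons = logistic_reason_codes(weights, bias, case)
# reasons -> ['amount_zscore', 'foreign_merchant']; score is about 0.574
```

In this sketch the contributions are 2.0, 1.5, and -1.2, so the two variables pushing the score highest are returned as the reason codes. This ranking relies on direct access to the weights of a single model.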
Many industrial applications of computer models require the model to generate reason codes, which are the input variables that have the biggest impact on the score of a model. In recent years, ensemble methods, such as bagging, boosting, random forest, or other methods that combine (for example, by averaging or some form of weighted summation) the outputs from multiple models into an ensemble model, have gained popularity in industrial applications due to their higher performance in prediction and classification compared with conventional single-model applications. As models become more complex, examining the structure of an ensemble model to generate its reason codes becomes impractical (even when each individual model is simple and its reason codes are easy to obtain) because such models are usually treated as black boxes due to their complex nature. Even for simple models such as logistic regression or decision trees, combining the reason codes from the individual models inside the ensemble model becomes a challenge. Many organizations opt for simpler models with lower performance just because it is difficult to generate reason codes for ensemble models. It is thus desirable to be able to treat the ensemble model as a black box and still effectively generate reason codes from the ensemble model under industrial settings.
The foregoing examples of the related art and limitations related therewith are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent upon a reading of the specification and a study of the drawings.