1. Field of the Invention
The present invention generally relates to the field of cost-sensitive learning in the areas of machine learning and data mining and, more particularly, to methods for solving multi-class cost-sensitive learning problems using a binary classification algorithm. This algorithm is based on techniques of data space expansion and gradient boosting with stochastic ensembles.
2. Background Description
Classification in the presence of varying costs associated with different types of misclassification is important for practical applications, including many data mining applications, such as targeted marketing, fraud and intrusion detection, among others. Classification is often idealized as a problem where every example is equally important, and the cost of misclassification is always the same. The real world is messier. Typically, some examples are much more important than others, and the cost of misclassifying in one way differs from the cost of misclassifying in another way. A body of work on this subject has become known as cost-sensitive learning, in the areas of machine learning and data mining.
Research in cost-sensitive learning falls into three main categories. The first category is concerned with making particular classifier learners cost-sensitive, including methods specific to decision trees (see, for example, U. Knoll, G. Nakhaeizadeh, and B. Tausend, “Cost-sensitive pruning of decision trees”, Proceedings of the Eighth European Conference on Machine Learning, pp. 383-386, 1994, and J. Bradford, C. Kunz, R. Kohavi, C. Brunk, and C. Brodley, “Pruning decision trees with misclassification costs”, Proceedings of the European Conference on Machine Learning, pp. 131-136, 1998), neural networks (see, for example, P. Geibel and F. Wysotzki, “Perceptron based learning with example dependent and noisy costs”, Proceedings of the Twentieth International Conference on Machine Learning, 2003), and support vector machines (see, for example, G. Fumera and F. Roli, “Cost-sensitive learning in support vector machines”, VIII Convegno Associazione Italiana per L'Intelligenza Artificiale, 2002). The second category uses Bayes risk theory to assign each example to its lowest expected cost class (see, for example, P. Domingos, “MetaCost: A general method for making classifiers cost sensitive”, Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, pp. 144-164, ACM Press, 1999, and D. Margineantu, Methods for Cost-Sensitive Learning, PhD thesis, Department of Computer Science, Oregon State University, Corvallis, 2001). This approach requires classifiers that output class membership probabilities and, when the costs are unknown at classification time, also requires estimating the costs (see, for example, B. Zadrozny and C. Elkan, “Learning and making decisions when costs and probabilities are both unknown”, Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, pp. 204-213, ACM Press, 2001).
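The lowest-expected-cost assignment used by the second category can be sketched as follows (a minimal illustration only; the function name and the cost-matrix convention, cost[i][j] for predicting class j when the true class is i, are our own choices):

```python
def min_expected_cost_class(probs, cost):
    """Return the class j minimizing the expected misclassification cost,
    given class membership probabilities probs[i] = P(class i | x) and a
    cost matrix cost[i][j] = cost of predicting j when the true class is i."""
    n = len(probs)
    # Expected cost of predicting each candidate class j.
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=expected.__getitem__)
```

Note that the prediction may differ from the most probable class: if misclassifying the rarer class is costly enough, the expected-cost-minimizing choice shifts toward it.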
The third category concerns methods that modify the distribution of training examples before applying the classifier learning method, so that the classifier learned from the modified distribution is cost-sensitive. We call this approach cost-sensitive learning by example weighting. Work in this area includes stratification methods (see, for example, P. Chan and S. Stolfo, “Toward scalable learning with non-uniform class and cost distributions”, Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 164-168, 1998, and L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth International Group, 1984) and the costing algorithm (see, for example, B. Zadrozny, J. Langford, and N. Abe, “Cost-sensitive learning by cost-proportionate example weighting”, Proceedings of the Third IEEE International Conference on Data Mining, pp. 435-442, 2003). This approach is very general, since it reuses arbitrary classifier learners and does not require accurate class probability estimates from the classifier. Empirically, this approach attains cost-minimization performance similar to or better than that of the other approaches.
Unfortunately, current methods in this category suffer from a major limitation: they are well-understood only for two-class problems. In the two-class case, it is easy to show that each example should be weighted proportionally to the difference in cost between predicting it incorrectly and predicting it correctly (see, again, Zadrozny et al., ibid.). In the multi-class case, however, there is more than one way in which a classifier can make a mistake, so this simple formula no longer applies. Heuristics, such as weighting examples by the average misclassification cost, have been proposed (see, again, Breiman et al., ibid., and the Margineantu thesis, ibid.), but they are not well-motivated theoretically and do not seem to work very well in practice when compared to methods that use Bayes risk minimization (see, again, Domingos, ibid.).
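The two-class weighting rule, together with the rejection-sampling realization used by costing-style methods, can be sketched as follows (a minimal illustration under our own naming and data-layout assumptions, not a description of the claimed invention):

```python
import random

def two_class_weights(costs):
    """Two-class weighting rule: each example's weight is the absolute
    difference between the cost of predicting it incorrectly and the
    cost of predicting it correctly.
    costs is a list of (cost_if_correct, cost_if_wrong) pairs."""
    return [abs(c_wrong - c_right) for c_right, c_wrong in costs]

def rejection_sample(examples, weights, rng=None):
    """Costing-style rejection sampling: keep example i with probability
    weights[i] / max(weights).  Any off-the-shelf (cost-insensitive)
    classifier learner can then be run on the accepted sample."""
    rng = rng or random.Random(0)
    z = max(weights)
    return [x for x, w in zip(examples, weights) if rng.random() < w / z]
```

As the multi-class discussion above notes, this weight is well-defined only when there is a single way to err per example; with three or more classes each example has several misclassification costs, and no single scalar weight captures them.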