A classifier is a classification model which assigns an unclassified instance to one of a predefined set of classes. The classifier may be induced by using a learning algorithm (also known as an inducer), such as C4.5 [Quinlan, R. (1993). C4.5: “Programs for Machine Learning”. Machine Learning, 235-240.] or SVM [Boser, B. E., Guyon, I. M. & Vapnik, V. N. (1992): “A training algorithm for optimal margin classifiers”, 5th Annual ACM (pp. 144-152). ACM Press, Pittsburgh, Pa.]. Ensemble methodology combines multiple classifiers that work collectively, compensating for each other's weaknesses and generating better classifications through some kind of fusion strategy.
Meta-learning is a process of learning from learners (hereinafter also called classifiers). The training of a meta-classifier is composed of two or more stages, rather than the single stage used for standard learners. To induce a meta-classifier, the base classifiers are trained first, and then the meta-classifier is trained. In the prediction phase, the base classifiers output their classifications, and the meta-classifier(s) then make the final classification (as a function of the base classifiers' outputs).
Stacking is a technique for inducing which classifiers are reliable and which are not. Stacking is usually employed to combine models built by different inducers. The idea is to create a meta-dataset containing a tuple (an ordered set of values) for each tuple in the original dataset. However, instead of using the original input attributes, it uses the classifications predicted by the classifiers as the input attributes. The target attribute remains as in the original training set. A test instance is first classified by each of the base classifiers. These classifications are fed into a meta-level training set from which a meta-classifier is produced.
This classifier (also denoted Meta-classifier) combines the different predictions into a final one. It is recommended that the original dataset be partitioned into two subsets. The first subset is reserved to form the meta-dataset and the second subset is used to build the base-level classifiers. Consequently, the meta-classifier predictions reflect the true performance of the base-level learning algorithms. Stacking performance can be improved by using output probabilities for every class label from the base-level classifiers. It has been shown that with stacking, the ensemble performs (at best) comparably to selecting the best classifier from the ensemble by cross validation [Dzeroski S., Zenko B. (2004): “Is Combining Classifiers with Stacking Better than Selecting the Best One?” Machine Learning 54(3), (pp. 255-273).].
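The construction of the meta-dataset described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited work: the two toy inducers (`induce_prior`, `induce_threshold`), the single numeric attribute, and the data values are all hypothetical. The base classifiers are built from one subset, and the other subset is turned into meta-tuples whose attributes are the class probabilities output by the base classifiers, with the original target retained.

```python
from collections import Counter

CLASSES = ["a", "b"]

def induce_prior(train):
    """Hypothetical toy inducer: ignores the input and outputs the class frequencies."""
    counts = Counter(label for _, label in train)
    dist = [counts[c] / len(train) for c in CLASSES]
    return lambda x: dist

def induce_threshold(train):
    """Hypothetical toy inducer: splits one numeric attribute at the midpoint of the class means."""
    means = {}
    for c in CLASSES:
        values = [x for x, label in train if label == c]
        means[c] = sum(values) / len(values)
    cut = (means["a"] + means["b"]) / 2
    low, high = ("a", "b") if means["a"] < means["b"] else ("b", "a")

    def classify(x):
        predicted = low if x < cut else high
        return [1.0 if c == predicted else 0.0 for c in CLASSES]

    return classify

def build_meta_dataset(base_classifiers, held_out):
    """One meta-tuple per held-out tuple: base-level class probabilities plus the original target."""
    meta = []
    for x, label in held_out:
        features = []
        for clf in base_classifiers:
            features.extend(clf(x))
        meta.append((features, label))
    return meta

# Illustrative data: (attribute value, class label)
data = [(0.5, "a"), (1.0, "a"), (1.5, "a"), (3.0, "b"), (3.5, "b"), (4.0, "b"),
        (1.2, "a"), (3.2, "b")]
base_subset, meta_subset = data[:6], data[6:]

base_classifiers = [induce(base_subset) for induce in (induce_prior, induce_threshold)]
meta_dataset = build_meta_dataset(base_classifiers, meta_subset)
for features, label in meta_dataset:
    print(features, label)
```

A meta-classifier would then be induced from `meta_dataset` with any inducer. Keeping the two subsets disjoint is what lets the meta-level attributes reflect how the base classifiers behave on instances they were not trained on.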
StackingC is a Stacking variation. In empirical tests Stacking showed significant performance degradation for multi-class datasets. StackingC was designed to address this problem. In StackingC, each base classifier outputs only one class probability prediction [Seewald A. K. (2003): “Towards understanding stacking - Studies of a general ensemble learning scheme”, PhD-Thesis, TU Wien.]. Each base classifier is trained and tested upon one particular class, whereas Stacking uses output probabilities for all classes and from all component classifiers. FIGS. 1 to 4 show an illustration of Stacking and StackingC on a dataset with three classes (a, b and c), n examples, and N base classifiers. Pi,j,k refers to the probability given by base classifier i for class j on example number k.
FIG. 1 shows an example with three classes (a, b and c), n examples, and N base classifiers. It shows the original training set with its attribute vectors and class values. FIG. 2 shows how a class probability distribution of one sensible classifier may appear. The maximum probabilities are shown in italics and denote the classes which would be predicted for each example. There is one such set of class probability distributions for each base classifier. FIG. 3 shows the Meta training set for Stacking which is used to learn a Meta classifier that predicts the probability that class=a. Pi,j,k denotes the probability given by the base classifier i for class j on example number k. The classes are mapped to an indicator variable such that only class “a” is mapped to 1, and all other classes are mapped to 0. In this example, there are, of course, two other such training sets for class b and c which differ only in the last column and are thus not shown.
FIG. 4 shows the corresponding Meta training set for StackingC, which consists only of those columns from the original meta training set which are concerned with class=a, i.e., Pi,a,k for all i and k. While the Meta training sets for Stacking's Meta classifier differ only in the last attribute (the class indicator variable), those for StackingC have fewer attributes by a factor equal to the number of classes and also have no common attributes. This necessarily leads to more diverse linear models, which [Seewald A. K. (2003): “Towards understanding stacking - Studies of a general ensemble learning scheme”, PhD-Thesis, TU Wien.] believes to be one mechanism by which it outperforms Stacking. Another reason may simply be that with fewer attributes, the learning problem becomes easier to solve, provided only irrelevant information is removed. The dimensionality of the Meta dataset is reduced by a factor equal to the number of classes, which leads to faster learning. In comparison to other ensemble learning methods this improves Stacking's advantage further, making it the most successful system by a variety of measures.
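The difference between the two meta training sets can be sketched in a few lines. The probability values, the nested layout `P[i][j][k]` (base classifier i, class j, example k) and the function names are assumptions made for illustration only; they are not taken from the figures.

```python
classes = ["a", "b", "c"]

# P[i][j][k]: probability given by base classifier i for class j on example k
# (illustrative values for N = 2 base classifiers and n = 2 examples)
P = [
    {"a": [0.7, 0.1], "b": [0.2, 0.6], "c": [0.1, 0.3]},  # base classifier 0
    {"a": [0.6, 0.2], "b": [0.3, 0.3], "c": [0.1, 0.5]},  # base classifier 1
]
true_classes = ["a", "b"]  # original class value of each example

def stacking_row(P, classes, k):
    """Stacking: all N * len(classes) probabilities for example k."""
    return [P[i][j][k] for i in range(len(P)) for j in classes]

def stackingc_row(P, target_class, k):
    """StackingC: only the N probabilities concerned with target_class for example k."""
    return [P[i][target_class][k] for i in range(len(P))]

def meta_training_set(P, classes, true_classes, target_class, variant):
    """Meta training set for one class; the target is the class indicator variable."""
    rows = []
    for k, actual in enumerate(true_classes):
        if variant == "stacking":
            attributes = stacking_row(P, classes, k)
        else:
            attributes = stackingc_row(P, target_class, k)
        rows.append((attributes, 1 if actual == target_class else 0))
    return rows

stacking_a = meta_training_set(P, classes, true_classes, "a", "stacking")
stackingc_a = meta_training_set(P, classes, true_classes, "a", "stackingc")
print(stacking_a)
print(stackingc_a)
```

With three classes, each Stacking row carries six attributes while each StackingC row carries two, which is the reduction by a factor equal to the number of classes noted above.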
StackingC improves on Stacking in terms of significant accuracy differences, accuracy ratios, and runtime. These improvements are more evident for multi-class datasets and have a tendency to become more pronounced as the number of classes increases. StackingC also resolves the weakness of Stacking in the extension proposed by Ting and Witten in [Ting, K. M., Witten, I. H. (1999): Issues in stacked generalization. Journal of Artificial Intelligence Research 10, pages 271-289.] and offers a balanced performance on two-class and multi-class datasets.
Seewald in [Seewald A. K. (2003): “Towards understanding stacking - Studies of a general ensemble learning scheme”, PhD-Thesis, TU Wien.] has shown that all ensemble learning systems, including StackingC [Seewald, A. (2002): “How to Make Stacking Better and Faster While Also Taking Care of an Unknown Weakness”, Nineteenth International Conference on Machine Learning (pp. 554-561). Sydney: Morgan Kaufmann Publishers.], Grading [Seewald A. K. and J. Fuernkranz. (2001). An Evaluation of Grading Classifiers. Advances in Intelligent Data Analysis: 4th International Conference (pp. 115-124). Berlin/Heidelberg/New York/Tokyo: Springer.] and even Bagging [Breiman, L. (1996). Bagging predictors. Machine Learning, 123-140.] can be simulated by Stacking [Wolpert, D. (1992). Stacked Generalization. Neural Networks 5, 241-259. Boser, B E., Guyon, I. M. and Vapnik, V N. (1992) A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory ACM Press, Pittsburgh, Pa., pp. 144-152.]. To do this, he gives functionally equivalent definitions of most schemes as Meta-classifiers for Stacking. Dzeroski and Zenko in [Dzeroski S., Zenko B. (2004). Is Combining Classifiers with Stacking Better than Selecting the Best One? Machine Learning 54(3), (pp. 255-273).] indicated that the combination of SCANN [Merz C. J, and Murphy P. M., UCI Repository of machine learning databases. Irvine, Calif.: University of California, Department of Information and Computer Science, 1998.], which is a variant of Stacking, and MDT [Ting, K. M., Witten, I. H. (1999): Issues in stacked generalization. Journal of Artificial Intelligence Research 10, pages 271-289.], plus selecting the best base classifier using cross validation (SelectBest), seems to perform at about the same level as Stacking with Multi-response Linear Regression (MLR).
Seewald in [Seewald A. K. (2003). Towards understanding stacking—Studies of a general ensemble learning scheme. PhD-Thesis, TU Wien.] presented strong empirical evidence that Stacking in the extension proposed by Ting and Witten in [Ting, K. M., Witten, I. H. (1999): Issues in stacked generalization. Journal of Artificial Intelligence Research 10, pages 271-289.] performs worse on multi-class than on two-class datasets, for all but one meta-learner he investigated. The explanation given was that when the dataset has a higher number of classes, the dimensionality of the meta-level data is proportionally increased. This higher dimensionality makes it harder for meta-learners to induce good models, since there are more features to be considered. The increased dimensionality has two more drawbacks. First, it increases the training time of the Meta classifier; in many inducers this problem is acute. Second, it also increases the amount of memory which is used in the process of training. This may lead to insufficient resources, and therefore may limit the number of training cases (instances) from which an inducer may learn, thus damaging the accuracy of the ensemble.
During the learning phase of StackingC it is essential to use one-against-all class binarization and regression learners for each class model. This class binarization is believed to be a problematic method, especially when the class distribution is highly non-symmetric. It has been illustrated in [Fürnkranz, J. (2002). Pairwise Classification as an Ensemble Technique. European Conference on Machine Learning (pp. 97-110). Helsinki, Finland: Austrian Research Institute for Artificial Intelligence.] that handling many classes is a major problem for the one-against-all binarization technique, possibly because the resulting binary learning problems have increasingly skewed class distributions. An alternative to one-against-all class binarization is one-against-one binarization, in which the basic idea is to convert a multiple class problem into a series of two-class problems by training one classifier for each pair of classes, using only training examples of these two classes and ignoring all others. A new example is classified by submitting it to each of the k(k−1)/2 binary classifiers and combining their predictions (k being the number of classes in the multiple class problem). We have found in our preliminary experiments that this binarization method yields noticeably poor accuracy results when the number of classes in the problem increases. Later, after performing a much wider and broader experiment on StackingC in conjunction with the one-against-one binarization method, we came to this same conclusion. An explanation might be that, as the number of classes in a problem increases, so does the chance that any of the k(k−1)/2 base classifiers will give a wrong prediction. There are two reasons for this. First, when predicting the class of an instance, only k−1 out of the k(k−1)/2 classifiers may predict correctly. This is because only k−1 classifiers were trained on any specific class. We can see that as k increases, the percentage of classifiers which may classify correctly decreases, and descends practically to zero:

$$\lim_{k\to\infty}\frac{k-1}{\frac{k(k-1)}{2}}=\lim_{k\to\infty}\frac{2}{k}=0 \qquad (1)$$
The second reason is that one-against-one binarization uses only the instances of the two classes in each pair, while one-against-all uses all instances; the number of training instances for each base classifier is therefore much smaller in one-against-one binarization than in one-against-all. Thus, using the one-against-one binarization method may yield inferior base classifiers.
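Both reasons can be checked numerically with a short sketch. The function names and the class counts below are hypothetical and chosen only for illustration; the quantities computed are the number of pairwise classifiers k(k−1)/2, the fraction of them trained on any given class, and the training-set size seen by each binary classifier under the two binarization schemes.

```python
from itertools import combinations

def pairwise_problems(classes):
    """All unordered class pairs used by one-against-one binarization."""
    return list(combinations(classes, 2))

def relevant_fraction(k):
    """Fraction of pairwise classifiers trained on a given class: (k-1) / (k(k-1)/2) = 2/k."""
    total = k * (k - 1) // 2
    return (k - 1) / total

def training_set_sizes(class_counts):
    """Training instances per binary classifier for one-against-all and one-against-one."""
    n = sum(class_counts.values())
    one_against_all = {c: n for c in class_counts}  # every instance is used
    one_against_one = {
        (p, q): class_counts[p] + class_counts[q]   # only the two classes of the pair
        for p, q in pairwise_problems(sorted(class_counts))
    }
    return one_against_all, one_against_one

for k in (3, 5, 10, 100):
    print(k, len(pairwise_problems(range(k))), relevant_fraction(k))

# Hypothetical class counts for a four-class dataset
ova, ovo = training_set_sizes({"a": 50, "b": 30, "c": 15, "d": 5})
print(ova)
print(ovo)
```

For the hypothetical counts above, every one-against-all classifier sees all 100 instances, while the pairwise classifier for classes c and d sees only 20.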
There are several alternative ways to decompose a multiclass problem into binary subtasks. Lorena and de Carvalho in [Lorena A. and de Carvalho A. C. P. L. F.: Evolutionary Design of Code-matrices for Multiclass Problems, Soft Computing for Knowledge Discovery and Data Mining, Springer US, 153-184, 2007] survey all popular methods. The most straightforward method, which converts a k-class classification problem into k two-class classification problems, was proposed by Anand et al. in [Anand R, Mehrotra K, Mohan C K, Ranka S. Efficient classification for multiclass problems using modular neural networks. IEEE Trans Neural Networks, 6(1): 117-125, 1995]. Each problem considers the discrimination of one class from the other classes. Lu and Ito in [Lu B. L., Ito M., Task Decomposition and Module Combination Based on Class Relations: A Modular Neural Network for Pattern Classification, IEEE Trans. on Neural Networks, 10(5):1244-1256, 1999.] extend Anand's method and propose a new method for manipulating the data based on the class relations among the training data. By using this method, they divide a k-class classification problem into a series of k(k−1)/2 two-class problems, where each problem considers the discrimination of one class from one of the other classes. The researchers used neural networks to examine this idea. A general concept aggregation algorithm called Error-Correcting Output Coding (ECOC) uses a code matrix to decompose a multi-class problem into multiple binary problems [Dietterich, T. G., and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.]. The performance of ECOC for multi-class classification hinges on the design of the code matrix.
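The role of the code matrix in ECOC can be sketched as follows. The particular matrix (four classes, seven binary problems) is an illustrative assumption, and decoding by minimum Hamming distance is one common choice rather than the only one. Each column defines a binary relabeling of the training data, and a new example is assigned to the class whose codeword is closest to the vector of binary outputs.

```python
# Illustrative code matrix: one codeword (row) per class, one binary problem per column.
# Any two codewords differ in 4 positions, so a single wrong binary output is tolerated.
code_matrix = {
    "a": (1, 1, 1, 1, 1, 1, 1),
    "b": (0, 0, 0, 0, 1, 1, 1),
    "c": (0, 0, 1, 1, 0, 0, 1),
    "d": (0, 1, 0, 1, 0, 1, 0),
}

def binary_targets(code_matrix, column, labels):
    """Relabel multi-class labels as the 0/1 targets of one binary problem (one column)."""
    return [code_matrix[label][column] for label in labels]

def hamming(u, v):
    """Number of positions in which two codewords differ."""
    return sum(a != b for a, b in zip(u, v))

def decode(code_matrix, outputs):
    """Return the class whose codeword is closest to the binary classifiers' outputs."""
    return min(code_matrix, key=lambda c: hamming(code_matrix[c], outputs))

labels = ["a", "c", "d", "b"]
print(binary_targets(code_matrix, 2, labels))  # targets for the third binary problem

# Outputs equal to the codeword of class c, except that the first classifier is wrong
noisy_outputs = (1, 0, 1, 1, 0, 0, 1)
print(decode(code_matrix, noisy_outputs))
```

One-against-all is the special case in which the code matrix is the k×k identity matrix; in that case the codewords differ in only two positions, so no binary error can be corrected.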
Sivalingam et al. in [Sivalingam D., Pandian N., Ben-Arie J., Minimal Classification Method With Error-Correcting Codes For Multiclass Recognition, International Journal of Pattern Recognition and Artificial Intelligence 19(5): 663-680, 2005.] propose to transform a multiclass recognition problem into a minimal binary classification problem using the Minimal Classification Method (MCM) aided with error correcting codes. The MCM requires only log2 k classifications because, instead of separating only two classes at each classification, this method separates two groups of multiple classes. Thus the MCM requires a small number of classifiers and still provides similar accuracy performance.
Data-driven Error Correcting Output Coding (DECOC) [Zhou J., Peng H., Suen C., Data-driven decomposition for multi-class classification, Pattern Recognition 41: 67-76, 2008.] explores the distribution of data classes and optimizes both the composition and the number of base learners to design an effective and compact code matrix. Specifically, DECOC calculates the confidence score of each base classifier based on the structural information of the training data and uses the sorted confidence scores to assist in determining the code matrix of ECOC. The results show that the proposed DECOC is able to deliver competitive accuracy compared with other ECOC methods, using fewer base learners than the pairwise coupling (one vs. one) decomposition scheme.
It should be noted that finding new methods for converting multiclass classification problems into binary classification problems is not one of the goals of this paper. Still, our experimental study uses three different methods for this conversion.
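The count of log2 k classifications can be illustrated with a plain binary encoding of the class indices. This is a sketch of the counting argument only, not of the MCM itself: the error correcting codes used by the cited method are omitted, and rounding up to the next whole number of bits for values of k that are not powers of two is an assumption made here.

```python
def minimal_codewords(classes):
    """Assign each class a distinct binary codeword of the shortest possible length."""
    bits = max(1, (len(classes) - 1).bit_length())  # ceil(log2 k) for k >= 2
    return {
        c: tuple((index >> b) & 1 for b in reversed(range(bits)))
        for index, c in enumerate(classes)
    }

def class_groups(codewords, position):
    """The two groups of classes separated by the binary classifier at one bit position."""
    zeros = [c for c, word in codewords.items() if word[position] == 0]
    ones = [c for c, word in codewords.items() if word[position] == 1]
    return zeros, ones

def classifier_counts(k):
    """Number of binary classifiers needed by each decomposition for k classes."""
    return {
        "one_against_all": k,
        "one_against_one": k * (k - 1) // 2,
        "minimal": max(1, (k - 1).bit_length()),
    }

codewords = minimal_codewords(["a", "b", "c", "d", "e"])
print(codewords)
print(class_groups(codewords, 2))  # groups separated by the last bit
print(classifier_counts(8))
```

For eight classes the minimal encoding needs three binary classifiers, against eight for one-against-all and 28 for one-against-one.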
It is therefore a purpose of the present invention to provide a method and a system overcoming the limitations of the existing approaches.
It is another purpose of the present invention to provide a method and a system allowing efficient classification and prediction of dataset values.
It is yet another purpose of the present invention to provide a method and a system for performing efficient predictions based on data classification.
Further purposes and advantages of this invention will appear as the description proceeds.