Large margin or regularized classifiers such as support vector machines (SVMs), AdaBoost, and Maximum Entropy (Maxent), in particular Sequential L1-Regularized Maxent, are obvious choices for use in semantic classification. (For additional information on these classifiers see Vladimir N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998; see also Yoav Freund and Robert E. Schapire, Experiments with a New Boosting Algorithm, In Proceedings of ICML '96, pages 148-156, 1996; see also Miroslav Dudik, Steven Phillips, and Robert E. Schapire, Performance Guarantees for Regularized Maximum Entropy Density Estimation, In Proceedings of COLT '04, Banff, Canada, 2004, Springer Verlag.)
These linear classifiers are well known in the art: all three large margin or regularized classifiers offer learning processes that are scalable, with runtime implementations that may be very efficient. The algorithms give users three ways to train a linear classifier using very different frameworks. Each of these algorithms may be parallelized across multiple processors or clusters of computers to increase learning speed.
SVMs look for a separating hyperplane 1002 with a maximum separation margin 1010 between two classes, as shown in FIG. 1. A first set of classes 1006 and a second set of classes 1008 are separated by the hyperplane 1002. The hyperplane 1002 can be expressed as a weight vector. The margin is the minimum distance between the projections of the points of each class on the direction of the weight vector. The maximum margin 1010 is shown in FIG. 1 as the distance between the first set of classes 1006 and the second set of classes 1008. FIG. 1 represents hard margin SVMs, where classification errors on the training set are not permitted. For discussion purposes in the remaining description of the invention, SVMs are generalized as soft margin SVMs, which allow some errors, i.e., vectors that are inside or on the wrong side of the margin.
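The margin definition above can be made concrete with a small numeric sketch. The function name `separation_margin` and the toy points are illustrative assumptions, not part of the described system; the code only evaluates the margin of a given weight vector and does not train an SVM.

```python
import math

def separation_margin(w, positives, negatives):
    """Gap between the two classes after projecting onto the direction of w."""
    norm = math.sqrt(sum(wj * wj for wj in w))

    def project(x):
        # Signed length of x along the unit vector w / ||w||.
        return sum(wj * xj for wj, xj in zip(w, x)) / norm

    # Closest positive projection minus closest negative projection.
    return min(project(x) for x in positives) - max(project(x) for x in negatives)

# Hypothetical, linearly separable toy data.
w = [1.0, 1.0]
positives = [[2.0, 2.0], [3.0, 1.0]]
negatives = [[0.0, 0.0], [1.0, -1.0]]
margin = separation_margin(w, positives, negatives)
```

A non-positive value would indicate that the classes overlap along this direction, which is the situation that soft margin SVMs tolerate.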
AdaBoost incrementally refines a weighted combination of weak classifiers. AdaBoost selects at each iteration a feature k and computes, analytically or using line search, the weight $w_k$ that minimizes a loss function. In the process, the importance of examples that are still erroneous is “adaptively boosted”, as their probability in a distribution over the training examples that is initially uniform is increased. Given a training set associating a target class $y_i$ to each input vector $x_i$, the AdaBoost sequential algorithm looks for the weight vector w that minimizes the exponential loss (which is shown to bound the training error):
\[ C = \sum_{i=1}^{M} \exp\left(-y_i \, w^T x_i\right) \]
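As a rough illustration of the exponential loss and of one sequential update, the sketch below evaluates C and applies a closed-form step on a single feature. It assumes labels and feature values in {-1, +1}, for which the minimizing step has the analytic form 0.5 * log(W+/W-); the helper names and the toy data are hypothetical, and a general implementation would fall back to line search.

```python
import math

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def exponential_loss(w, xs, ys):
    """C = sum_i exp(-y_i * w^T x_i)."""
    return sum(math.exp(-y * dot(w, x)) for x, y in zip(xs, ys))

def training_errors(w, xs, ys):
    """Number of examples with a non-positive margin y_i * w^T x_i."""
    return sum(1 for x, y in zip(xs, ys) if y * dot(w, x) <= 0)

def coordinate_step(w, xs, ys, k):
    """Closed-form update of w[k]; assumes every x[k] is -1 or +1."""
    agree = disagree = 0.0
    for x, y in zip(xs, ys):
        d = math.exp(-y * dot(w, x))  # current (unnormalized) example weight
        if y * x[k] > 0:
            agree += d
        else:
            disagree += d
    # Assumes both sums are positive; otherwise the step is unbounded.
    updated = list(w)
    updated[k] += 0.5 * math.log(agree / disagree)
    return updated

# Hypothetical toy data with features and labels in {-1, +1}.
xs = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
ys = [1, 1, -1, 1]
w0 = [0.0, 0.0]
w1 = coordinate_step(w0, xs, ys, k=0)
```

Because exp(-m) is at least 1 whenever the margin m is non-positive, the loss is never smaller than the error count, which is the bound referred to in the text.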
AdaBoost also allows a log-loss model, where the goal is to maximize the log-likelihood of the training data, $\log \prod_i P(y_i \mid x_i)$. The posterior probability of observing a positive example is
\[ P(y_i = 1 \mid x_i) = \frac{1}{1 + \exp\left(-w^T x_i\right)} \]
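The log-loss model can be sketched in a few lines. The function names and the two-example data set are assumptions made for illustration; labels are taken to be +1 for positive and -1 for negative examples, so that a negative example has probability 1 - P(y = 1 | x).

```python
import math

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def posterior_positive(w, x):
    """P(y = 1 | x) = 1 / (1 + exp(-w^T x))."""
    return 1.0 / (1.0 + math.exp(-dot(w, x)))

def log_likelihood(w, xs, ys):
    """log prod_i P(y_i | x_i), computed as a sum of logs."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = posterior_positive(w, x)
        total += math.log(p if y == 1 else 1.0 - p)
    return total

# Hypothetical toy data: one positive and one negative example.
xs = [[1.0, 0.0], [-1.0, 0.0]]
ys = [1, -1]
baseline = log_likelihood([0.0, 0.0], xs, ys)
improved = log_likelihood([1.0, 0.0], xs, ys)
```

With all weights at zero every posterior is 0.5, so any weight vector that separates the examples in the right direction raises the log-likelihood above that baseline.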
Finally, Maxent relies on probabilistic modeling. In particular, one assumes that the classification problem is solved by looking for the class y that maximizes the conditional distribution, $\arg\max_y P(y \mid x)$. (See Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra, A maximum entropy approach to natural language processing, Computational Linguistics, 22(1):39-71, 1996.)
First, how well this distribution matches the training data is represented by constraints which state that features must have the same means under the empirical distribution (measured on the training data) and under the expected distribution (obtained after the training process). Second, this distribution must be as simple as possible. This can be represented as a constrained optimization problem: find the distribution over training samples with maximum entropy that satisfies the constraints. Using convex duality, one obtains the maximum likelihood as the loss function. The optimization problem is applied to a Gibbs distribution, which is exponential in a linear combination of the features:
\[ P(x) = \frac{\exp\left(w^T x\right)}{Z} \quad \text{with} \quad Z = \sum_{i=1}^{M} \exp\left(w^T x_i\right). \]
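A minimal sketch of the Gibbs distribution over M training samples follows, together with the feature means it induces (the quantities that the Maxent constraints compare against the empirical means). The function names and sample vectors are illustrative assumptions, and subtracting the largest score before exponentiating is a numerical-stability choice added here, not something stated in the text; it leaves the ratios unchanged.

```python
import math

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def gibbs_distribution(w, xs):
    """P(x_i) = exp(w^T x_i) / Z, with Z summing over all samples."""
    scores = [dot(w, x) for x in xs]
    shift = max(scores)  # cancels between numerator and Z
    unnormalized = [math.exp(s - shift) for s in scores]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

def expected_features(probs, xs):
    """Mean of each feature under the distribution probs."""
    return [sum(p * x[j] for p, x in zip(probs, xs)) for j in range(len(xs[0]))]

# Hypothetical sample vectors.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = gibbs_distribution([0.0, 0.0], xs)  # zero weights give maximum entropy
tilted = gibbs_distribution([2.0, 0.0], xs)   # weight on feature 0 shifts mass
```

With zero weights the distribution is uniform; increasing the weight of a feature moves probability mass toward samples where that feature is active, which in turn raises that feature's expected value.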
Sequential L1-Regularized Maxent is one of the fastest and most recent algorithms to estimate Maxent models. It offers a sequential-update algorithm which is particularly efficient on sparse data and allows the addition of L1-Regularization to better control generalization performance.
AdaBoost, which is a large margin classifier and implicitly regularized, and Maxent, which is explicitly regularized, have also been shown, both in theory and experimentally, to generalize well in the presence of a large number of features. Regularization favors simple models by penalizing large or non-zero parameters. This property allows the generalization error, i.e., the error on test data, to be bounded by quantities which are nearly independent of the number of features, both in the case of AdaBoost and Maxent. This is why a large number of features, and consequently a large number of classifier parameters, do not cause the type of overfitting (i.e., learning the training data by heart) that used to be a major problem with traditional classifiers.
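The effect of an explicit L1 penalty can be illustrated with a short sketch. The names `l1_penalty` and `penalized_objective`, the regularization strength `beta`, and the example weight vectors are assumptions chosen for illustration; this is not the Sequential L1-Regularized Maxent update itself, only the form of objective that such a penalty produces.

```python
def l1_penalty(w, beta):
    """beta times the sum of absolute parameter values."""
    return beta * sum(abs(wj) for wj in w)

def penalized_objective(data_loss, w, beta):
    """Training loss plus the L1 penalty on the parameters."""
    return data_loss + l1_penalty(w, beta)

# Two hypothetical weight vectors assumed to fit the training data equally well.
same_loss = 3.0
dense = [1.0, 0.4, -0.3, 0.2]
sparse = [1.0, 0.0, 0.0, 0.0]
dense_objective = penalized_objective(same_loss, dense, beta=0.5)
sparse_objective = penalized_objective(same_loss, sparse, beta=0.5)
```

When two models explain the training data equally well, the penalized objective is lower for the one with fewer or smaller non-zero parameters, which is the sense in which regularization favors simple models.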
All of the above algorithms have shown excellent performance on large scale problems. AdaBoost and Maxent are particularly efficient when the data is sparse, that is, when the proportion of non-zero input features is small. Their learning time scales linearly as a function of the number of training samples N. Support vector machines, whose classification capacity is more powerful, are in theory slower to train, with a learning time that scales quadratically in N. A recent breakthrough has considerably improved SVM speed on sparse data. (See Haffner, Fast transpose methods for sparse kernel learning, Presented at NIPS '05 workshop on Large Scale Kernel Machines, 2005.)
However, these computationally expensive learning algorithms cannot always handle the very large number of examples, features, and classes present in the training corpora that are available. For example, in large scale natural language processing, network monitoring, and mining applications, data available to train these algorithms can represent millions of examples and thousands of classes. With this amount of data, learning times may become excessive.
Furthermore, large margin classifiers were initially demonstrated on binary classification problems, where the definition of the margin is unambiguous and its impact on generalization is reasonably well understood. The simplest multiclass classification scheme is to train C binary classifiers, each trained to distinguish the examples belonging to one class from examples not belonging to this class. The situation where each example can only belong to one class leads to training one class versus all other classes. That is why this scheme is referred to as 1-vs-other or 1-vs-all. However, many other combinations of binary classifiers are possible, in particular 1-vs-1 (also called all-vs-all) where each classifier is trained to separate a pair of classes.
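The two multiclass schemes described above can be sketched as follows. The helper names, the class labels, and the weight vectors are hypothetical; the sketch only shows how the binary problems are formed and how a 1-vs-all decision could be taken by picking the highest-scoring linear classifier, which is one common convention rather than the only one.

```python
from itertools import combinations

def one_vs_all_targets(labels, classes):
    """For each class c, relabel examples as +1 (is c) or -1 (any other class)."""
    return {c: [1 if y == c else -1 for y in labels] for c in classes}

def one_vs_one_pairs(classes):
    """Every unordered pair of classes; each pair gets its own binary classifier."""
    return list(combinations(classes, 2))

def predict_one_vs_all(weights, x):
    """Pick the class whose linear classifier gives the highest score w_c^T x."""
    return max(weights, key=lambda c: sum(wj * xj for wj, xj in zip(weights[c], x)))

# Hypothetical labels and per-class weight vectors.
labels = ["billing", "support", "billing", "sales"]
classes = sorted(set(labels))
targets = one_vs_all_targets(labels, classes)
pairs = one_vs_one_pairs(classes)
weights = {"billing": [1.0, 0.0], "sales": [0.0, 1.0], "support": [-1.0, -1.0]}
predicted = predict_one_vs_all(weights, [0.2, 0.9])
```

With C classes, 1-vs-all trains C binary classifiers while 1-vs-1 trains C(C-1)/2, so the choice of scheme directly affects learning time when the number of classes is large.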
Current methods used to improve learning speed focus on a single binary classifier. However, these methods do not guarantee a globally optimal solution for different classifiers. Furthermore, current runtimes are unacceptable when processing a large number of examples with numerous classes. The run time of a deployed system should be on the order of milliseconds. A manageable learning time should be on the order of hours, with memory requirements not exceeding a few gigabytes. To remain within these constraints, one is often led to choose sub-optimal solutions.
Therefore, there is a need in the art for methods in which learning times are improved for large margin classifiers. Such methods may be used in interactive voice response systems or other systems requiring the handling of a large number of examples or features.