1. Technical Field
The present invention relates to support vector machines for machine learning and more particularly to a generalized Sequential Minimal Optimization (SMO) system and method for solving the support vector machine optimization problem.
2. Description of the Related Art
Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. SVMs belong to a family of generalized linear classifiers. A special property of SVMs is that they simultaneously minimize empirical classification error and maximize geometric margin; hence they are also known as maximum margin classifiers.
Support vector machines map input vectors to a higher dimensional space where a maximal separating hyperplane is constructed. Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data. The separating hyperplane is the hyperplane that maximizes the distance between the two parallel hyperplanes. An assumption is made that the larger the margin or distance between these parallel hyperplanes, the lower the generalization error of the classifier will be.
Data is classified as a part of a machine-learning process. Each data point is represented by a p-dimensional vector (a list of p numbers). Each of these data points belongs to only one of two classes. We are interested in whether we can separate them with a “p minus 1” dimensional hyperplane. This is a typical form of a linear classifier. There are many linear classifiers that might satisfy this property. However, we are additionally interested in finding out if we can achieve maximum separation (margin) between the two classes. By this we mean that we pick the hyperplane so that the distance from the hyperplane to the nearest data point is maximized. That is to say that the nearest distance between a point in one separated hyperplane and a point in the other separated hyperplane is maximized. Now, if such a hyperplane exists, it is clearly of interest and is known as the maximum-margin hyperplane and such a linear classifier is known as a maximum margin classifier.
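The notion of margin can be illustrated with a short Python sketch. The function name `geometric_margin` and the toy data are hypothetical and serve only to show that, among hyperplanes that separate the same data, some leave a larger distance to the nearest point than others.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labelled point to the hyperplane (w, x) + b = 0.

    A positive result means every point lies on the side matching its label.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Toy 2-D data (illustrative only): two classes separated along the first axis.
points = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
labels = [1, 1, -1, -1]

# Both hyperplanes separate the data, but the first leaves the larger margin.
margin_a = geometric_margin((1.0, 0.0), 0.0, points, labels)
margin_b = geometric_margin((1.0, 0.0), 1.0, points, labels)
```

A maximum margin classifier would prefer the first hyperplane here, since its nearest data point is farther away.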
Recently, a generalization of a support vector machine (SVM) technique, called support vector machine plus (SVM+), was proposed by V. Vapnik, in Estimation of Dependences Based on Empirical Data: Empirical Inference Science, Springer, 2006. The SVM+ approach is designed to take advantage of structure in training data (for example, noise present in data, or invariants in the data). By leveraging this structure, the SVM+ technique can have a better generalization by lowering the overall system's VC-dimension.
While multiple methods for training SVM have been proposed (the leading one being Sequential Minimal Optimization (SMO)), there are no available methods for training SVM+.
SVM and Its Computation Using SMO: In 1995, the SVM method for constructing an optimal hyperplane for non-separable data was introduced (see C. Cortes, V. Vapnik, “Support vector networks,” Machine Learning, vol. 20, pp. 273-297, 1995). The method deals with the following problem.
Given training data $(\vec{x}_1, y_1), \ldots, (\vec{x}_\ell, y_\ell)$, with labels $y \in \{-1, 1\}$ and input vectors $\vec{x}$, find the parameters $\vec{w}$ and $b$ of the hyperplane $(\vec{w}, \vec{x}) + b = 0$ that separate the data $\{(\vec{x}_1, y_1), \ldots, (\vec{x}_\ell, y_\ell)\}$ (perhaps with some errors) and minimize the functional:
$$R = (\vec{w}, \vec{w}) + C \sum_{i=1}^{\ell} \xi_i$$
under the constraints: $y_i[(\vec{w}, \vec{x}_i) + b] \geq 1 - \xi_i$, $\xi_i \geq 0$, $i = 1, \ldots, \ell$.
Here the slack variables $\xi_i$ characterize the values of the training errors, while $C$ is the penalty on these errors in the functional. Using standard techniques, this problem can be converted to the dual form, which requires maximizing the functional:
$$W = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j (\vec{x}_i, \vec{x}_j)$$
over the parameters $\alpha_i$ (Lagrange multipliers), subject to the constraints:
$$\sum_{i=1}^{\ell} y_i \alpha_i = 0, \quad 0 \leq \alpha_i \leq C.$$
The desired separating hyperplane has the form
$$\sum_{i=1}^{\ell} y_i \alpha_i (\vec{x}_i, \vec{x}) + b = 0$$
where the parameters $\alpha_1, \ldots, \alpha_\ell$ and $b$ are the solution of the above optimization problem.
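The primal functional, the dual functional and the resulting hyperplane can be evaluated directly from their definitions. The following Python sketch is illustrative only: the function names and the two-point data set are assumptions, and the slack is taken as the smallest value satisfying the constraints, $\xi_i = \max(0, 1 - y_i[(\vec{w}, \vec{x}_i) + b])$.

```python
def dot(u, v):
    """Inner product (u, v)."""
    return sum(a * b for a, b in zip(u, v))

def primal_objective(w, b, xs, ys, C):
    """R = (w, w) + C * sum_i xi_i, using the smallest feasible slack for each point."""
    slacks = [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)]
    return dot(w, w) + C * sum(slacks)

def dual_objective(alpha, xs, ys):
    """W = sum_i alpha_i - 1/2 sum_{i,j} y_i y_j alpha_i alpha_j (x_i, x_j)."""
    n = len(alpha)
    quad = sum(
        ys[i] * ys[j] * alpha[i] * alpha[j] * dot(xs[i], xs[j])
        for i in range(n) for j in range(n)
    )
    return sum(alpha) - 0.5 * quad

def is_feasible(alpha, ys, C, tol=1e-9):
    """Check sum_i y_i alpha_i = 0 and 0 <= alpha_i <= C."""
    balanced = abs(sum(y * a for y, a in zip(ys, alpha))) <= tol
    boxed = all(-tol <= a <= C + tol for a in alpha)
    return balanced and boxed

def decision_value(x, alpha, b, xs, ys):
    """Left-hand side of the hyperplane equation: sum_i y_i alpha_i (x_i, x) + b."""
    return sum(y * a * dot(xi, x) for xi, y, a in zip(xs, ys, alpha)) + b

# Toy data (illustrative only): one point per class.
xs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1, -1]
alpha = [0.5, 0.5]  # feasible multipliers for this toy problem

W = dual_objective(alpha, xs, ys)
side = decision_value((2.0, 0.0), alpha, 0.0, xs, ys)  # positive: class +1 side
```

For this toy problem, restricting to $\alpha_1 = \alpha_2 = a$ gives $W = 2a - 2a^2$, which is largest at $a = 0.5$, the value used above.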
Generally, one cannot expect that the data $\{(\vec{x}_1, y_1), \ldots, (\vec{x}_\ell, y_\ell)\}$ can be separated by a linear function (hyperplane). In B. Boser, I. Guyon, V. Vapnik, “A Training Algorithm for Optimal Margin Classifiers,” in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Vol. 5, pp. 144-152, 1992, it was shown how the so-called “kernel trick” can be used for constructing wide classes of nonlinear separating functions. To employ the kernel trick, one maps input vectors $\vec{x} \in X$ (original feature space) into vectors $\vec{z} \in Z$ (new feature space, image space), where one constructs the separating hyperplane
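$$\sum_{i=1}^{\ell} y_i \alpha_i (\vec{z}_i, \vec{z}) + b = 0$$
with parameters $\alpha_1, \ldots, \alpha_\ell$ and $b$ that maximize the functional $W$ given above, with the inner products $(\vec{x}_i, \vec{x}_j)$ replaced by $(\vec{z}_i, \vec{z}_j)$, subject to the constraints
$$\sum_{i=1}^{\ell} y_i \alpha_i = 0, \quad 0 \leq \alpha_i \leq C.$$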
According to Mercer's theorem (as explained in Boser et al.), for any inner product $(\vec{z}_i, \vec{z}_j)$ in the image space $Z$ there exists a positive definite function $K(\vec{x}_i, \vec{x}_j)$ in space $X$ such that $(\vec{z}_i, \vec{z}_j) = K(\vec{x}_i, \vec{x}_j)$, $i, j = 1, \ldots, \ell$.
Conversely, for any positive definite function $K(\vec{x}_i, \vec{x}_j)$ in the space $X$, there exists a space $Z$ such that $K(\vec{x}_i, \vec{x}_j)$ forms an inner product $(\vec{z}_i, \vec{z}_j)$ in space $Z$. Therefore, to construct a nonlinear separating function
$$\sum_{i=1}^{\ell} y_i \alpha_i K(\vec{x}_i, \vec{x}) + b = 0$$
in the image space $Z$, one has to maximize the functional:
$$W = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j K(\vec{x}_i, \vec{x}_j)$$
subject to the constraints:
$$\sum_{i=1}^{\ell} y_i \alpha_i = 0, \quad 0 \leq \alpha_i \leq C.$$
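The correspondence between a kernel in $X$ and an inner product in $Z$ can be checked numerically for a simple case. The Python sketch below assumes a homogeneous quadratic kernel on two-dimensional inputs, chosen purely for illustration; the function names are hypothetical.

```python
import math

def quadratic_kernel(u, v):
    """K(u, v) = (u, v)^2 for 2-D inputs, evaluated in the original space X."""
    return (u[0] * v[0] + u[1] * v[1]) ** 2

def feature_map(u):
    """Explicit image-space vector z whose inner products reproduce quadratic_kernel."""
    return (u[0] ** 2, math.sqrt(2.0) * u[0] * u[1], u[1] ** 2)

def inner(z1, z2):
    """Inner product (z1, z2) in the image space Z."""
    return sum(a * b for a, b in zip(z1, z2))

def kernel_decision_value(x, alpha, b, xs, ys, kernel):
    """Left-hand side of the separating function: sum_i y_i alpha_i K(x_i, x) + b."""
    return sum(y * a * kernel(xi, x) for xi, y, a in zip(xs, ys, alpha)) + b

u, v = (1.0, 2.0), (3.0, -1.0)
k_value = quadratic_kernel(u, v)                 # computed without leaving X
z_value = inner(feature_map(u), feature_map(v))  # computed explicitly in Z
```

The two values agree up to rounding, so the separating function can be evaluated through $K$ alone, without ever constructing the vectors $\vec{z}$.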
This problem is a special form of a quadratic optimization problem, where the constraints consist of one equality constraint and $\ell$ box constraints. As has been demonstrated, such a problem can be solved much more efficiently than a general quadratic programming problem (see details in D. Bertsekas, Convex Analysis and Optimization, Athena Scientific, 2003).
J. Platt, in “Fast Training of Support Vector Machines using Sequential Minimal Optimization,” in Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, eds., pp. 185-208, MIT Press, 1999, proposed one of the most efficient algorithms for solving this problem, the so-called sequential minimal optimization (SMO) algorithm. The idea is to solve the optimization problem sequentially: in each step, a pair of Lagrange multipliers $\alpha_i$, $\alpha_j$ is selected and the functional is maximized over them while the rest of the Lagrange multipliers are kept fixed. The resulting two-dimensional optimization problem has a closed-form solution, which can be found extremely fast. By choosing appropriate pairs of Lagrange multipliers at each step, the SMO algorithm finds the desired solution quickly. The SMO algorithm made the SVM method extremely efficient for problems involving large amounts of data in high-dimensional spaces.
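The pairwise update can be sketched in Python as follows. This is a simplified illustration, not Platt's full algorithm: the function name `smo_train`, the pair-selection rule (the first partner that makes progress, rather than Platt's heuristics), the tolerances and the sweep cap are all assumptions made for the example.

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def smo_train(xs, ys, C, kernel, tol=1e-3, max_sweeps=100):
    """Simplified SMO: repeatedly optimize one pair (alpha_i, alpha_j) in closed form."""
    n = len(xs)
    alpha, b = [0.0] * n, 0.0
    K = [[kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)]

    def f(i):
        # Current value of the separating function at training point i.
        return sum(alpha[k] * ys[k] * K[k][i] for k in range(n)) + b

    for _ in range(max_sweeps):
        changed = 0
        for i in range(n):
            E_i = f(i) - ys[i]
            # Only work on multipliers that violate the optimality conditions.
            if not ((ys[i] * E_i < -tol and alpha[i] < C) or
                    (ys[i] * E_i > tol and alpha[i] > 0)):
                continue
            for j in range(n):
                if j == i:
                    continue
                E_j = f(j) - ys[j]
                ai_old, aj_old = alpha[i], alpha[j]
                # Segment of the box [0, C]^2 on which sum_k y_k alpha_k stays fixed.
                if ys[i] != ys[j]:
                    lo, hi = max(0.0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    lo, hi = max(0.0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                eta = 2.0 * K[i][j] - K[i][i] - K[j][j]
                if lo == hi or eta >= 0:
                    continue
                # Closed-form maximizer along the segment, clipped to its ends.
                aj = min(hi, max(lo, aj_old - ys[j] * (E_i - E_j) / eta))
                if abs(aj - aj_old) < 1e-8:
                    continue
                ai = ai_old + ys[i] * ys[j] * (aj_old - aj)
                # Update the threshold so the changed points satisfy their conditions.
                b1 = b - E_i - ys[i] * (ai - ai_old) * K[i][i] - ys[j] * (aj - aj_old) * K[i][j]
                b2 = b - E_j - ys[i] * (ai - ai_old) * K[i][j] - ys[j] * (aj - aj_old) * K[j][j]
                b = b1 if 0 < ai < C else b2 if 0 < aj < C else (b1 + b2) / 2.0
                alpha[i], alpha[j] = ai, aj
                changed += 1
                break
        if changed == 0:
            break
    return alpha, b

# Toy usage (illustrative data): one point per class.
alpha, b = smo_train([(1.0, 0.0), (-1.0, 0.0)], [1, -1], C=10.0, kernel=linear_kernel)
```

Each pair update keeps $\sum_i y_i \alpha_i = 0$ and $0 \leq \alpha_i \leq C$ satisfied, which is why no general-purpose quadratic programming solver is needed.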
SVM+ as an Extension of SVM: In the book, V. Vapnik, Estimation of Dependences Based on Empirical Data: Empirical Inference Science, Springer, 2006, and the technical report, V. Vapnik, M. Miller, “SVM+: A new learning machine that considers the global structure of the data,” NECLA TR 2005-L141, 2005, a generalization of the SVM method of constructing separating functions, the so-called SVM+ method, was introduced. The idea of the generalization is the following. Consider the slack variables $\xi_i$ in the form $\xi_i = \psi(\vec{x}_i, \delta)$, $\delta \in D$, where $\psi(\vec{x}_i, \delta)$ belongs to some admissible set of functions (we call them correcting functions). In classical SVM, the slacks $\xi_i$ can take arbitrary values. By introducing slacks that are a realization of one of the admissible functions, we try to reduce the overall VC dimension; improve the generalization quality of the decision rule; and introduce a richer training environment: instead of an oracle (providing just +1 or −1 for classification $y_i$), we can exploit the teacher's input (providing hidden knowledge on classification errors $\xi_i$).
Note that the extra hidden information is used only during training, not in actual testing. In SVM+, we map any input vector $\vec{x}$ into two different spaces: the space $\vec{z} \in Z$, as in the SVM method (called the space of decision functions), and another space $\vec{z}^{+} \in Z^{+}$ (called the space of correcting functions), defining our admissible set of functions as follows: $\psi(\vec{x}, \delta) = (\vec{w}^{+}, \vec{z}^{+}) + d$, where $\vec{w}^{+} \in Z^{+}$ and $d \in \mathbb{R}$. Therefore, given the training data and the two spaces $Z$ and $Z^{+}$, we define triplets $(y_1, \vec{z}_1, \vec{z}_1^{+}), \ldots, (y_\ell, \vec{z}_\ell, \vec{z}_\ell^{+})$.
Our goal is to construct the separating hyperplane $(\vec{w}, \vec{z}) + b = 0$ in the decision space $Z$, subject to the constraints
$$\begin{cases} y_i[(\vec{w}, \vec{z}_i) + b] \geq 1 - ((\vec{w}^{+}, \vec{z}_i^{+}) + d), \\ (\vec{w}^{+}, \vec{z}_i^{+}) + d \geq 0, \end{cases}$$
that minimizes the functional
$$R = C \sum_{i=1}^{\ell} ((\vec{w}^{+}, \vec{z}_i^{+}) + d) + (\vec{w}, \vec{w}) + \gamma (\vec{w}^{+}, \vec{w}^{+}).$$
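The primal SVM+ problem can be evaluated for a candidate solution as in the Python sketch below. The function name `svm_plus_primal` and the toy vectors are hypothetical; the sketch only checks the two constraints and computes $R$, with each slack given by the correcting function value $(\vec{w}^{+}, \vec{z}_i^{+}) + d$.

```python
def dot(u, v):
    """Inner product (u, v)."""
    return sum(a * b for a, b in zip(u, v))

def svm_plus_primal(w, b, w_plus, d, zs, zs_plus, ys, C, gamma):
    """Return (R, feasible) for a candidate SVM+ primal solution.

    The slack for example i is the correcting function value (w_plus, z_i_plus) + d.
    """
    slacks = [dot(w_plus, zp) + d for zp in zs_plus]
    feasible = all(
        s >= 0 and y * (dot(w, z) + b) >= 1 - s
        for s, z, y in zip(slacks, zs, ys)
    )
    R = C * sum(slacks) + dot(w, w) + gamma * dot(w_plus, w_plus)
    return R, feasible

# Toy 1-D data (illustrative only); the third point sits inside the margin.
zs = [(2.0,), (-2.0,), (0.5,)]
ys = [1, -1, 1]
# Correcting-space vectors: only the third point is flagged as "hard".
zs_plus = [(0.0,), (0.0,), (1.0,)]

R, feasible = svm_plus_primal((1.0,), 0.0, (0.5,), 0.0, zs, zs_plus, ys, C=2.0, gamma=4.0)
```

In this toy case the correcting function supplies a slack of 0.5 for the third point only, which is exactly what that point needs to satisfy its constraint.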
The dual form solution to this problem is to maximize the functional
$$W = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j (\vec{z}_i, \vec{z}_j) - \frac{1}{2\gamma} \sum_{i,j=1}^{\ell} (\alpha_i + \beta_i - C)(\alpha_j + \beta_j - C)(\vec{z}_i^{+}, \vec{z}_j^{+})$$
subject to the constraints:
$$\begin{cases} \alpha_i \geq 0, \quad i = 1, \ldots, \ell, \\ \beta_i \geq 0, \quad i = 1, \ldots, \ell, \\ \sum_{i=1}^{\ell} y_i \alpha_i = 0, \\ \sum_{i=1}^{\ell} (\alpha_i + \beta_i - C) = 0. \end{cases}$$
The solution (over parameters α, β, b, d) defines the separating function (in the decision space Z)
$$\sum_{i=1}^{\ell} y_i \alpha_i (\vec{z}_i, \vec{z}) + b = 0;$$
the correcting function has the form:
$$\psi(z) = \frac{1}{\gamma} \sum_{i=1}^{\ell} (\alpha_i + \beta_i - C)(\vec{z}_i^{+}, \vec{z}^{+}) + d.$$
Using the same kernel trick for the two different spaces $Z$ and $Z^{+}$, and denoting by $K(\vec{x}_i, \vec{x}_j)$ and $K^{+}(\vec{x}_i, \vec{x}_j)$ the corresponding kernels for these spaces (we can call them the decision space kernel and the correction space kernel), we can formulate the SVM+ problem as follows. Maximize the functional:
$$W = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j K(\vec{x}_i, \vec{x}_j) - \frac{1}{2\gamma} \sum_{i,j=1}^{\ell} (\alpha_i + \beta_i - C)(\alpha_j + \beta_j - C) K^{+}(\vec{x}_i^{+}, \vec{x}_j^{+})$$
subject to the constraints:
$$\begin{cases} \alpha_i \geq 0, \quad i = 1, \ldots, \ell, \\ \beta_i \geq 0, \quad i = 1, \ldots, \ell, \\ \sum_{i=1}^{\ell} y_i \alpha_i = 0, \\ \sum_{i=1}^{\ell} (\alpha_i + \beta_i - C) = 0. \end{cases}$$
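The kernelized SVM+ dual and its constraints can be written down directly, as in the Python sketch below. The function names, the choice of linear kernels for both spaces, and the toy data are assumptions made for illustration.

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def svm_plus_dual_objective(alpha, beta, xs, xs_plus, ys, C, gamma, kernel, kernel_plus):
    """W for SVM+: the SVM dual minus a correcting-space term scaled by 1 / (2 * gamma)."""
    n = len(alpha)
    decision_term = sum(
        ys[i] * ys[j] * alpha[i] * alpha[j] * kernel(xs[i], xs[j])
        for i in range(n) for j in range(n)
    )
    delta = [a + bt - C for a, bt in zip(alpha, beta)]  # alpha_i + beta_i - C
    correcting_term = sum(
        delta[i] * delta[j] * kernel_plus(xs_plus[i], xs_plus[j])
        for i in range(n) for j in range(n)
    )
    return sum(alpha) - 0.5 * decision_term - correcting_term / (2.0 * gamma)

def svm_plus_is_feasible(alpha, beta, ys, C, tol=1e-9):
    """Check alpha_i >= 0, beta_i >= 0, sum y_i alpha_i = 0, sum (alpha_i + beta_i - C) = 0."""
    nonneg = all(a >= -tol for a in alpha) and all(bt >= -tol for bt in beta)
    balanced = abs(sum(y * a for y, a in zip(ys, alpha))) <= tol
    budget = abs(sum(a + bt - C for a, bt in zip(alpha, beta))) <= tol
    return nonneg and balanced and budget

# Toy data (illustrative only): decision-space and correcting-space inputs.
xs = [(1.0, 0.0), (-1.0, 0.0)]
xs_plus = [(1.0,), (2.0,)]
ys = [1, -1]
alpha = [0.5, 0.5]

# With alpha_i + beta_i = C for every i, the correcting term vanishes.
W_flat = svm_plus_dual_objective(alpha, [0.5, 0.5], xs, xs_plus, ys, 1.0, 0.5,
                                 linear_kernel, linear_kernel)
# An uneven beta keeps the constraints satisfied but lowers W through the extra term.
W_uneven = svm_plus_dual_objective(alpha, [1.0, 0.0], xs, xs_plus, ys, 1.0, 0.5,
                                   linear_kernel, linear_kernel)
```

Compared with the SVM dual, there are twice as many variables and two equality constraints instead of one, while the upper box bound $\alpha_i \leq C$ is replaced by the second equality constraint.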
The decision function for SVM+ has the form
$$\sum_{i=1}^{\ell} y_i \alpha_i K(\vec{x}_i, \vec{x}) + b = 0;$$
the correcting function has the form
$$\psi(z) = \frac{1}{\gamma} \sum_{i=1}^{\ell} (\alpha_i + \beta_i - C) K^{+}(\vec{x}_i^{+}, \vec{x}^{+}) + d.$$
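Given a solution $(\alpha, \beta, b, d)$, both functions can be evaluated as in the Python sketch below. The function names, linear kernels, and numerical values are illustrative assumptions and do not come from a trained model.

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def svm_plus_decision(x, alpha, b, xs, ys, kernel):
    """Decision function value: sum_i y_i alpha_i K(x_i, x) + b."""
    return sum(y * a * kernel(xi, x) for xi, y, a in zip(xs, ys, alpha)) + b

def svm_plus_correction(x_plus, alpha, beta, d, xs_plus, C, gamma, kernel_plus):
    """Correcting function value: (1 / gamma) * sum_i (alpha_i + beta_i - C) K+(x_i+, x+) + d."""
    total = sum(
        (a + bt - C) * kernel_plus(xp, x_plus)
        for a, bt, xp in zip(alpha, beta, xs_plus)
    )
    return total / gamma + d

def classify(x, alpha, b, xs, ys, kernel):
    """Predicted label; only the decision-space input x is needed."""
    return 1 if svm_plus_decision(x, alpha, b, xs, ys, kernel) >= 0 else -1

# Illustrative values only.
xs = [(1.0, 0.0), (-1.0, 0.0)]
xs_plus = [(1.0,), (2.0,)]
ys = [1, -1]
alpha, beta = [0.5, 0.5], [1.0, 0.0]

label = classify((2.0, 0.0), alpha, 0.0, xs, ys, linear_kernel)
psi = svm_plus_correction((1.0,), alpha, beta, 0.1, xs_plus, 1.0, 0.5, linear_kernel)
```

Note that `classify` never touches the correcting-space inputs, which matches the statement that the extra information is used only during training.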
In V. Vapnik, Estimation of Dependences Based on Empirical Data: Empirical Inference Science, Springer, 2006, a new setting of the learning problem was introduced (we call it learning using hidden information), where for the training stage one is given triplets $(y_1, \vec{x}_1, \vec{x}_1^{+}), \ldots, (y_\ell, \vec{x}_\ell, \vec{x}_\ell^{+})$, $y \in \{-1, 1\}$, with $\vec{x}$ and $\vec{x}^{+}$ vectors in their respective input spaces. It is required to construct a decision rule $y = f(\vec{x})$ that partitions the data $\{\vec{x}_1, \ldots, \vec{x}_\ell\}$ into two categories. The vector $\vec{x}^{+}$ can be considered a hint: it is available only for the training stage and is hidden for the test stage. The problem of learning using hidden information can be solved using the SVM+ method, where the vector $\vec{x}$ is mapped into the space $Z$ and the vector $\vec{x}^{+}$ is mapped into the space $Z^{+}$. In order to construct a decision rule, SVM+ takes the hidden information into account.