A Support Vector Machine (SVM) is a universal learning machine that, during a training phase, determines a decision surface or “hyperplane”. The decision hyperplane is determined by a set of support vectors selected from a training population of vectors and by a set of corresponding multipliers. The decision hyperplane is also characterised by a kernel function.
Subsequent to the training phase an SVM operates in a testing phase during which it is used to classify test vectors on the basis of the decision hyperplane previously determined during the training phase. A problem arises, however, because the complexity of the computations that must be undertaken to classify a test vector scales with the number of support vectors used to determine the hyperplane.
Support Vector Machines find application in many and varied fields. For example, in an article by S. Lyu and H. Farid entitled “Detecting Hidden Messages using Higher-Order Statistics and Support Vector Machines” (5th International Workshop on Information Hiding, Noordwijkerhout, The Netherlands, 2002) there is a description of the use of an SVM to discriminate between untouched and adulterated digital images.
Alternatively, in a paper by H. Kim and H. Park entitled “Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 3d local descriptor” (Proteins: structure, function and genetics, to be published) SVMs are applied to the problem of predicting high resolution 3D structure in order to study the docking of macro-molecules.
The mathematical basis of an SVM will now be explained. An SVM is a learning machine that is trained on m vectors x∈R^d, drawn independently from a probability distribution function p(x). For every input vector x_i the training data supplies an output value y_i, and training determines a function ƒ such that ƒ(x_i)=y_i.
The pairs (x_i, y_i), i = 1, . . . , m, are referred to as the training examples. The resulting function ƒ(x) determines the hyperplane, which is then used to estimate unknown mappings.
FIG. 1 illustrates the above method. Each of steps 24, 26 and 28 of FIG. 1 is well known in the prior art.
With some manipulation of the governing equations, the support vector machine can be phrased as the following quadratic programming problem:

    min W(a) = ½ aᵀΩa − aᵀe    (1)

where

    Ω_ij = y_i y_j K(x_i, x_j)    (2)

    e = [1, 1, 1, . . . , 1]ᵀ    (3)

subject to

    0 = aᵀy    (4)

    0 ≤ a_i ≤ C    (5)

where

    C is some regularization constant.    (6)
Here K(x_i, x_j) is the kernel function and can be viewed as a generalized inner product of two vectors. The result of training the SVM is the determination of the multipliers a_i.
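The quadratic programming problem of equations (1) to (6) can be sketched numerically as follows. This is a minimal illustration only: the Gaussian (RBF) kernel, the toy data, and the use of SciPy's SLSQP solver are assumptions made for the example and are not prescribed by the text above.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(u, v, gamma=1.0):
    # Gaussian (RBF) kernel, one possible generalized inner product
    return np.exp(-gamma * np.sum((u - v) ** 2))

def train_svm_dual(X, y, C=1.0, gamma=1.0):
    """Solve the dual QP of equations (1)-(6):
    min W(a) = 1/2 a^T Omega a - a^T e,
    subject to a^T y = 0 and 0 <= a_i <= C."""
    m = len(y)
    # Omega_ij = y_i y_j K(x_i, x_j), equation (2)
    Omega = np.array([[y[i] * y[j] * rbf_kernel(X[i], X[j], gamma)
                       for j in range(m)] for i in range(m)])
    e = np.ones(m)  # equation (3)

    def W(a):
        return 0.5 * a @ Omega @ a - a @ e  # equation (1)

    res = minimize(W, np.zeros(m), method="SLSQP",
                   bounds=[(0.0, C)] * m,                       # equation (5)
                   constraints=[{"type": "eq",
                                 "fun": lambda a: a @ y}])      # equation (4)
    return res.x  # the multipliers a_i

# toy problem: two well-separated clusters
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
a = train_svm_dual(X, y, C=10.0)
```

A general-purpose solver is used here only for concreteness; in practice dedicated QP or SMO-style solvers are employed for problems of realistic size.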
Suppose we train an SVM classifier with pattern vectors x_i, and that r of these vectors are determined to be support vectors; denote them by x_i, i = 1, 2, . . . , r. The decision hyperplane for pattern classification then takes the form
    ƒ(x) = ∑_{i=1}^{r} α_i y_i K(x, x_i) + b    (7)
where α_i is the Lagrange multiplier associated with pattern x_i and K(·, ·) is a kernel function that implicitly maps the pattern vectors into a suitable feature space. The bias b can be determined independently of the α_i. FIG. 2 illustrates in two dimensions the separation of two classes by a hyperplane 30. Note that all of the x's and o's contained within a rectangle in FIG. 2 are considered to be support vectors and would have associated non-zero α_i.
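Equation (7) can be evaluated directly once the support vectors, multipliers and bias are known. The following sketch assumes an RBF kernel and illustrative parameter values; neither choice is mandated by the text.

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    # Gaussian (RBF) kernel, assumed here for illustration
    return np.exp(-gamma * np.sum((u - v) ** 2))

def decision_function(x, sv, alpha, y, b, gamma=1.0):
    # f(x) = sum_i alpha_i y_i K(x, x_i) + b   -- equation (7)
    return sum(a_i * y_i * rbf_kernel(x, x_i, gamma)
               for a_i, y_i, x_i in zip(alpha, y, sv)) + b
```

Note that the cost of one evaluation is proportional to the number of support vectors r, which is the scaling problem identified earlier.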
Now suppose that support vector x_k is linearly dependent on the other support vectors in feature space, i.e.
    K(x, x_k) = ∑_{i=1, i≠k}^{r} c_i K(x, x_i)    (8)

where the c_i are scalars.
Then the decision surface defined by equation (7) can be written as
    ƒ(x) = ∑_{i=1, i≠k}^{r} α_i y_i K(x, x_i) + α_k y_k ∑_{i=1, i≠k}^{r} c_i K(x, x_i) + b    (9)
Now define γ_i by α_k y_k c_i = α_i y_i γ_i, so that (9) can be written
    ƒ(x) = ∑_{i=1, i≠k}^{r} α_i (1 + γ_i) y_i K(x, x_i) + b    (10)

         = ∑_{i=1, i≠k}^{r} α′_i y_i K(x, x_i) + b    (11)

where

    α′_i = α_i (1 + γ_i)    (12)
Comparing (11) and (7), we see that the linearly dependent support vector x_k is not required in the representation of the decision surface. Note that the Lagrange multipliers must be modified in order to obtain the simplified representation. This process (described in T. Downs, K. E. Gates, and A. Masters, “Exact simplification of support vector solutions”, Journal of Machine Learning Research, 2:293-297, 2001) is a successful way of reducing the number of support vectors after they have been calculated.
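The simplification above can be sketched as follows: the scalars c_i of equation (8) are found by a least-squares solve on the kernel matrix, and the remaining multipliers are updated as in equation (12). The function name, the tolerance, and the use of least squares are illustrative assumptions, and labels are assumed to be ±1; this is a sketch of the Downs-Gates-Masters idea, not their exact implementation.

```python
import numpy as np

def eliminate_dependent_sv(K, alpha, y, k, tol=1e-8):
    """If support vector k is linearly dependent on the others in
    feature space (equation (8)), fold its contribution into the
    remaining multipliers and drop vector k.

    Uses alpha'_i = alpha_i + alpha_k y_k c_i / y_i, which for labels
    y in {-1, +1} is equivalent to equation (12). Returns None when
    column k is not (numerically) dependent on the others."""
    idx = [i for i in range(K.shape[0]) if i != k]
    # solve K[:, idx] c = K[:, k] for the scalars c_i of equation (8)
    c, _, _, _ = np.linalg.lstsq(K[:, idx], K[:, k], rcond=None)
    if not np.allclose(K[:, idx] @ c, K[:, k], atol=tol):
        return None  # not dependent: no exact simplification possible
    alpha_new = alpha[idx] + alpha[k] * y[k] * c / y[idx]
    return idx, alpha_new
```

As noted in the text, the updated multipliers α′_i need not satisfy the original box constraints 0 ≤ α_i ≤ C; the decision surface, however, is unchanged.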
FIG. 3 depicts the same hyperplane as in FIG. 2, but this time the number of support vectors has been reduced to just two vectors 32 through the process of determining a linearly independent set of support vectors.
Given either (11) or (7), an unclassified sample vector x may be classified by calculating ƒ(x) and then returning −1 for all values less than zero and 1 for all values greater than zero.
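The sign rule just described can be stated as a one-line helper. The treatment of ƒ(x) exactly equal to zero is not specified in the text; the convention of mapping it to +1 below is an assumption.

```python
def classify(f_value):
    """Assign a class label from the decision value f(x) of equation
    (7) or (11): -1 for negative values, +1 for positive values.
    Mapping an exact zero to +1 is an assumed convention."""
    return -1 if f_value < 0 else 1
```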
FIG. 4 is a flow chart of a typical method employed by prior art SVMs for classifying an unknown vector. Steps 34 through 40 are defined in the literature and by equations (7) or (11).
As previously alluded to, because the sets of training vectors may be very large and the time involved to train the SVM may be excessive, it would be desirable if it were possible to undertake an a priori reduction of the training set before the calculation of the support vectors.
It will be realised from the above discussion that a reduced set of vectors might be arrived at by choosing only linearly independent vectors. The determination of the linearly independent support vectors may be undertaken by any method commonly in use in linear algebra. Common methods would be the calculation, with pivoting, of the reduced row echelon form, the QR factors, or the singular value decomposition. Any of these methods would give a set of r linearly independent vectors that could then be used to calculate the Lagrange multipliers and a decision surface similar to that defined by equation (7). A problem arises, however, in that it is not clear how to optimally select the support vectors that will be kept in the set.
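One of the methods mentioned, QR factorization with column pivoting, can be used to pick out a maximal linearly independent subset of the columns of the kernel matrix. The sketch below is illustrative only (the rank tolerance rule is an assumption); it identifies an independent set but does not address the open question of which such set is optimal.

```python
import numpy as np
from scipy.linalg import qr

def independent_vectors(K, tol=1e-10):
    """Select a maximal linearly independent subset of the columns of
    the kernel matrix K using QR factorization with column pivoting.
    Returns the (sorted) indices of the retained columns."""
    _, R, piv = qr(K, pivoting=True)
    # columns whose pivot on the diagonal of R is negligible are
    # linearly dependent on the preceding pivot columns
    diag = np.abs(np.diag(R))
    rank = int(np.sum(diag > tol * diag[0]))
    return sorted(int(i) for i in piv[:rank])
```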
It is an object of the present invention to provide an improved method for selecting support vectors in a Support Vector Machine.