1. Field of the Invention
The present invention relates to an SVM (Support Vector Machine) for classifying many objects based on their multiple characteristics and, more specifically, to a condensed SVM for high-speed training using a large collection of data.
2. Description of the Related Art
Given training data xi (here, i=1, 2, . . . , l) with labels yi of −1 or +1, a major task of SVM learning is to solve the following constrained quadratic programming problem (QP problem) of Formula 1.
[Formula 1]
$$\min_{\alpha} L(\alpha) = \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j) - \sum_{i=1}^{l} \alpha_i \quad \text{subject to} \quad \sum_{i=1}^{l} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C_i \ (i = 1, \ldots, l) \tag{1}$$
Here, K(xi, xj) is a kernel function that calculates the inner product of two vectors xi and xj in some feature space, and Ci (i=1, 2, . . . , l) is a parameter that penalizes noisy data among the given training data.
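As an illustration of Formula 1, the following minimal sketch evaluates the objective L(α) and checks the two constraints for a given coefficient vector. The Gaussian kernel, the function names (`rbf_kernel`, `dual_objective`, `is_feasible`), and the toy data are assumptions made for this example only; they are not taken from the text.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel matrix between the rows of A and B (an assumed choice of K)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def dual_objective(alpha, y, K):
    """L(alpha) = 1/2 * sum_ij y_i y_j alpha_i alpha_j K_ij - sum_i alpha_i."""
    v = alpha * y
    return 0.5 * v @ K @ v - alpha.sum()

def is_feasible(alpha, y, C, tol=1e-9):
    """Check sum_i y_i alpha_i = 0 and 0 <= alpha_i <= C_i."""
    return bool(abs(alpha @ y) <= tol
                and np.all(alpha >= -tol)
                and np.all(alpha <= C + tol))

# Toy data (hypothetical): four points with labels -1 / +1.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0, -1.0])
C = np.full(4, 10.0)          # one penalty parameter C_i per training datum
alpha = np.full(4, 0.5)       # a candidate solution, not necessarily optimal

K = rbf_kernel(X, X)
print(dual_objective(alpha, y, K), is_feasible(alpha, y, C))
```

A QP solver searches over all feasible α for the smallest value of `dual_objective`; the sketch only evaluates one candidate.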
When attempting to solve the above problem, the following difficulties arise as the number l of training data becomes large.
1) Memory capacity for storing the kernel matrix Kij=K(xi, xj) (here, i, j=1, 2, . . . , l). The size of the kernel matrix easily exceeds the memory capacity of a conventional computer.
2) Computational complexity of computing the kernel values Kij (i, j=1, 2, . . . , l).
3) Computational complexity of solving the QP problem.
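The first difficulty can be made concrete with a short calculation. The sketch below assumes that the kernel matrix is stored densely with 8-byte (double-precision) entries and that its symmetry is not exploited; the function name `kernel_matrix_bytes` is hypothetical.

```python
def kernel_matrix_bytes(l, bytes_per_entry=8):
    """Bytes needed to store a dense l x l kernel matrix (assumed storage layout)."""
    return l * l * bytes_per_entry

for l in (1_000, 10_000, 100_000, 1_000_000):
    gib = kernel_matrix_bytes(l) / 2**30
    print(f"l = {l:>9,d}: {gib:12.2f} GiB")
```

Under these assumptions, 100,000 training data already require 80,000,000,000 bytes (about 74.5 GiB), and every tenfold increase in l multiplies the requirement by one hundred.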
In the testing phase, the decision function f(x) of SVM is expressed by Formula 2, and is composed of a set of Ns training data xi (i=1, 2, . . . , Ns) called support vectors (SVs).
[Formula 2]
$$f(x) = \sum_{i=1}^{N_s} \alpha_i K(x_i, x) + b \tag{2}$$
The complexity of the decision function f(x) of the SVM increases linearly with the number Ns of support vectors. As this number becomes larger, the SVM becomes slower in the testing phase because more kernel values K(xi, x) (i=1, 2, . . . , Ns) must be computed.
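The linear dependence on Ns can be seen in a direct evaluation of Formula 2. The following is a minimal sketch, assuming a Gaussian kernel and signed coefficients αi (Formula 2 contains no explicit label term, so the labels are assumed to be absorbed into αi). The names `rbf`, `CountingKernel`, and `decision_function` and all numerical values are hypothetical.

```python
import math

def rbf(u, v, gamma=0.5):
    """Gaussian kernel; an assumed example of K, not prescribed by the text."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

class CountingKernel:
    """Wraps a kernel and counts how many times it is evaluated."""
    def __init__(self, kernel):
        self.kernel = kernel
        self.calls = 0

    def __call__(self, u, v):
        self.calls += 1
        return self.kernel(u, v)

def decision_function(x, support_vectors, alpha, b, kernel):
    """f(x) = sum_{i=1..Ns} alpha_i * K(x_i, x) + b, as in Formula 2."""
    return sum(a * kernel(sv, x) for a, sv in zip(alpha, support_vectors)) + b

support_vectors = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
alpha = [1.0, -1.5, 0.5]      # signed coefficients (assumed to absorb the labels)
k = CountingKernel(rbf)
value = decision_function((0.0, 0.0), support_vectors, alpha, b=0.25, kernel=k)
print(value, k.calls)         # k.calls equals the number of support vectors
```

Each test vector x costs exactly one kernel evaluation per support vector, which is why reducing the number of vectors in the decision function directly speeds up the testing phase.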
Therefore, conventionally, the following various methods have been proposed to improve the scalability of the support vector learning in both the training and testing phases.
1. Decomposition Algorithms (the Following Non-Patent Documents 2, 3, 4, and 5)
The decomposition method decomposes the original QP problem into a series of much smaller QP problems and then optimizes these sub-problems. The training data are divided into two parts: a set of active vectors and a set of inactive vectors. In the set of active vectors, also called the working set, the coefficients αi can be updated. In the set of inactive vectors, on the other hand, the coefficients αi are temporarily fixed. The optimization algorithm runs only on a small number of working data, not on the whole data set. This avoids memory requirements that grow with the square of the problem size and computational complexity that grows with its cube. In each optimization loop, the working data are updated to find a new SVM solution. The training (learning) process finishes when the optimality conditions are satisfied.
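The idea can be sketched with the smallest possible working set of two vectors, in the style of sequential minimal optimization (Non-Patent Document 4). This is a simplified illustration under stated assumptions, not the method of any of the cited documents: the working set is chosen as the pair that most violates the optimality conditions, a single penalty value C is used for all data, the kernel is Gaussian, and the names `rbf_column` and `train_decomposition` are hypothetical.

```python
import numpy as np

def rbf_column(X, i, gamma=0.5):
    """K(x_t, x_i) for all t, computed on demand so no l x l matrix is stored."""
    return np.exp(-gamma * np.sum((X - X[i]) ** 2, axis=1))

def train_decomposition(X, y, C=1.0, tol=1e-3, max_iter=10_000):
    """Minimize Formula 1 by repeatedly optimizing a two-element working set."""
    l = len(y)
    alpha = np.zeros(l)
    grad = -np.ones(l)                      # gradient of L(alpha) at alpha = 0
    for _ in range(max_iter):
        score = -y * grad
        # Coefficients that may still move in each feasible direction.
        up = ((y > 0) & (alpha < C)) | ((y < 0) & (alpha > 0))
        low = ((y < 0) & (alpha < C)) | ((y > 0) & (alpha > 0))
        i = np.flatnonzero(up)[np.argmax(score[up])]
        j = np.flatnonzero(low)[np.argmin(score[low])]
        if score[i] - score[j] < tol:       # optimality conditions satisfied
            break
        Ki, Kj = rbf_column(X, i), rbf_column(X, j)
        curvature = max(Ki[i] + Kj[j] - 2.0 * Ki[j], 1e-12)
        step = (score[i] - score[j]) / curvature
        # Clip the step so that both coefficients stay inside [0, C].
        step = min(step,
                   C - alpha[i] if y[i] > 0 else alpha[i],
                   alpha[j] if y[j] > 0 else C - alpha[j])
        alpha[i] += y[i] * step             # active vectors are updated;
        alpha[j] -= y[j] * step             # all other alpha stay fixed
        alpha[[i, j]] = np.clip(alpha[[i, j]], 0.0, C)
        grad += y * step * (Ki - Kj)        # incremental gradient update
    return alpha

X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
print(train_decomposition(X, y, C=10.0))
```

Only two kernel columns are needed per iteration, so memory use is linear in the number of training data. The price is a larger number of iterations, which is the convergence issue that arises for very large data sets.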
2. Parallelization (the Following Non-Patent Documents 6 and 7)
The training speed of SVM can be improved effectively by using a parallel algorithm running on a parallel computer.
3. Data Sampling (the Following Non-Patent Documents 8, 9, and 10)
Various methods for selecting important training data have been proposed to reduce the size of the optimization problem of Formula 1. An SVM which is learned from a small amount of data can have good performance in many cases.
4. Reduced Set Method for SVM Simplification (the Following Non-Patent Documents 11 and 12)
To increase the speed of SVM in the testing phase, a reduced set method replaces the SVM decision function (see Formula 2) having Ns SVs with a simplified SVM decision function consisting of Nz vectors called reduced vectors (Nz<Ns). It has been shown in practice that the reduced set method can produce a simplified SVM with performance similar to that of the conventional SVM.
Non-Patent Document 1: C. Cortes and V. Vapnik, “Support vector networks,” Machine Learning, vol. 20, pp. 273-297, 1995.
Non-Patent Document 2: E. Osuna, R. Freund, and F. Girosi, “An improved training algorithm for support vector machines,” in Neural Networks for Signal Processing VII: Proceedings of the 1997 IEEE Workshop, J. Principe, L. Giles, N. Morgan, and E. Wilson, Eds., New York, pp. 276-285, 1997.
Non-Patent Document 3: T. Joachims, “Making large-scale support vector machine learning practical,” in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, Eds., MIT Press, Cambridge, Mass., 1998.
Non-Patent Document 4: J. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds., MIT Press, Cambridge, Mass., 1999.
Non-Patent Document 5: D. D. Nguyen, K. Matsumoto, Y. Takishima, K. Hashimoto, and M. Terabe, “Two-stage incremental working set selection for fast support vector training on large datasets,” in IEEE International Conference on Research, Innovation and Vision for the Future (RIVF 2008), pp. 221-226, 13-17 Jul. 2008.
Non-Patent Document 6: R. Collobert, S. Bengio, and Y. Bengio, “A parallel mixture of SVMs for very large scale problems,” Neural Computation, vol. 14, no. 5, pp. 1105-1114, 2002.
Non-Patent Document 7: H. P. Graf, E. Cosatto, L. Bottou, I. Durdanovic, and V. Vapnik, “Parallel support vector machines: The Cascade SVM,” in Advances in Neural Information Processing Systems, L. Saul, Y. Weiss, and L. Bottou, Eds., vol. 17, MIT Press, 2005.
Non-Patent Document 8: Y.-J. Lee and O. L. Mangasarian, “RSVM: Reduced support vector machines,” in Proceedings of the First SIAM International Conference on Data Mining, Morgan Kaufmann, San Francisco, Calif., 2001.
Non-Patent Document 9: A. Bordes, S. Ertekin, J. Weston, and L. Bottou, “Fast kernel classifiers with online and active learning,” Journal of Machine Learning Research, vol. 6, pp. 1579-1619, 2005.
Non-Patent Document 10: I. W. Tsang, J. T. Kwok, and P.-M. Cheung, “Core vector machines: Fast SVM training on very large data sets,” Journal of Machine Learning Research, vol. 6, pp. 363-392, 2005.
Non-Patent Document 11: C. J. C. Burges, “Simplified support vector decision rules,” in Proc. 13th International Conference on Machine Learning, San Mateo, Calif., pp. 71-77, 1996.
Non-Patent Document 12: D. D. Nguyen and T. B. Ho, “A bottom-up method for simplifying support vector solutions,” IEEE Transactions on Neural Networks, vol. 17, no. 3, pp. 792-796, 2006.
The above methods have the following problems:
1. Decomposition Algorithms
When working on a large amount of data (e.g., more than 100,000 training data), the convergence speed becomes slow. Computational complexity grows with the cube of the number of support vectors, and the required memory capacity grows with the square of the number of support vectors.
2. Parallelization
Even when an algorithm is designed to keep the communication cost reasonable, questions remain in practice regarding computing capability and kernel caching. Moreover, improving the speed of optimization through parallelization is difficult because of dependencies between computation steps.
3. Data Sampling
The biggest issue with this approach is the degraded performance of the trained SVM, because only limited information (a part of the training data) is used for optimization. In addition, it is difficult to select a suitable sampling method for each practical application.
4. Reduced Set Method for SVM Simplification
The reduced set method works on the assumption that the SVM has already been trained by a training algorithm, and the task of the method is to retrain this machine. Moreover, it is necessary to retrain the simplified SVM by minimizing a function of (d+1)Nz variables (d is the dimension of the training vectors). This is not an easy task, especially when the number Nz of reduced vectors is large.
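For reference, the function to be minimized can be written out as follows. This is a sketch of the formulation commonly used in the reduced set literature (see Non-Patent Document 11), given here as an assumption rather than as part of the above description. The symbols Ψ, Ψ′, zj, and βj are notation introduced for this illustration: Ψ and Ψ′ are the feature-space weight vectors of the original and simplified SVMs, zj are the reduced vectors, and βj are their coefficients. The αi are the coefficients of Formula 2.

$$\|\Psi - \Psi'\|^2 = \sum_{i,j=1}^{N_s} \alpha_i \alpha_j K(x_i, x_j) - 2\sum_{i=1}^{N_s}\sum_{j=1}^{N_z} \alpha_i \beta_j K(x_i, z_j) + \sum_{j,k=1}^{N_z} \beta_j \beta_k K(z_j, z_k)$$

Each reduced vector zj contributes d unknown components and each coefficient βj contributes one more, which gives the (d+1)Nz variables mentioned above.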