A support vector machine (SVM) is often used for binary classification of data. The SVM can work with either a linear or a nonlinear kernel. The SVM constructs a hyperplane, or a set of hyperplanes, in a high-dimensional or infinite-dimensional space, which can be used for classification, regression, or other tasks. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data of any class, because, in general, the larger the margin, the lower the generalization error of the classifier. The radial basis function SVM is able to approximate any decision boundary between two classes because it constructs the separating hyperplane in a higher-dimensional, possibly infinite-dimensional, space.
However, the SVM does not scale well with the nonlinear kernel when the amount of training data is very large. Training a nonlinear SVM, such as a radial basis function SVM, can take days, and testing unknown data with a nonlinear SVM can be too slow to incorporate into a real application. This is not the case if the linear kernel is used, especially when the dimensionality of the data is small.
If the kernel uses a Gaussian radial basis function, then the corresponding feature space is a Hilbert space of infinite dimension. For nonlinear kernels, one must perform N kernel evaluations, where N is the number of support vectors, to determine the projection of the unknown data onto the normal vector of the hyperplane in the reproducing kernel Hilbert space (RKHS). This is because one does not have direct access to the RKHS, which might be infinite dimensional, but only indirect access through the inner products provided by the kernel function.
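The cost difference described above can be illustrated with a minimal sketch. The function names, the parameter gamma, and the convention that each dual coefficient already includes the class label are assumptions made for illustration only; they are not taken from any particular SVM implementation.

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    """Gaussian RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = x - z
    return np.exp(-gamma * np.dot(diff, diff))

def kernel_decision(x, support_vectors, dual_coefs, bias, gamma):
    """Nonlinear decision value: one kernel evaluation per support vector.

    dual_coefs[i] is assumed to hold alpha_i * y_i for support vector i.
    """
    total = bias
    for s_i, c_i in zip(support_vectors, dual_coefs):
        total += c_i * rbf_kernel(s_i, x, gamma)  # N evaluations in total
    return total

def linear_decision(x, w, bias):
    """Linear decision value: a single projection onto the normal vector w."""
    return np.dot(w, x) + bias

# Toy usage with made-up numbers (not trained values).
support_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
dual_coefs = np.array([0.7, -0.4, 0.2])
x_new = np.array([0.5, 0.5])
print(kernel_decision(x_new, support_vectors, dual_coefs, bias=0.1, gamma=1.0))
print(linear_decision(x_new, w=np.array([0.3, -0.2]), bias=0.1))
```

In this sketch the nonlinear classifier loops over all N support vectors for every unknown sample, whereas the linear classifier needs one inner product whose cost depends only on the dimensionality of the data.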
One possible solution is to factor the kernel matrix and use the columns of the factor matrix as features with the linear kernel, which avoids the computational complexity of the nonlinear kernel. Classification of unknown data with the linear kernel is fast because one only needs to project the unknown data onto the normal vector of the separating hyperplane between the two data classes.
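One way such a factorization could look is sketched below, assuming an eigendecomposition of a symmetric positive semidefinite kernel matrix K so that K is approximately F^T F, with column i of F serving as the feature vector of training sample i. The choice of eigendecomposition, the function names, and the optional rank parameter are illustrative assumptions; other factorizations (for example, Cholesky) would serve equally well, and mapping unseen data into the same feature space is not covered here.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """Pairwise Gaussian RBF kernel matrix for the rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def factor_kernel(K, rank=None):
    """Return F such that K is approximately F.T @ F.

    Column i of F is the feature vector of training sample i.
    If rank is given, only the largest `rank` eigenpairs are kept.
    """
    eigvals, eigvecs = np.linalg.eigh(K)       # eigenvalues in ascending order
    eigvals = np.clip(eigvals, 0.0, None)      # guard against tiny negative values
    if rank is not None:
        eigvals = eigvals[-rank:]
        eigvecs = eigvecs[:, -rank:]
    return np.sqrt(eigvals)[:, None] * eigvecs.T

# Toy usage: the linear kernel on the columns of F reproduces K.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = rbf_kernel_matrix(X, gamma=0.5)
F = factor_kernel(K)
print(np.allclose(F.T @ F, K, atol=1e-8))
```

A linear SVM trained on the columns of F then sees the same inner products as the nonlinear SVM sees through the kernel on the training data.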
Another solution is to approximate the kernel as an inner product of the transformed data in a space of random, data-blind Fourier features, and then to train a linear SVM on these features. It is known that with this approach, the necessary number of Fourier features is relatively small, i.e., the dimensionality of the space of Fourier features is low, and training with the linear kernel is considerably faster than training with the nonlinear kernel. However, because the features are data blind, the trained SVM cannot use any data priors, and the classification performance is limited.
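A minimal sketch of random Fourier features for the Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2) follows. The function name make_rff_map, the parameter names, and the specific cosine-with-random-phase construction are assumptions chosen for illustration. Note that the random frequencies are drawn without looking at the data, which is what makes the features data blind.

```python
import numpy as np

def make_rff_map(input_dim, n_features, gamma, seed=0):
    """Build a random Fourier feature map for the Gaussian RBF kernel.

    Frequencies are drawn from N(0, 2 * gamma * I) and phases uniformly
    from [0, 2*pi), independently of any training data.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_features, input_dim))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def transform(X):
        # Each row of X maps to an n_features-dimensional vector whose
        # inner products approximate the kernel values.
        return np.sqrt(2.0 / n_features) * np.cos(X @ W.T + b)

    return transform

# Toy usage: compare approximate and exact kernel values.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
phi = make_rff_map(input_dim=3, n_features=2000, gamma=0.5)
Z = phi(X)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
print(np.max(np.abs(Z @ Z.T - np.exp(-0.5 * sq_dists))))
```

A linear SVM can then be trained on the rows of Z, and an unknown sample is classified by applying the same map followed by a single inner product with the learned normal vector.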
For some applications, the computational complexity of classification is extremely high. Therefore, it is desired to reduce the time for data classification while achieving classification performance as high as that of nonlinear kernels.