Given a dataset $S = (\underline{x}_1, y_1), (\underline{x}_2, y_2), \ldots, (\underline{x}_n, y_n)$ of $n$ points in $d$ dimensions, the inductive learning task is to build a function $f(\underline{x})$ that, given a new point $\underline{x}$, can predict the associated $y$ value. A common framework for solving such problems is Tikhonov regularization (Evgeniou et al., 2000), in which the following minimization problem is solved:
$$\min_{f \in H} \; \frac{1}{n} \sum_{i=1}^{n} V\big(f(\underline{x}_i), y_i\big) + \lambda \|f\|_K^2$$
Here, $H$ is a Reproducing Kernel Hilbert Space or RKHS (Aronszajn, 1950) with associated kernel function $K$, $\|f\|_K^2$ is the squared norm in the RKHS, $\lambda$ is a regularization constant controlling the tradeoff between fitting the training set accurately and forcing smoothness of $f$ (in the RKHS norm), and $V(f(\underline{x}), y)$ is a loss function representing the price paid when $\underline{x}$ is seen and $f(\underline{x})$ is predicted, but the actual associated value is $y$. Different choices of $V$ give rise to different learning schemes (Evgeniou et al., 2000). In particular, the hinge loss $V(f(\underline{x}), y) = \max(1 - y f(\underline{x}), 0)$ induces the well-known support vector machine, also referred to as SVM (Vapnik, 1998), while the square loss $V(f(\underline{x}), y) = (f(\underline{x}) - y)^2$ induces a simple regularized least squares classifier (Wahba, 1990).
For a wide range of loss functions, including the square loss, the so-called Representer Theorem proves that the solution to the Tikhonov minimization problem will have the following form (Schölkopf et al., 2001; Wahba, 1990):
$$f_s(\underline{x}) = \sum_{i=1}^{n} c_i K(\underline{x}, \underline{x}_i)$$
The Representer Theorem reduces the infinite-dimensional problem of finding a function in an RKHS to the $n$-dimensional problem of finding the coefficients $c_i$. For Regularized Least Squares (RLS), the loss function is differentiable, and simple algebra shows that the coefficients $c$ can be found by solving the linear system $(K + \lambda n I)c = y$, where, via a commonly used and accepted notational shortcut, $K$ is the $n$ by $n$ matrix satisfying $K_{ij} = K(\underline{x}_i, \underline{x}_j)$.
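The "simple algebra" can be sketched as follows: substituting the Representer Theorem form into the objective with the square loss gives $\frac{1}{n}\|Kc - y\|^2 + \lambda c^T K c$, and setting its gradient with respect to $c$ to zero yields $K\big((K + \lambda n I)c - y\big) = 0$, which is satisfied by any solution of $(K + \lambda n I)c = y$. The Python sketch below illustrates that system on toy data. It is a minimal illustration under stated assumptions, not a reference implementation: the use of NumPy, the Gaussian kernel, the bandwidth `sigma`, the function names, and the synthetic data are all choices made here for the example.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2)).

    The Gaussian kernel is an assumption of this sketch; any valid
    kernel function K could be substituted.
    """
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def rls_fit(X, y, lam, sigma=1.0):
    """Solve (K + lam * n * I) c = y for the coefficient vector c."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def rls_predict(X_train, c, X_new, sigma=1.0):
    """Evaluate f(x) = sum_i c_i K(x, x_i) at each row of X_new."""
    return gaussian_kernel(X_new, X_train, sigma) @ c

# Toy data: 50 points in 3 dimensions with +/-1 labels (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)

c = rls_fit(X, y, lam=1e-3)
scores = rls_predict(X, c, X)   # real-valued outputs f(x)
labels = np.sign(scores)        # for classification, threshold at zero
```

For regression, `scores` would be used directly; for classification, only the final thresholding step is added, and the fitting code is unchanged.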
The RLS algorithm has several attractive properties. It is conceptually simple, and can be implemented efficiently in a few lines of MATLAB code. It places regression and classification in exactly the same framework (i.e., the same code is used to solve regression and classification problems). Intuitively, it may seem that the square loss function, while well-suited for regression, is a poor choice for classification problems when compared to the hinge loss. In particular, given a point $(\underline{x}, y = 1)$, the square loss penalizes $f(\underline{x}) = 10$ and $f(\underline{x}) = -8$ equally (each incurs a loss of 81), while the SVM's hinge loss penalizes only the latter choice. However, across a wide range of problems, the regularized least squares classifier (RLSC) performs as well as the SVM (Rifkin, 2002; Rifkin & Klautau, 2004).
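The difference between the two losses can be checked with a few lines of Python. The square loss depends only on the residual between prediction and label, so predictions equally far above and below the label are penalized equally, whereas the hinge loss is zero for any confidently correct prediction. The function names and the chosen values below are illustrative assumptions, with predictions placed 9 above and 9 below the label $y = 1$.

```python
def square_loss(f_x, y):
    """Square loss V(f(x), y) = (f(x) - y)^2."""
    return (f_x - y) ** 2

def hinge_loss(f_x, y):
    """Hinge loss V(f(x), y) = max(1 - y * f(x), 0)."""
    return max(1 - y * f_x, 0)

y = 1
for f_x in (y + 9, y - 9):  # predictions 10 and -8
    print(f_x, square_loss(f_x, y), hinge_loss(f_x, y))
# prints:
# 10 81 0
# -8 81 9
```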
On the other hand, for large datasets and non-linear kernels, RLS has a serious problem as compared to the more popular SVM. In more detail, direct methods for solving RLS manipulate the entire kernel matrix and therefore require $O(n^3)$ time and (even worse) $O(n^2)$ space. These problems can be alleviated by using iterative methods such as conjugate gradient, which require only matrix-vector products. However, recomputing the kernel matrix at each iteration requires $O(n^2 d)$ work, which can be prohibitive if $d$ is large. The SVM, on the other hand, is quite attractive computationally, as the flat loss function for correctly classified points induces a sparse solution in which all the correctly classified points have zero coefficients $c_i$. Kernel products between pairs of training points both of which are correctly classified at all times during training are never computed by state-of-the-art SVM algorithms. In practice, SVM algorithms generate only a small fraction of the kernel matrix, making SVMs much more suited than RLS to large non-linear problems.
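To make the role of matrix-vector products concrete, the following is a minimal textbook conjugate gradient sketch in Python with NumPy, applied to the RLS system $(K + \lambda n I)c = y$. The solver touches the system matrix only through a user-supplied `matvec` callable. The function name, tolerance, toy data, and Gaussian kernel are assumptions of this example. For brevity the demonstration stores $K$ explicitly; a memory-limited variant would instead recompute the needed kernel entries inside `matvec`, which is where the $O(n^2 d)$ cost per iteration arises.

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    A is never formed here; only the map v -> A v is required.
    """
    x = np.zeros_like(b)
    r = b - matvec(x)          # residual
    p = r.copy()               # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        # Stop once the relative residual norm is small enough.
        if np.sqrt(rs_old) <= tol * np.linalg.norm(b):
            break
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy problem (hypothetical data): Gaussian kernel with unit bandwidth.
rng = np.random.default_rng(0)
n, lam = 40, 1e-2
X = rng.normal(size=(n, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-sq_dists / 2.0)

# The regularized system is applied as K v + lam * n * v.
c_cg = conjugate_gradient(lambda v: K @ v + lam * n * v, y)
c_direct = np.linalg.solve(K + lam * n * np.eye(n), y)
```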
When a linear kernel $K(\underline{x}_i, \underline{x}_j) = \underline{x}_i \cdot \underline{x}_j$ is used, however, the relative effectiveness of SVMs and RLSCs is reversed. Solving an RLSC problem for $c$ becomes an $O(nd^2)$ time and $O(nd)$ memory problem, either directly using the Sherman-Morrison-Woodbury formula or iteratively via conjugate gradient (Golub & Van Loan, 1996). The SVM also becomes somewhat faster with a linear kernel, but the difference is not as dramatic, because state-of-the-art SVM algorithms are coordinate ascent algorithms that work at the boundary of the feasible region, optimizing a few coefficients while holding the rest fixed, and therefore require explicit entries of the kernel matrix $K$. Some attempts at interior point SVMs, which can be represented in terms of matrix-vector products and could therefore take full advantage of the savings offered by a linear kernel, have been made (Fine & Scheinberg, 2001). However, these approaches have not yet achieved the same performance as state-of-the-art linear SVMs, and RLS classification remains the fastest way to train a regularized linear classifier.
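One way to see where the $O(nd^2)$ cost comes from is to note that, with a linear kernel, the learned function is $f(\underline{x}) = \underline{w} \cdot \underline{x}$ with $\underline{w} = X^T c$, where $X$ is the $n$ by $d$ data matrix. The identity $X^T (XX^T + \lambda n I_n)^{-1} = (X^T X + \lambda n I_d)^{-1} X^T$, which follows from the Sherman-Morrison-Woodbury formula, then lets $\underline{w}$ be obtained from a $d$ by $d$ system instead of an $n$ by $n$ one. The NumPy sketch below compares the two routes on toy data. It is an illustration of this identity under assumed names and synthetic data, not necessarily the exact procedure the cited works use.

```python
import numpy as np

def linear_rls_primal(X, y, lam):
    """Solve the d-by-d system (X^T X + lam * n * I) w = X^T y.

    Forming X^T X costs O(n d^2); the solve costs O(d^3).
    """
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

def linear_rls_dual(X, y, lam):
    """Solve the n-by-n system (X X^T + lam * n * I) c = y.

    Shown only for comparison; this costs O(n^3) time and O(n^2) space.
    """
    n = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * n * np.eye(n), y)

# Toy data (hypothetical): 200 points in 5 dimensions with +/-1 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X @ np.arange(1.0, 6.0) > 0, 1.0, -1.0)

w = linear_rls_primal(X, y, lam=0.1)
c = linear_rls_dual(X, y, lam=0.1)
# Both routes describe the same function: w equals X^T c up to round-off.
```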
What is needed, therefore, are techniques that make using regularized least squares more practical.