In recent years, machine-learning approaches for data analysis have been widely explored for recognizing patterns which, in turn, allow extraction of significant features within a large amount of data that often contains irrelevant detail. Learning machines comprise algorithms that may be trained to generalize. Trained learning machine algorithms may then be applied to predict the outcome in cases of unknown outcome. Machine-learning approaches, which include neural networks, hidden Markov models, belief networks, support vector and other kernel-based machines, are ideally suited for domains characterized by the existence of large amounts of data, noisy patterns and the absence of general theories.
To date, the majority of learning machines that have been applied to data analysis are neural networks trained using back-propagation, a gradient-based method in which errors in classification of training data are propagated backwards through the network to adjust the bias weights of the network elements until the mean squared error is minimized. A significant drawback of back-propagation neural networks is that the empirical risk function may have many local minima, which can easily keep the optimal solution from being found. Standard optimization procedures employed by back-propagation neural networks may converge to a minimum, but the neural network method cannot guarantee that even a localized minimum is attained, much less the desired global minimum. The quality of the solution obtained from a neural network depends on many factors. In particular, the skill of the practitioner implementing the neural network determines the ultimate benefit, but even factors as seemingly benign as the random selection of initial weights can lead to poor results. Furthermore, the convergence of the gradient-based method used in neural network learning is inherently slow. A further drawback is that the sigmoid function has a scaling factor, which affects the quality of approximation. Possibly the largest limiting factor of neural networks as related to knowledge discovery is the “curse of dimensionality” associated with the disproportionate growth in required computational time and power for each additional feature or dimension in the training data.
Kernel methods, based on statistical learning theory, are used for their conceptual simplicity as well as their remarkable performance. Support vector machines, kernel PCA (principal component analysis), kernel Gram-Schmidt, kernel Fisher discriminant, Bayes point machines, and Gaussian processes are just a few of the algorithms that make use of kernels for problems of classification, regression, density estimation and clustering. Kernel machines can operate in extremely rich feature spaces with low computational cost, in some cases accessing spaces that would be inaccessible to standard systems, e.g., gradient-based neural networks, due to their high dimensionality.
Kernel methods operate by mapping data into a high dimensional feature space then applying one of many available general-purpose algorithms suitable for work in conjunction with kernels. Put simply, the kernel virtually maps data into a feature space so that the relative positions of the data in feature space can be used as the means for evaluating, e.g., classifying, the data. The degree of clustering achieved in the feature space, and the relation between the clusters and the labeling to be learned, should be captured by the kernel.
Kernel methods exploit information about pairwise similarity between data points. “Similarity” is defined as the inner product between two points in a suitable feature space, information that can be obtained with little computational cost. The mapping into feature space is achieved in an implicit way: the algorithms are rewritten to need only inner product information between input points. The inner product is then replaced with a generalized inner product, or “kernel function”. This function returns the value of an inner product between feature vectors representing images of the inputs in some feature space.
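By way of illustration only, the following minimal Python sketch shows the implicit mapping for one simple case: on two-dimensional inputs, the kernel k(x,z)=⟨x,z⟩² returns the same value as an inner product between explicitly computed feature vectors. The names `dot`, `phi` and `k`, the choice of kernel, and the sample points are assumptions introduced for this example and do not appear in the description above.

```python
import math


def dot(x, z):
    """Ordinary inner product between two equal-length sequences."""
    return sum(a * b for a, b in zip(x, z))


def phi(x):
    """Explicit degree-2 feature map for a 2-D input (illustrative choice)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def k(x, z):
    """Kernel function: needs only the inner product of the raw inputs."""
    return dot(x, z) ** 2


x = (1.0, 2.0)
z = (3.0, 4.0)

implicit_value = k(x, z)              # never builds a feature vector
explicit_value = dot(phi(x), phi(z))  # builds both feature vectors first

print(implicit_value, explicit_value)
```

The two values agree up to floating-point rounding, although only the second computation ever constructs a feature vector.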
While the learning module is general purpose, the kernel is problem specific. It is the kernel that makes it possible to effectively work in very rich feature spaces, provided the inner products can be computed. By developing algorithms that use only the inner products, it is possible to avoid the need to compute the feature vector for a given input. One of the key advantages to this approach is its modularity: the decoupling of algorithm design and statistical analysis from the problem of creating appropriate function/feature spaces for a particular application.
Defining the appropriate kernel function allows one to use a range of different algorithms to analyze the data while, at the same time, avoiding many practical prediction problems. It is crucial for the performance of a system that the kernel function fits the learning target in some way, i.e., that in the feature space, the data distribution is somehow correlated to the label distribution. Measuring the similarity between two kernels, or the degree of agreement between a kernel and a given target function is, therefore, an important problem.
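The passage above does not prescribe a particular measure of agreement. As an assumption made purely for illustration, the sketch below uses one measure sometimes employed for this purpose: the normalized Frobenius inner product between two matrices of pairwise kernel values, with the target represented by the outer product of the label vector. The names `frobenius` and `alignment`, and the one-dimensional sample data, are hypothetical.

```python
import math


def frobenius(A, B):
    """Frobenius inner product: sum of element-wise products of two matrices."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def alignment(K1, K2):
    """Normalized Frobenius inner product (assumed measure, value at most 1)."""
    return frobenius(K1, K2) / math.sqrt(frobenius(K1, K1) * frobenius(K2, K2))


# Hypothetical one-dimensional inputs and their +1/-1 labels.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1.0, -1.0, 1.0, 1.0]

# Pairwise kernel values for the kernel k(x, z) = x * z.
K = [[xi * xj for xj in xs] for xi in xs]
# Pairwise values of the target: +1 for equal labels, -1 otherwise.
T = [[yi * yj for yj in ys] for yi in ys]

score = alignment(K, T)
print(score)
```

A score near 1 indicates that pairs the kernel treats as similar tend to share a label; a matrix compared with itself scores exactly 1 up to rounding.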
For a given application, selection of a kernel corresponds to implicitly choosing a feature space since the kernel function is defined by
$$k(x,z)=\langle \phi(x),\phi(z)\rangle \tag{1}$$
for the feature map $\phi$. Given a training set $S=\{x_1, x_2, \ldots, x_m\}$, the information available to kernel-based algorithms is contained entirely in the matrix of inner products
$$G=K=\bigl(k(x_i,x_j)\bigr)_{i,j=1}^{m}, \tag{2}$$
known as the Gram matrix $G$ or the kernel matrix $K$. This matrix encodes the similarity level between all pairs of data items induced by the kernel.
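The construction of the Gram matrix from a training set can be sketched as follows. This is a minimal example under stated assumptions: the names `linear_kernel` and `gram_matrix`, the use of the plain inner product as the kernel, and the three sample points are all illustrative choices rather than part of the description above.

```python
def linear_kernel(x, z):
    """Plain inner product, used here as the simplest possible kernel."""
    return sum(a * b for a, b in zip(x, z))


def gram_matrix(points, kernel):
    """Return the m-by-m matrix whose (i, j) entry is kernel(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in points] for xi in points]


# Hypothetical training set of m = 3 two-dimensional points.
S = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
G = gram_matrix(S, linear_kernel)

for row in G:
    print(row)
```

Any algorithm that receives only `G` has no access to the coordinates in `S`, which is the sense in which the matrix carries all of the information available to a kernel-based method.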
Kernels can be used without actually having the feature space F explicitly defined, as long as one can guarantee that such a space exists, i.e., that the kernel can actually be regarded as an inner product in some space.
A kernel can be characterized in many ways. One of the simplest is that a function k(x,z) is a valid kernel if and only if it always produces symmetric and positive semi-definite Gram matrices for any finite set of data. Given an explicit feature map φ, Equation 1, above, can be used to compute the corresponding kernel. Often, however, methods are sought to directly provide the value of the kernel without explicitly computing φ. This enables one to use extremely rich feature spaces, even infinite dimensional ones, at least from a computational perspective.
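The validity condition above implies that, for a valid kernel, the quadratic form cᵀGc is non-negative for every coefficient vector c. The sketch below spot-checks this for a few hand-picked vectors; it is a necessary-condition check only, not a proof of validity. The names `quadratic_form`, `linear_kernel` and `euclidean_distance`, the sample points, and the coefficient vectors are assumptions made for illustration. The Euclidean distance is included as a symmetric function that is not a valid kernel.

```python
import math


def linear_kernel(x, z):
    """Plain inner product: a valid kernel."""
    return sum(a * b for a, b in zip(x, z))


def euclidean_distance(x, z):
    """Symmetric, but NOT a valid kernel (used here as a counterexample)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))


def gram_matrix(points, kernel):
    return [[kernel(xi, xj) for xj in points] for xi in points]


def quadratic_form(G, c):
    """Compute c^T G c for a square matrix G and coefficient vector c."""
    n = len(G)
    return sum(c[i] * G[i][j] * c[j] for i in range(n) for j in range(n))


points = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
coefficient_vectors = [(1.0, -1.0, 0.0), (1.0, 1.0, -2.0), (0.5, -1.0, 1.0)]

G_valid = gram_matrix(points, linear_kernel)
G_invalid = gram_matrix(points, euclidean_distance)

valid_values = [quadratic_form(G_valid, c) for c in coefficient_vectors]
invalid_value = quadratic_form(G_invalid, (1.0, -1.0, 0.0))

print(valid_values)   # every entry is non-negative
print(invalid_value)  # negative, so the distance function fails the condition
```

A single negative value is enough to rule a function out, whereas non-negative values on a handful of vectors merely fail to rule it out.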
Starting with the simple inner-product kernel $k(x,z)=\langle x,z\rangle$, one can define more complex kernels, the best known of which is the polynomial kernel. Given a kernel $k$, the polynomial construction creates a kernel $\hat{k}$ by applying a polynomial with positive coefficients to $k$. For example,
$$\hat{k}(x,z)=\bigl(k(x,z)+D\bigr)^{p}, \tag{3}$$
for fixed values of $D$ and integer $p$. If the feature space of $k$ is $F$, then the feature space of $\hat{k}$ is indexed by $t$-tuples of features from $F$, for $t=0, 1, \ldots, p$. Hence, for a relatively small computational cost, the algorithms can be applied in a feature space of vastly expanded expressive power. Further, the example of the Gaussian kernel $\bar{k}$ can be considered:
$$\bar{k}(x,z)=\exp\left(-\frac{k(x,x)+k(z,z)-2k(x,z)}{\sigma^{2}}\right), \tag{4}$$
with a feature space of infinitely many dimensions. Other kernels include sigmoid, $B_n$-spline of odd order, and radial basis function (RBF) kernels.
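Both constructions take an existing kernel and return a new one, which can be sketched with higher-order functions. This is a minimal example under stated assumptions: the Gaussian form is taken to carry the negative exponent of the standard Gaussian kernel, with the base kernel's induced squared distance k(x,x)+k(z,z)-2k(x,z) divided by σ². The names `polynomial_kernel`, `gaussian_kernel` and `linear_kernel`, and the parameter values, are illustrative only.

```python
import math


def linear_kernel(x, z):
    """Base kernel: the plain inner product."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(k, D, p):
    """Build k_hat(x, z) = (k(x, z) + D) ** p from a base kernel k."""
    def k_hat(x, z):
        return (k(x, z) + D) ** p
    return k_hat


def gaussian_kernel(k, sigma):
    """Build k_bar(x, z) = exp(-(k(x,x) + k(z,z) - 2 k(x,z)) / sigma**2)."""
    def k_bar(x, z):
        squared_distance = k(x, x) + k(z, z) - 2.0 * k(x, z)
        return math.exp(-squared_distance / sigma ** 2)
    return k_bar


x = (1.0, 2.0)
z = (3.0, 4.0)

k_hat = polynomial_kernel(linear_kernel, D=1.0, p=2)
k_bar = gaussian_kernel(linear_kernel, sigma=2.0)

poly_value = k_hat(x, z)    # (11 + 1) ** 2
gauss_value = k_bar(x, z)   # exp(-8 / 4)
gauss_self = k_bar(x, x)    # a point is maximally similar to itself

print(poly_value, gauss_value, gauss_self)
```

Because each constructor accepts any base kernel, the same two functions can be stacked, for example a Gaussian kernel built on top of a polynomial one, without ever computing a feature vector.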
An important object in machine learning is minimization of expected risk in translating concepts from statistical learning theory into practical algorithms. Whether or not one has knowledge of the test patterns during training makes a significant difference in the design of learning algorithms. The difference is between minimizing test error in a specific test set versus minimizing expected error over all possible test sets. The problem of overall risk minimization is known as “transduction,” where the goal is to directly estimate the values of the unknown function for points of interest from the given data. This can be compared with the classical scheme of first using an inductive step to approximate the function then, using deduction, deriving the values of the given function for the points of interest. In the inductive/deductive method, the structural or specific test risk is minimized. With overall risk minimization provided by transduction, better generalization can be obtained. Unfortunately, transduction is very difficult to address, both computationally and conceptually.
Methods such as spectral graph theory (SGT) were introduced in the 1970s, with one of their main goals being to deduce the principal properties and structure of a graph comprising a plurality of nodes from its graph spectrum, where the graph spectrum is made up of the eigenvalues of the graph. It has recently been proposed that a graph theoretic, i.e., non-kernel based, approach to learning machines might be used to retrieve useful information from a dataset. In SGT, eigenvectors of a matrix are used to bisect, or partition, nodes in a graph corresponding to the matrix. For example, from graph spectra one can obtain information about the number of strongly connected components within the graph, the limiting distribution of a random walk on the graph and the time to reach it, etc. Applications of SGT are known and have been reported in many different disciplines, including chemistry, theoretical physics, quantum physics and communication networks.
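The use of an eigenvector to bisect a graph can be sketched as follows. The passage above does not say which matrix or eigenvector is used; as an assumption, this example takes the graph Laplacian (degree matrix minus adjacency matrix) and splits the nodes by the sign of the eigenvector belonging to its second-smallest eigenvalue, approximated here by shifted power iteration. The names `laplacian`, `second_eigenvector` and `spectral_bisection`, and the six-node example graph, are hypothetical.

```python
import math


def laplacian(n, edges):
    """Graph Laplacian (degree matrix minus adjacency) of an undirected graph."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L


def second_eigenvector(L, iterations=200):
    """Approximate the eigenvector of the second-smallest Laplacian eigenvalue.

    Power iteration on (shift * I - L), with the constant vector projected
    out at every step so the iteration cannot converge to it.
    """
    n = len(L)
    shift = 2.0 * max(L[i][i] for i in range(n))  # upper bound on eigenvalues
    v = [float(i) for i in range(n)]              # deterministic start vector
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]
        w = [shift * v[i] - sum(L[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def spectral_bisection(n, edges):
    """Partition the nodes by the sign of the approximated eigenvector."""
    v = second_eigenvector(laplacian(n, edges))
    part_a = [i for i in range(n) if v[i] < 0.0]
    part_b = [i for i in range(n) if v[i] >= 0.0]
    return part_a, part_b


# Two triangles, {0, 1, 2} and {3, 4, 5}, joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part_a, part_b = spectral_bisection(6, edges)
print(part_a, part_b)
```

On this graph the sign pattern separates the two triangles, cutting only the single bridging edge. Power iteration is used solely to keep the example free of external libraries; a practical implementation would more likely call a dedicated eigenvalue solver.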
Both kernel methods and graph theory provide certain advantages in the area of information extraction and learning from data; however, because of the very different approaches used by the two methods, they have heretofore not been combined. It is an object of the present invention to exploit the advantages of both kernel methods and spectral graph theory to solve problems of machine learning, including the problem of transduction.