Numerous machine-learning approaches have been explored for recognizing patterns. Such systems have been used for a variety of applications including target recognition, speech recognition and optical character recognition.
A machine or system is said to learn if, given a few examples of a class of patterns, it can generalize from those examples to recognize other members of the class. This is similar to how people learn. For instance, a child shown a number of examples of a chair can, from those few examples, generalize the concept of a chair and so identify many different types of chair. Machine-learning approaches, which include neural networks, hidden Markov models, belief networks, support vector machines and other kernel-based machines, are ideally suited for domains characterized by the existence of large amounts of data, noisy patterns and the absence of general theories.
The majority of learning machines that have been applied to data analysis are neural networks trained using back-propagation. This is a gradient-based method in which errors in classification of training data are propagated backwards through the network to adjust the weights and biases of the network elements until a mean squared error is minimized. A significant drawback of back-propagation neural networks is that the empirical risk function may have many local minima, which can easily hide the optimal solution. Standard optimization procedures employed by back-propagation neural networks may converge to a minimum, but the neural network method cannot guarantee that even a localized minimum is attained, much less the desired global minimum. The quality of the solution obtained from a neural network depends on many factors, and particularly on the skill of the practitioner implementing the neural network. Even seemingly benign factors, such as the random selection of initial weights, can lead to poor results. Furthermore, the convergence of the gradient-based method used in neural network learning is inherently slow. A further drawback is that the sigmoid function typically used as the transfer function between the inputs and outputs of each neuron in the network has a scaling factor that, unless carefully chosen, may significantly affect the quality of approximation. Possibly the largest limiting factor of neural networks as related to knowledge discovery is the “curse of dimensionality” associated with the disproportionate growth in required computational time and power for each additional feature or dimension in the training data.
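The sensitivity of gradient-based training to the initial weights can be illustrated with a minimal sketch. The one-parameter function below is an assumed stand-in for an empirical risk with two minima; it is not a neural network, and the names `loss`, `grad` and `gradient_descent`, the learning rate and the step count are all hypothetical choices made for illustration.

```python
def loss(w):
    # Toy non-convex "risk" with one shallow and one deep minimum
    # (an assumed stand-in, not a real network loss).
    return w**4 - 3.0 * w**2 + w


def grad(w):
    # Analytic derivative of the toy loss above.
    return 4.0 * w**3 - 6.0 * w + 1.0


def gradient_descent(w0, lr=0.01, steps=500):
    # Plain fixed-step gradient descent from the starting point w0.
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w


# Two different initial values end in two different minima.
w_right = gradient_descent(2.0)   # settles in the shallower minimum
w_left = gradient_descent(-2.0)   # settles in the deeper minimum

print(w_right, loss(w_right))
print(w_left, loss(w_left))
```

Under these assumptions, the run started at 2.0 stops in a minimum whose loss is higher than that of the minimum found from -2.0, although both runs use the same procedure and step size.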
Largely because of these shortcomings of neural networks, more recent work on machine learning has tended to focus on kernel methods. Kernel methods, based on statistical learning theory, are used for their conceptual simplicity as well as their remarkable performance. Support vector machines, kernel PCA (principal component analysis), kernel Gram-Schmidt, kernel Fisher discriminant, Bayes point machines, and Gaussian processes are just a few of the algorithms that make use of kernels for problems of classification, regression, density estimation and clustering. Kernel machines can operate in extremely rich feature spaces with low computational cost, in some cases accessing spaces that would be inaccessible to standard systems, e.g., gradient-based neural networks, due to their high dimensionality.
Kernel methods typically operate by mapping data into a high dimensional feature space, and then applying one of many available general-purpose algorithms suitable for work in conjunction with kernels. The kernel virtually maps data into a feature space so that the relative positions of the data in feature space can be used as the means for evaluating, e.g., classifying, the data. The degree of clustering achieved in the feature space, and the relation between the clusters and the labeling to be learned, should be captured by the kernel.
Kernel methods exploit information about pairwise similarity between data points. “Similarity” may be defined as the inner product between two points in a suitable feature space, information that can be obtained with little computational cost. The mapping into feature space may be achieved in an implicit way, i.e., the algorithms are rewritten to need only inner product information between input points. The inner product may then be replaced with a generalized inner product, or “kernel function”. This function returns the value of an inner product between feature vectors representing images of the inputs in some feature space.
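The equivalence between a kernel function and an inner product in feature space can be checked numerically. The sketch below assumes a homogeneous quadratic kernel on two-dimensional inputs, chosen only because its feature map is small enough to write out; the names `phi`, `dot` and `poly_kernel` and the sample points are hypothetical.

```python
import math


def dot(u, v):
    # Ordinary inner product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    # Explicit feature map for the quadratic kernel on 2-D inputs
    # (illustrative choice): (x1^2, sqrt(2)*x1*x2, x2^2).
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    # Kernel function k(x, z) = (x . z)^2, evaluated in input space only.
    return dot(x, z) ** 2


x = (1.0, 2.0)
z = (3.0, -1.0)

explicit = dot(phi(x), phi(z))   # map first, then take the inner product
implicit = poly_kernel(x, z)     # never forms the feature vectors

print(explicit, implicit)
```

Both quantities agree up to floating-point rounding, while the kernel evaluation never constructs the feature vectors. For higher-degree kernels or higher-dimensional inputs the explicit map grows quickly, whereas the cost of the kernel evaluation stays close to that of a single inner product in input space.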
While the kernel machine learning module is general purpose, the kernel itself is problem specific. It is the kernel that makes it possible to effectively work in very rich feature spaces, provided the inner products can be computed. By developing algorithms that use only the inner products, it is possible to avoid the need to compute the feature vector for a given input. Each application of a kernel machine, therefore, typically requires developing specific new algorithms to make the learning module work.
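As one illustration of an algorithm that uses only inner products, the sketch below writes a perceptron in dual form, so that the training data enter solely through kernel evaluations. The kernel perceptron is used here as a familiar example and is not taken from this document; the quadratic kernel, the four-point XOR-style data set and all function names are assumptions made for illustration.

```python
def quad_kernel(x, z):
    # Homogeneous quadratic kernel k(x, z) = (x . z)^2 (assumed choice).
    s = sum(a * b for a, b in zip(x, z))
    return s * s


def decision(x, X, y, alpha, kernel):
    # Dual-form score: a weighted sum of kernel evaluations against the
    # training points; no feature vector is ever computed.
    return sum(a * yj * kernel(xj, x) for a, yj, xj in zip(alpha, y, X))


def train_kernel_perceptron(X, y, kernel, epochs=10):
    # alpha[i] counts how often training point i was misclassified.
    alpha = [0] * len(X)
    for _ in range(epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            if yi * decision(xi, X, y, alpha, kernel) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break  # every training point is classified correctly
    return alpha


def predict(x, X, y, alpha, kernel):
    return 1 if decision(x, X, y, alpha, kernel) > 0 else -1


# XOR-style labels: not linearly separable in the 2-D input space.
X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
y = [-1, 1, 1, -1]

alpha = train_kernel_perceptron(X, y, quad_kernel)
predictions = [predict(x, X, y, alpha, quad_kernel) for x in X]
print(alpha, predictions)
```

Replacing `quad_kernel` with a different kernel changes the feature space without changing the learning routine, which reflects the division between a general-purpose learning module and a problem-specific kernel.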
A machine learning method in which the quality of the solution does not depend on the user's prior experience, that is general purpose, and that does not need application-specific algorithms would be of great use in a variety of recognition and classification problems including, but not limited to, automatic pattern and target recognition, including biometric face recognition, video tagging and video searching applications.