Many problems in information processing involve the selection or ranking of items in a large data set. For example, a search engine that locates documents meeting a search query must often select items from a large result set of documents to display to the user, and often must rank those documents based on their relevance or other criteria. Similar exercises of selecting a set of items from a larger set of possible items are undertaken in other fields such as weather pattern prediction, analyzing commercial markets, and the like. Some complex mathematical models and processes have been developed to perform this analysis.
One such set of processes, Bayesian Gaussian processes, provides a probabilistic kernel approach to supervised learning tasks. The advantage of Gaussian process (GP) models over non-Bayesian kernel methods, such as support vector machines, comes from the explicit probabilistic formulation, which yields predictive distributions for test instances and allows standard Bayesian techniques to be used for model selection. The cost of training GP models is O(n^3), where n is the number of training instances, which results in a huge computational cost for large data sets. Furthermore, when predicting a test case, a GP model requires O(n) cost for computing the mean and O(n^2) cost for computing the variance. These heavy scaling properties obstruct the use of GPs in large-scale problems.
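The scaling described above can be illustrated with a minimal sketch of exact GP regression. The squared-exponential kernel, the noise level, and the function names `rbf_kernel`, `gp_fit`, and `gp_predict` are assumptions chosen for illustration and are not taken from the processes discussed here.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between rows of A and rows of B (assumed kernel)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_fit(X, y, noise=1e-2):
    """Training: factorising the n x n kernel matrix is the O(n^3) step."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)                          # O(n^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return L, alpha

def gp_predict(X, L, alpha, x_star):
    """Predictive mean and latent variance for one test input x_star."""
    k_star = rbf_kernel(X, x_star[None, :])[:, 0]      # n kernel evaluations
    mean = k_star @ alpha                              # O(n) dot product
    # A triangular solver would do this in O(n^2); np.linalg.solve is a
    # general solver, used here only to keep the sketch NumPy-only.
    v = np.linalg.solve(L, k_star)
    prior_var = rbf_kernel(x_star[None, :], x_star[None, :])[0, 0]
    return mean, prior_var - v @ v
```

Once `L` and `alpha` have been computed, each additional test case costs only the mean and variance steps, which is why the training factorisation dominates for large n.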
Sparse GP models bring down the complexity of training as well as testing. The Nyström method has been applied to calculate a reduced-rank approximation of the original n×n kernel matrix. One on-line algorithm maintains a sparse representation of the GP models. Another algorithm uses a forward selection scheme to approximate the log posterior probability. Another fast and greedy selection method builds sparse GP regression models. All of these attempt to select an informative subset of the training instances for the predictive model. This subset is usually referred to as the set of basis vectors, denoted as I. The maximal size of I is usually limited by a value d_max. Because d_max << n, the sparseness greatly alleviates the computational burden in both training and prediction of the GP models. The performance of the resulting sparse GP models depends on the criterion used in the basis vector selection.
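As a rough illustration of the reduced-rank idea, the following sketch builds a Nyström approximation of the kernel matrix from a basis set I. The kernel, the jitter term, and the function name `nystrom_factor` are hypothetical, and the basis indices are simply supplied by the caller; no selection criterion from the methods mentioned above is implemented.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between rows of A and rows of B (assumed kernel)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def nystrom_factor(X, basis_idx, jitter=1e-6):
    """Return F (n x d) with F @ F.T approximating the n x n kernel matrix.

    basis_idx holds the indices of the basis vectors I, with d = len(basis_idx)
    playing the role of d_max. Only an n x d slice of the kernel is evaluated.
    """
    K_nI = rbf_kernel(X, X[basis_idx])                          # n x d
    K_II = K_nI[basis_idx] + jitter * np.eye(len(basis_idx))    # d x d
    L = np.linalg.cholesky(K_II)                                # O(d^3)
    # F = K_nI L^{-T}, so that F F^T = K_nI K_II^{-1} K_In.
    return np.linalg.solve(L, K_nI.T).T
```

The factor has rank at most d, so downstream linear algebra can work with n×d matrices rather than the full n×n kernel matrix. How well the approximation behaves depends on which instances are placed in I.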
It would be desirable to provide a system and method for greedy forward selection for sparse GP models. Accordingly, there is a need for a system and method that yield better generalization performance while essentially not affecting algorithm complexity. The preferred embodiments of the system and method described herein clearly address this and other needs.