This invention relates generally to data modeling and analysis, and more particularly to a relevance vector machine for such data modeling and analysis.
Data modeling has become an important tool in solving complex and large real world computerizable problems. Applications of data modeling include data compression, density estimation and data visualization. A data modeling technique used for these and other applications is probabilistic modeling. It has proven to be a popular technique for data modeling applications such as speech recognition, vision, handwriting recognition, information retrieval and intelligent interfaces. One framework for developing such applications involves the representation of probability distributions as directed acyclic graphs, which are also known as Bayesian networks, belief networks, and probabilistic independence networks, among other terms.
In probabilistic modeling, a training data set is usually given that includes input vectors $\{x_n\}_{n=1}^{N}$ along with a set of corresponding targets $\{t_n\}_{n=1}^{N}$,
the latter of which can be real values, in the case of regression analysis, or class labels, in the case of classification analysis. From this training set, one attempts to infer a model of p(t|x), with the object of making accurate predictions of t for new, unlabelled examples of x. Generally, the principal challenge is to find the appropriate complexity of this model. Scoring alternative models by training set accuracy alone is usually undesirable, since increasing the model complexity, while reducing the training set error, can easily lead to over-fitting and poor generalization. A more robust approach is to introduce a prior distribution over models, which is used in conjunction with the information supplied by the training data to infer the prediction model. This prior distribution, also referred to as a prior, can be explicit, such as in a Bayesian framework, or can be implicit in other approaches.
One method for classification, which has also been extended to regression, is known as the support vector machine (SVM). Although it does not estimate p(t|x), it makes predictions based on a discriminant function of the form $y(x) = \sum_{n=1}^{N} w_n K(x, x_n) + w_0$,
where $\{w_n\}$ are the model weights and $K(\cdot,\cdot)$ is a kernel function. A feature of the SVM is that its cost function attempts to minimize the number of errors made on the training set while simultaneously maximizing the margin between the two classes, in the feature space implicitly defined by the kernel. This maximum-margin principle is an appealing prior for classification, and ultimately drives many of the weights to zero, resulting in a sparse kernel classifier where the non-zero weights are associated with $x_n$ that are either on the margin or lie on the wrong side of it. Model complexity is thus constrained such that only these support vectors determine the decision function. In practice, in addition to fitting the model to the training data, it is also necessary to estimate the parameters (usually denoted C) that regulate the trade-off between the training errors and the size of the margin, which may entail additional cross-validation.
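As a minimal sketch of how a discriminant function of this form is evaluated, the following Python fragment computes a weighted sum of kernel values plus a bias. The Gaussian kernel, the `width` parameter, the function names, and the toy weights are all illustrative assumptions; the text above does not prescribe a particular kernel or any numerical values.

```python
import math

def gaussian_kernel(x, z, width=1.0):
    """One possible kernel K(x, z); the Gaussian form is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / width ** 2)

def discriminant(x, vectors, weights, bias, kernel=gaussian_kernel):
    """Evaluate y(x) = sum_n w_n * K(x, x_n) + w_0 over the retained vectors."""
    return sum(w * kernel(x, xn) for w, xn in zip(weights, vectors)) + bias

# Hypothetical toy model: two retained vectors with opposite-signed weights.
vectors = [(0.0, 0.0), (2.0, 2.0)]
weights = [1.0, -1.0]
bias = 0.0

# The sign of y(x) gives the predicted class.
label = 1 if discriminant((0.1, 0.0), vectors, weights, bias) >= 0 else -1
```

Only vectors with non-zero weights need to be stored and evaluated, which is why sparsity in the weights translates directly into cheaper prediction.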
A general disadvantage of the SVM is that it typically utilizes many kernel functions, and may not yield test performance as good as desired. Furthermore, the SVM relies on parameters (i.e., those denoted C), which add unwanted complexity to the model. For these and other reasons, there is a need for the present invention.
The invention relates to a relevance vector machine (RVM). The RVM is a probabilistic basis model of the same functional form as the SVM. Sparsity is achieved through a Bayesian treatment, in which a prior is introduced over the weights, governed by a set of what are referred to as hyperparameters. One such hyperparameter is associated with each weight, and their most probable values are iteratively estimated from the data. In practice, the posterior distribution of many of the weights is sharply peaked around zero.
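For concreteness, one common way to write such a prior, taken from the standard RVM formulation rather than from the text above, is a zero-mean Gaussian over each weight with its own precision hyperparameter. The symbol $\alpha_n$ and the Gaussian form are assumptions here:

```latex
p(w \mid \alpha) = \prod_{n=0}^{N} \mathcal{N}\!\left(w_n \mid 0,\ \alpha_n^{-1}\right)
```

Under this form, a large $\alpha_n$ concentrates the prior on $w_n$, and hence its posterior, around zero, which is how an individual weight comes to be effectively switched off.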
In one embodiment, a computer-implemented method includes inputting a data set to be modeled, and determining a relevance vector learning machine to obtain a posterior distribution over the learning machine parameters given the data set (also referred to as "the posterior"). This includes determining a marginal likelihood for the hyperparameters, and iteratively re-estimating the hyperparameters to optimize the marginal likelihood. For the case of regression analysis, the marginal likelihood is determined directly. For the case of classification analysis, the marginal likelihood is approximated through the additional determination of the most probable weights for the given hyperparameters, and the Hessian at that most probable weight value. This approximation is also iteratively redetermined as the hyperparameters are updated. At least the posterior distribution for the weights given the data set is then output by the method.
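The regression case of this iterative re-estimation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method as claimed: it assumes a zero-mean Gaussian prior on each weight with precision `alpha[i]`, a Gaussian basis function, a fixed and known noise variance, and the re-estimation rule `alpha_i = gamma_i / mu_i**2` with `gamma_i = 1 - alpha_i * Sigma_ii` from the standard RVM literature. All function and parameter names are hypothetical.

```python
import math

def rbf_basis(a, b, width):
    """Gaussian basis function on scalar inputs (an assumed choice)."""
    return math.exp(-((a - b) / width) ** 2)

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_rvm_regression(xs, ts, width=1.0, noise_var=0.01,
                       iterations=100, alpha_cap=1e6):
    """Return (alpha, mu, sigma): hyperparameters and the weight posterior."""
    n = len(xs)
    # Design matrix: a bias column (for w_0) plus one basis function per input.
    phi = [[1.0] + [rbf_basis(x, xn, width) for xn in xs] for x in xs]
    m = n + 1
    beta = 1.0 / noise_var  # noise precision, held fixed in this sketch
    gram = [[sum(phi[k][i] * phi[k][j] for k in range(n)) for j in range(m)]
            for i in range(m)]
    phi_t = [sum(phi[k][i] * ts[k] for k in range(n)) for i in range(m)]

    def posterior(alpha):
        # Gaussian posterior over the weights for the given hyperparameters.
        precision = [[beta * gram[i][j] + (alpha[i] if i == j else 0.0)
                      for j in range(m)] for i in range(m)]
        sigma = invert(precision)
        mu = [beta * sum(sigma[i][j] * phi_t[j] for j in range(m))
              for i in range(m)]
        return mu, sigma

    alpha = [1.0] * m
    for _ in range(iterations):
        mu, sigma = posterior(alpha)
        for i in range(m):
            if alpha[i] >= alpha_cap:
                continue  # weight is effectively pruned; leave it pruned
            gamma = min(max(1.0 - alpha[i] * sigma[i][i], 1e-12), 1.0)
            alpha[i] = min(gamma / (mu[i] ** 2 + 1e-12), alpha_cap)
    mu, sigma = posterior(alpha)  # posterior for the final hyperparameters
    return alpha, mu, sigma

# Toy usage on a hypothetical data set.
xs = [float(i) for i in range(-5, 6)]
ts = [math.sin(x) for x in xs]
alpha, mu, sigma = fit_rvm_regression(xs, ts)
relevant = [i for i, a in enumerate(alpha) if a < 1e6]
print(len(relevant), "of", len(alpha), "basis functions retained")
```

Capping `alpha` instead of deleting pruned basis functions keeps the sketch short; a practical implementation would more likely remove those columns from the design matrix, and would also re-estimate the noise variance. The classification case, which requires the most probable weights and the Hessian at each step, is not covered here.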
The RVM has advantages not found in prior art approaches such as the SVM. As compared to the SVM, for example, the non-zero weights in the RVM have been observed not to be associated with examples close to the decision boundary, but rather appear to represent more prototypical examples of classes. These examples are termed relevance vectors. Generally, the trained RVM utilizes many fewer basis functions than the corresponding SVM, and typically exhibits superior test performance. Furthermore, no additional validation of parameters (such as C) is necessary to specify the model, save those associated with the basis.