Data modeling has become an important tool in solving complex and large real-world computerizable problems. Applications of data modeling include data compression, density estimation and data visualization. A data modeling technique used for these and other applications is probabilistic modeling. It has proven to be a popular technique for data modeling applications such as speech recognition, vision, handwriting recognition, information retrieval and intelligent interfaces. One framework for developing such applications involves the representation of probability distributions as directed acyclic graphs, which are also known as Bayesian networks, belief networks, and probabilistic independence networks, among other terms.
In probabilistic modeling, a training data set is usually given that includes input vectors $\{x_n\}_{n=1}^{N}$ along with a set of corresponding targets $\{t_n\}_{n=1}^{N}$, the latter of which can be real values, in the case of regression analysis, or class labels, in the case of classification analysis. From this training set, one attempts to infer a model of p(t|x), with the object of making accurate predictions of t for new, unlabelled examples of x. Generally, the principal challenge is to find the appropriate complexity of this model. Scoring alternative models by training set accuracy alone is usually undesirable, since increasing the model complexity, while reducing the training set error, can easily lead to over-fitting and poor generalization. A more robust approach is to introduce a prior distribution over models, which is used in conjunction with the information supplied by the training data to infer the prediction model. This prior distribution, also referred to as a prior, can be explicit, such as in a Bayesian framework, or can be implicit in other approaches.
One method for classification, which has also been extended to regression, is known as the support vector machine (SVM). Although it does not estimate p(t|x), it makes predictions based on a discriminant function of the form

$$y(x) = \sum_{n=1}^{N} w_n K(x, x_n) + w_0,$$

where $\{w_n\}$ are the model weights and K(•,•) is a kernel function. A feature of the SVM is that its cost function attempts to minimize the number of errors made on the training set while simultaneously maximizing the margin between the two classes, in the feature space implicitly defined by the kernel. This maximum-margin principle is an appealing prior for classification, and ultimately drives many of the weights to zero, resulting in a sparse kernel classifier where the non-zero weights are associated with those $x_n$ that either lie on the margin or lie on the wrong side of it. Model complexity is thus constrained such that only these support vectors determine the decision function. In practice, in addition to fitting the model to the training data, it is also necessary to estimate the parameter (usually denoted C) which regulates the trade-off between the training errors and the size of the margin, which may entail additional cross-validation.
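As an illustration only, the discriminant function above can be sketched in a few lines of Python. The Gaussian kernel, its width, the function names, and the toy inputs and weights are all assumptions made for this sketch; none of them is taken from a trained SVM or prescribed by the text. The sketch shows how zero weights drop the corresponding training vectors from the sum, so that only the remaining vectors influence the prediction.

```python
import math

def gaussian_kernel(x, z, width=1.0):
    """Gaussian (RBF) kernel K(x, z); kernel choice and width are assumptions."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def discriminant(x, train_inputs, weights, bias, kernel=gaussian_kernel):
    """Evaluate y(x) = sum_n w_n K(x, x_n) + w_0, skipping zero weights."""
    return bias + sum(
        w * kernel(x, x_n)
        for w, x_n in zip(weights, train_inputs)
        if w != 0.0  # vectors with zero weight do not affect the prediction
    )

# Hypothetical toy values, not the output of any training procedure.
train_inputs = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
weights = [1.0, 0.0, -1.0]  # the middle vector has zero weight
bias = 0.0                  # plays the role of w_0

y = discriminant((0.0, 0.0), train_inputs, weights, bias)
label = 1 if y >= 0 else -1  # the sign of y(x) gives the predicted class
```

In this toy setting only the first and third training vectors contribute, which mirrors the sparsity described above: the cost of evaluating the classifier depends on the number of non-zero weights rather than on N.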
A disadvantage with the SVM as a general matter is that it utilizes many kernel functions, and may not yield as optimal test performance as may be desired. Furthermore, the SVM utilizes parameters (i.e., those denoted C), which add unwanted complications to the model. To address these concerns, the copending and coassigned patent application entitled “Relevance Vector Machine,” filed on Sep. 4, 1999, and assigned Ser. No. 09/391,093, describes a Relevance Vector Machine (RVM) that utilizes a functional form that is equivalent to the SVM, but which is a probabilistic model. It achieves comparable recognition accuracy to the SVM, but advantageously provides a full predictive distribution, and requires substantially fewer kernel functions. As described in this prior application, the RVM relied on the use of type II maximum likelihood, referred to as the evidence framework, to generate point estimates of the hyperparameters that govern model sparsity. However, because analysts desire to have different approaches, techniques and tools to solve a given model, there is a motivation for the present invention. Furthermore, the approach described here provides a closer approximation to a fully Bayesian treatment than has been possible previously, and this is expected to be advantageous for problems involving data sets of limited size.