This invention relates generally to the field of classifiers and in particular to a trainer for classifiers employing maximum margin back-propagation with probability outputs.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in the drawing hereto: Copyright © 1999, Microsoft Corporation, All Rights Reserved.
The problem of determining how to classify things is well known. Humans have an innate ability to classify things quickly. Many intelligence tests are based on this ability. For a human, it is quite simple to determine whether a coin is a penny, a nickel or other denomination. We can tell whether an email is junk mail or spam, which we don't want to read, or an email from a friend or customer, which we need to read quickly. However, many humans do not have a comprehensive knowledge of how to determine whether the word “affect” or “effect” should be used when writing a sentence. There have been many attempts to program computers to classify things. The math involved can get quite complex. It was difficult early on to get a computer to classify coins, but neural networks trained with many sets of inputs could do a reasonably good job. The training for such simple classifiers was fairly straightforward due to the simplicity of the classifications.
However, classifying things as junk email or spam is much more difficult, and involves the use of complex mathematical functions. The rest of this background section describes some prior attempts at training classifiers. They involved the selection and modification of parameters for functions that compute indications of whether or not a particular input is likely in each of multiple categories. They involve the use of training sets, which are fed into classifiers, whose raw output is processed to modify the parameters. After several times through the training set, the parameters converge on what is thought to be a good set of parameters that when used in the classifier on input not in the training set, produces good raw output representative of the proper categories.
This raw output, however, might not be directly related to the percentage likelihood that a particular category is the right one. This has led to the use of a further function, called a probability transducer, that takes the raw output and produces a probability for each category for the given input. Many problems remain with such systems. It is difficult to produce one that provides good performance and does not overfit the training data. Overfitting the training data can produce wrong results unless a very large amount of training data is provided. This, however, severely impacts performance.
A standard back-propagation architecture is shown in prior art FIG. 1. Functions used in this description are indicated by numbers in parentheses for ease of reference. Back-propagation is well known in the art. It is described fully in section 4.8 of the book “Neural Networks for Pattern Recognition” by Christopher M. Bishop, published by Oxford University Press in 1995. The purpose of a back-propagation architecture is to train a classifier 105 (and associated probability transducer 125) to produce probabilities of category membership given an input. The classifier 105 is trained by example. A training set comprising training inputs 100 is provided to the classifier 105. The classifier 105 has parameters 110, which act as adjustable coefficients for the functions used to calculate raw outputs 120 indicative of class membership. The parameters 110 are adjusted so that the final probability outputs 135 are close to a set of desired outputs 145 for the training set. Given enough training inputs 100 and associated desired outputs 145, and a sensible choice of classifier 105, the overall system will generalize correctly to inputs that are not identical to those found in training inputs 100. In other words, it will produce a high probability that the input belongs in the correct category.
The functional form of classifier 105 can take many forms. The output 120 of the classifier 105 can be a single value, if the classifier is designed to produce a single category probability. Alternatively, the output 120 of the classifier 105 can be a plurality of values, if the classifier is designed to produce probabilities for multiple categories. For purposes of discussion, the output 120 of a classifier 105 is denoted by a vector $y_i$, regardless of whether output 120 is a single output or a plurality of outputs. Similarly, one example of training input is denoted by a vector $x_j$, regardless of the true dimensionality of the training input.
Back-propagation is applicable to any classifier 105 that has a differentiable transfer function that is parameterized by one or more parameters 110. One typical functional form of classifier 105 is linear:
$$y_i = \sum_j w_{ij}\, x_j - b_i \tag{1}$$
where $w_{ij}$ and $b_i$ are the parameters 110, $j$ indexes the components of the input vector (while $i$ indexes the outputs), and there are no internal state variables 115. Another typical functional form of classifier 105 is a linear superposition of basis functions:
$$y_i = \sum_k w_{ik}\, \phi_k(\vec{x}, \vec{\theta}_k) - b_i \tag{2}$$
where the $w_{ik}$, $b_i$, and $\vec{\theta}_k$ are the parameters 110, and the $\phi_k$ are the internal state variables 115. The parameterized functions $\phi_k$ can take on many different forms, such as Gaussians, logistic functions, or multi-layer perceptrons, as is well known in the art (see chapters 4 and 5 of the Bishop book). Other possible forms (e.g., convolutional neural networks) are known in the art.
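The two functional forms (1) and (2) can be illustrated with a minimal Python sketch. The function names, the choice of a Gaussian for the basis functions, and the fixed `width` are assumptions made for illustration only; the text above allows many other forms.

```python
import math

def linear_outputs(w, b, x):
    """Equation (1): y_i = sum_j w[i][j] * x[j] - b[i]."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)) - b_i
            for w_i, b_i in zip(w, b)]

def gaussian_basis(x, center, width=1.0):
    """One possible phi_k: a Gaussian bump; `center` plays the role of theta_k."""
    sq_dist = sum((x_j - c_j) ** 2 for x_j, c_j in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def basis_outputs(w, b, centers, x):
    """Equation (2): y_i = sum_k w[i][k] * phi_k(x, theta_k) - b[i]."""
    phi = [gaussian_basis(x, c) for c in centers]  # internal state 115
    return linear_outputs(w, b, phi)               # same linear form, applied to phi

x = [1.0, 2.0]
print(linear_outputs([[0.5, -1.0], [2.0, 0.0]], [0.25, 1.0], x))
print(basis_outputs([[1.0, 1.0]], [0.0], [[1.0, 2.0], [0.0, 0.0]], x))
```

Form (2) reuses form (1) on the basis-function activations, which is why the same back-propagation machinery applies to both.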
The output ultimately desired from the classifier 105 is a probability of category membership, shown as probability outputs 135. These probability outputs are denoted as $p_i$, again regardless of whether there is one probability value or a plurality of probability values. Typically, a probability transducer 125 is applied to the output 120 of the classifier 105 to convert it into a probability. When there is only one output or when the outputs are not mutually exclusive categories, a sigmoidal (or logit) function is typically used:
$$p_i = \frac{1}{1 + e^{A_i y_i + B_i}} \tag{3}$$
where $A_i$ is typically fixed to −1, while $B_i$ is typically fixed to 0. The parameters $A_i$ and $B_i$ are the fixed parameters 130. When a logit probability transducer is coupled to a linear classifier, the overall system performs a classical logistic regression. When the classifier 105 is deciding between a plurality of mutually exclusive categories, a softmax or winner-take-all function is often used:
$$p_i = \frac{e^{A_i y_i + B_i}}{\sum_j e^{A_j y_j + B_j}} \tag{4}$$
where $A_i$ is typically fixed to +1 and $B_i$ is typically fixed to 0. Again, the parameters $A_i$ and $B_i$ are the fixed parameters 130.
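The probability transducers (3) and (4) can be sketched as follows. As a simplifying assumption, a single scalar `a` and `b` is shared by all outputs, whereas the equations allow a separate $A_i$ and $B_i$ per output; the function names are illustrative only.

```python
import math

def sigmoid_transducer(y, a=-1.0, b=0.0):
    """Equation (3): p_i = 1 / (1 + exp(A * y_i + B)), applied to each output.

    No overflow guard is included; very large A * y_i + B would overflow exp.
    """
    return [1.0 / (1.0 + math.exp(a * y_i + b)) for y_i in y]

def softmax_transducer(y, a=1.0, b=0.0):
    """Equation (4): p_i = exp(A * y_i + B) / sum_j exp(A * y_j + B)."""
    z = [a * y_i + b for y_i in y]
    z_max = max(z)  # subtracting the max leaves the ratio unchanged and avoids overflow
    e = [math.exp(z_i - z_max) for z_i in z]
    total = sum(e)
    return [e_i / total for e_i in e]

print(sigmoid_transducer([-2.0, 0.0, 2.0]))
print(softmax_transducer([1.0, 2.0, 3.0]))
```

With the typical settings, both transducers are increasing in the raw output: the sigmoid because $A = -1$ appears in the denominator, and the softmax because $A = +1$ appears in the numerator.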
The back-propagation architecture will attempt to adjust the parameters 110 of the classifier to cause the probability outputs 135 to be close to the desired outputs 145. The values of the desired outputs 145 are denoted as $t_i$ and are typically (although not always) either 0 or 1, depending on whether a specific training input 100 belongs to the category or not.
The closeness of the probability output 135 to the desired output 145 is measured by an error metric 140. The error metric computes an error function for one particular training input, whose corresponding parameters and values are denoted with a superscript $(n)$. The error metric computes a function $E^{(n)}(p_i^{(n)}, t_i^{(n)})$ which is at a minimum when $p_i^{(n)}$ matches $t_i^{(n)}$. The output of the error metric is actually an error gradient 150:
$$g_i^{(n)} = \frac{\partial E^{(n)}}{\partial p_i^{(n)}} \tag{5}$$
There are many error metrics used in the prior art. For example, the squared error can be used:
$$E^{(n)} = \frac{1}{2} \sum_i \left( p_i^{(n)} - t_i^{(n)} \right)^2 \tag{6}$$
or the cross-entropy score for use with probability transducer (3):
$$E^{(n)} = -\sum_i \left[ t_i^{(n)} \log\left(p_i^{(n)}\right) + \left(1 - t_i^{(n)}\right) \log\left(1 - p_i^{(n)}\right) \right] \tag{7}$$
or the entropy score for use with probability transducer (4):
$$E^{(n)} = -\sum_i t_i^{(n)} \log\left(p_i^{(n)}\right) \tag{8}$$
The advantage of combining error metric (7) with probability transducer (3), or error metric (8) with probability transducer (4), is that the output of the probability transducer is trained to be a true posterior probability estimate of the category given the input data.
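The error metrics (6), (7) and (8) for a single training input can be sketched in Python as shown here. The function names are illustrative, and the sketch assumes every probability lies strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def squared_error(p, t):
    """Equation (6): half the sum of squared differences."""
    return 0.5 * sum((p_i - t_i) ** 2 for p_i, t_i in zip(p, t))

def squared_error_gradient(p, t):
    """Error gradient (5) for metric (6): dE/dp_i = p_i - t_i."""
    return [p_i - t_i for p_i, t_i in zip(p, t)]

def cross_entropy(p, t):
    """Equation (7), intended for use with the sigmoid transducer (3)."""
    return -sum(t_i * math.log(p_i) + (1 - t_i) * math.log(1 - p_i)
                for p_i, t_i in zip(p, t))

def entropy_score(p, t):
    """Equation (8), intended for use with the softmax transducer (4)."""
    return -sum(t_i * math.log(p_i) for p_i, t_i in zip(p, t))

print(squared_error([0.5, 0.5], [1, 0]))
print(cross_entropy([0.9, 0.2], [1, 0]))
print(entropy_score([0.25, 0.75], [0, 1]))
```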
Previously, researchers such as Sontag and Sussmann, in “Back propagation separates where Perceptrons do” in the journal Neural Networks, volume 4, pages 243-249 (1991), have suggested using a margin error metric as error metric 140. The gradient (5) of a margin error metric is shown in FIG. 2. A margin error metric is defined as follows: for positive examples in the category, it has a negative gradient below an output level $M_+$ and an exactly zero gradient above $M_+$, while for negative examples out of the category, it has a positive gradient above an output level $M_-$ and an exactly zero gradient below $M_-$. The threshold $M_+$ must be strictly greater than the threshold $M_-$. A margin error metric was originally proposed by Sontag and Sussmann to ensure that data sets that are linearly separable would be cleanly separated by a back-propagation algorithm. The disadvantage of such a margin error metric is that the outputs are no longer probabilities.
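A gradient with the sign pattern described for a margin error metric can be sketched as follows. The text fixes only the signs and the zero regions, so the constant magnitude of 1, the default thresholds of +1 and −1, the treatment of an output exactly at a threshold, and the function name are all assumptions made for illustration.

```python
def margin_error_gradient(y, is_positive, m_plus=1.0, m_minus=-1.0):
    """Gradient of a margin error metric with respect to the output level y.

    Positive examples: negative gradient below M+, exactly zero at or above it.
    Negative examples: positive gradient above M-, exactly zero at or below it.
    """
    if m_plus <= m_minus:
        raise ValueError("M+ must be strictly greater than M-")
    if is_positive:
        return -1.0 if y < m_plus else 0.0
    return 1.0 if y > m_minus else 0.0

for y in (-1.5, 0.0, 1.5):
    print(y, margin_error_gradient(y, True), margin_error_gradient(y, False))
```

Because the gradient is exactly zero once an example is on the correct side of its threshold, such examples stop influencing the parameter updates.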
The gradient computation 155 computes the partial derivative of the error $E$ with respect to all of the parameters of the classifier by using the chain rule. The gradient computation 155 uses the gradient of the error 150, the probability outputs 135, the parameters 110, and any internal state 115 of the classifier 105. The computation of the gradient with back-propagation is well known in the art: see section 4.8 of Bishop's book. The output of the gradient computation is denoted as $G_i^{(n)}$, where $i$ is the index of the ith free parameter 110 of the classifier 105 and $n$ is the index into the training set.
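As a worked illustration of the chain rule, the following sketch computes the gradient for one specific combination: a single-output linear classifier (1), the sigmoid transducer (3) with $A = -1$ and $B = 0$, and the cross-entropy metric (7). The combination and the function names are assumptions chosen because the derivatives are short; other classifiers follow the same pattern with more terms.

```python
import math

def forward(w, b, x):
    y = sum(w_j * x_j for w_j, x_j in zip(w, x)) - b  # (1), single output
    p = 1.0 / (1.0 + math.exp(-y))                    # (3) with A = -1, B = 0
    return y, p

def error(w, b, x, t):
    _, p = forward(w, b, x)
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))  # (7)

def gradient(w, b, x, t):
    _, p = forward(w, b, x)
    g = (p - t) / (p * (1.0 - p))         # error gradient (5): dE/dp
    dp_dy = p * (1.0 - p)                 # derivative of the sigmoid w.r.t. y
    dE_dy = g * dp_dy                     # simplifies to p - t
    grad_w = [dE_dy * x_j for x_j in x]   # dy/dw_j = x_j
    grad_b = -dE_dy                       # dy/db = -1
    return grad_w, grad_b

print(gradient([0.3, -0.2], 0.1, [1.0, 2.0], 1))
```

A quick way to gain confidence in such a derivation is to compare it with a finite-difference estimate of `error` for each parameter.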
Once the $G_i^{(n)}$ are computed, the parameters 110 should be updated to reduce the error. The update rule 160 will update the parameters 110. There are any number of updating rules that cause the error to be reduced. One style of update rule 160 updates the parameters 110 every time a training input 100 is presented to classifier 105. Such update rules are called “on-line.” Another style of update rule computes the true gradient of the error over the entire training set with respect to free parameters 110, by summing the $G_i^{(n)}$ over the index $n$ of all training inputs 100, then updating the parameters after the sum is computed. Such update rules are called “batch.”
One example of an update rule is the stochastic gradient descent rule, where the parameters 110 are adjusted by a small step in the direction that will reduce the overall error. We will denote the ith parameter 110 of classifier 105 as $\gamma_i$:
$$\gamma_i \leftarrow \gamma_i - \eta\, G_i^{(n)} \tag{9}$$
The step size $\eta$ can be held constant or decreased over time, as is well known in the art. The convergence of stochastic gradient descent can be improved via momentum, as is well known in the art:
$$\beta_i \leftarrow \alpha \beta_i + (1 - \alpha)\, \eta\, G_i^{(n)}, \qquad \gamma_i \leftarrow \gamma_i - \beta_i \tag{10}$$
Either stochastic gradient descent or stochastic gradient descent with momentum can be used in either on-line or batch mode. In batch mode, the term $G_i^{(n)}$ is replaced with a sum of $G_i^{(n)}$ over all $n$. Other numerical algorithms can be used in batch mode, such as conjugate gradient, pseudo-Newton, or Levenberg-Marquardt. Chapter 7 of Bishop's book describes many possible numerical algorithms to minimize the error.
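Update rules (9) and (10), together with the batch-mode summation, can be sketched as follows. The function names, the default step size and momentum values, and the toy error $E = \frac{1}{2}(\gamma - 3)^2$ used in the demonstration are assumptions for illustration.

```python
def sgd_step(gamma, grad, eta=0.1):
    """Rule (9): gamma_i <- gamma_i - eta * G_i."""
    return [g_i - eta * G_i for g_i, G_i in zip(gamma, grad)]

def momentum_step(gamma, beta, grad, eta=0.1, alpha=0.9):
    """Rule (10): beta_i <- alpha*beta_i + (1-alpha)*eta*G_i, then gamma_i <- gamma_i - beta_i."""
    beta = [alpha * b_i + (1 - alpha) * eta * G_i for b_i, G_i in zip(beta, grad)]
    gamma = [g_i - b_i for g_i, b_i in zip(gamma, beta)]
    return gamma, beta

def batch_gradient(per_example_grads):
    """Batch mode: sum the per-example gradients G_i^(n) over all n."""
    return [sum(column) for column in zip(*per_example_grads)]

# Toy demonstration on E = 0.5 * (gamma - 3)^2, whose gradient is gamma - 3.
gamma = [0.0]
for _ in range(200):
    gamma = sgd_step(gamma, [gamma[0] - 3.0])
print(gamma)

gamma, beta = [0.0], [0.0]
for _ in range(500):
    gamma, beta = momentum_step(gamma, beta, [gamma[0] - 3.0])
print(gamma)
```

In on-line mode the step functions are called once per training input; in batch mode they are called once per pass with the output of `batch_gradient`.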
Simply minimizing the error on the training set can often lead to paradoxically poor results. The classifier 105 will work very well on inputs that are in the training set, but will work poorly with inputs that are away from the training set. This is known as overfitting: by minimizing only the error on the training set, the error off the training set is only indirectly minimized, and could be high.
There are many algorithms to avoid overfitting. One very simple algorithm is to penalize parameters 110 that have large values (see Bishop, section 9.2). This algorithm is known as weight decay. Weight decay 165 is shown in FIG. 1: it uses the current values of the parameters to modify the update rule.
The reasoning behind weight decay is that the correct classifier should be the most likely classifier given the data. According to Bayes' rule, the posterior probability of a classifier given the data is proportional to the probability of the data given the classifier (the likelihood) multiplied by the prior probability of the classifier. The error on the training set is commonly interpreted as a negative log likelihood; to account for the prior, a negative log prior must be added to the error. If the prior probability over parameters 110 is a Gaussian with mean zero, then the negative log prior is a quadratic penalty on the parameters 110. The derivative of a quadratic is linear, so a Gaussian prior over the parameters 110 adds a linear term to the update rule. With weight decay, the update rule (9) becomes:
$$\gamma_i \leftarrow (1 - \epsilon)\, \gamma_i - \eta\, G_i \tag{11}$$
while update rule (10) becomes:
$$\beta_i \leftarrow \alpha \beta_i + (1 - \alpha) \left( \eta\, G_i + \epsilon\, \gamma_i \right), \qquad \gamma_i \leftarrow \gamma_i - \beta_i \tag{12}$$
People of ordinary skill in the art will understand how to modify an update rule (such as conjugate gradient) to reflect a weight decay term.
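The weight decay variants (11) and (12) can be sketched in the same style. The function names, the default values of the step size, momentum and decay constants, and the toy error $E = \frac{1}{2}(\gamma - 3)^2$ are assumptions for illustration.

```python
def decay_sgd_step(gamma, grad, eta=0.1, eps=0.01):
    """Rule (11): gamma_i <- (1 - eps) * gamma_i - eta * G_i."""
    return [(1 - eps) * g_i - eta * G_i for g_i, G_i in zip(gamma, grad)]

def decay_momentum_step(gamma, beta, grad, eta=0.1, alpha=0.9, eps=0.01):
    """Rule (12): the decay term eps * gamma_i is folded into the momentum term."""
    beta = [alpha * b_i + (1 - alpha) * (eta * G_i + eps * g_i)
            for b_i, G_i, g_i in zip(beta, grad, gamma)]
    gamma = [g_i - b_i for g_i, b_i in zip(gamma, beta)]
    return gamma, beta

# With a zero gradient, the parameter simply shrinks toward zero.
print(decay_sgd_step([2.0], [0.0]))

# On the toy error E = 0.5 * (gamma - 3)^2, decay pulls the solution below 3.
gamma = [0.0]
for _ in range(300):
    gamma = decay_sgd_step(gamma, [gamma[0] - 3.0])
print(gamma)
```

In the toy problem the iteration settles where the decay and the error gradient balance, at $\gamma = 3\eta / (\eta + \epsilon)$, which is slightly smaller in magnitude than the unpenalized minimum at 3.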
The architecture in FIG. 1 is limited, because the generalization error for data not in the training set is not directly minimized. Section 10.1 of the book “Statistical Learning Theory” by Vladimir Vapnik (published by Wiley-Interscience in 1998) proposes using a support vector machine: a linear classifier (1) with only one output value trained via the following quadratic programming problem:
$$\min \sum_j w_j^2 + C \sum_n \xi^{(n)} \quad \text{subject to} \quad y^{(n)}\, T^{(n)} \ge 1 - \xi^{(n)}, \quad \xi^{(n)} \ge 0 \tag{13}$$
where $w_j$ is the weight of the input $x_j$ contributing to the single output, $y^{(n)}$ is the output value of the classifier when the input is the nth training example, and $T^{(n)}$ is the desired output of the classifier for the nth training example: $T^{(n)}$ is +1 for positive examples in the category and −1 for negative examples out of the category. Vapnik proposes this quadratic programming problem in order to directly minimize a bound on the error of the classifier on inputs not in the training set. When the quadratic optimization problem (13) is solved, the weights $w_j$ and the threshold $b$ define the optimal hyperplane that separates inputs in the category from inputs out of the category.
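To make the quantities in problem (13) concrete, the following sketch evaluates the objective for a given set of weights. It does not solve the quadratic program; it only computes, for fixed `w` and `b`, the smallest slack values that satisfy both constraints and the resulting objective. The function name and the toy data are assumptions for illustration.

```python
def svm_objective(w, b, inputs, targets, c=1.0):
    """Objective of (13) for fixed parameters, using the smallest feasible slacks."""
    slacks = []
    for x, t in zip(inputs, targets):
        y = sum(w_j * x_j for w_j, x_j in zip(w, x)) - b  # linear classifier (1)
        # y*T >= 1 - xi and xi >= 0 together give xi >= max(0, 1 - y*T).
        slacks.append(max(0.0, 1.0 - y * t))
    objective = sum(w_j ** 2 for w_j in w) + c * sum(slacks)
    return objective, slacks

# Targets use the +1 / -1 convention of T^(n).
obj, slacks = svm_objective([1.0], 0.0, [[2.0], [-0.5]], [1, -1])
print(obj, slacks)
```

An example whose product $y^{(n)} T^{(n)}$ is at least 1 needs no slack and contributes nothing to the second sum.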
The architecture to solve the constrained optimization problem (13) is shown in prior art FIG. 3, where the training inputs 100 are fed to a classifier 105 to produce raw outputs 120. The inputs 100, raw outputs 120, and desired outputs 145 are all provided to a quadratic programming solver 200, which updates the parameters 110.
Section 10.5 of Vapnik's book also describes extension of the quadratic programming problem (13) to non-linear support vector machines. Those extensions are limited to those non-linear classifiers of form (2) whose basis functions $\phi$ obey a set of conditions, called Mercer's conditions. Such a non-linear extension has the same architecture as the linear case, as shown in FIG. 3; only the matrix used in the quadratic programming solver 200 differs from the linear case.
Sections 7.12 and 11.11 of Vapnik's book describe a method for mapping the output of a classifier trained via constrained optimization problem (13) into probabilities, using an architecture also shown in FIG. 3. After the parameters 110 of the classifier 105 are completely determined, the raw outputs 120 of the final classifier are fed to a probability transducer 125. The probability transducer suggested by Vapnik is a linear blend of cosines. The blending coefficients are the parameters 210 of the probability transducer. These parameters are determined from statistical measurements of the raw outputs 120 and the desired outputs 145 involving Parzen window kernels (see Vapnik, sections 7.11 and 7.12).
Vapnik's support vector machine learning architecture produces classifiers that perform very well on inputs not in the training set. However, only a limited array of non-linear classifiers can be trained. Also, the number of non-linear basis functions $\phi$ used in a support vector machine tends to be much larger than the number of basis functions used in a back-propagation network, so that support vector machines are often slower at run time. Also, the probability transducer proposed by Vapnik may not yield a probability function that is monotonic with raw output, which is a desirable feature. Nor are the probabilities constrained to lie between 0 and 1 or to sum to 1 across all classes.
Prior art FIG. 4 shows another example of prior art disclosed by Denker and LeCun in the paper “Transforming Neural-Net Output Levels to Probability Distributions”, which appears in the Advances in Neural Information Processing Systems conference proceedings, volume 3, pages 853-859. Here, a classifier 105 is trained using the standard architecture shown in FIG. 1. Then, a new probability transducer is trained from a calibration set of input data, separate from the training set. Denker and LeCun then suggest that the probability transducer 125 has its parameters 210 determined by statistical measurements 225. In this case, the statistical measurements need to know the gradient and the Hessian of the error metric 140 with respect to the parameters 110. These measurements then determine the parameters of the new probability transducer, which is either a Parzen window non-parametric estimator or a softmax function (4) with parameters $A_i$ and $B_i$ determined from the statistical measurements 225.
The limitation of the Denker and LeCun method using Parzen windows is that Parzen windows take a lot of memory, may be slow at run time, and may not yield a probability output that is monotonic in the raw output. The problem with the softmax estimator with parameters derived from the statistical measurements 225 is that the assumptions used to derive these parameters (that the outputs for every category are spherical Gaussians with equal variance) may not be true; hence the parameters may not be optimal.
There is a need for a learning architecture that is as applicable as back-propagation to many classifier functional forms, while still providing the mathematical assurance of support vector machines of good performance on inputs not in the training set. There is a further need for a system that yields true posterior probabilities that are monotonic in the raw output of the classifier and that does not require a non-parametric probability transducer. There is a need for a training system that minimizes a bound on expected testing error, which means that it avoids overfitting.
A training system for a classifier utilizes both a back-propagation system to iteratively modify parameters of functions which provide raw output indications of desired categories, wherein the parameters are modified based on weight decay, and a probability determining system with further parameters that are determined during iterative training. The raw output of the back-propagation system provides either a high or low indication for each category for a given input, while the parameters for the probability determining system, combined with the raw outputs, are used to determine the percentage likelihood of each classification for a given input.
In one aspect of the invention, the back-propagation system uses a margin error metric combined with weight decay. The probability determining system uses a sigmoid to calibrate the raw outputs to probability percentages for each category.
The training system may be used with multiple applications in order to determine the probability that some item fits in one of potentially several categories, or just one category. One practical application involves identifying email as spam or junk email versus email people want to receive. Another is grammar checking, such as determining whether the word “affect” or “effect” is correct based on the preceding 100 words.
A method of training such a system involves gathering a training set of inputs and desired corresponding outputs. Classifier parameters are then initialized and an error margin is calculated with respect to the classifier parameters. A weight decay is then used to adjust the parameters. After a selected number of times through the training set, the parameters are deemed in final form, and an optimization routine is used to derive a set of probability transducer parameters for use in calculating the probable classification for each input.
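The sequence of steps above can be illustrated with a small Python sketch. Many details here are assumptions rather than anything stated in this summary: the classifier is a single-output linear classifier (1), the margin thresholds are +1 and −1 with a unit-magnitude gradient, the parameters are adjusted on-line with update rule (11), and the sigmoid (3) is fitted afterward by plain gradient descent on the cross-entropy (7) over the same training set. All function names, constants and the toy data are hypothetical.

```python
import math

def raw_output(w, b, x):
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) - b  # linear classifier (1)

def train_classifier(inputs, targets, epochs=200, eta=0.05, eps=0.001):
    w, b = [0.0] * len(inputs[0]), 0.0  # initialize classifier parameters
    for _ in range(epochs):
        for x, t in zip(inputs, targets):  # on-line updates
            y = raw_output(w, b, x)
            # Margin error gradient w.r.t. y (assumed thresholds +1 and -1).
            if t == 1:
                g = -1.0 if y < 1.0 else 0.0
            else:
                g = 1.0 if y > -1.0 else 0.0
            # Chain rule (dy/dw_j = x_j, dy/db = -1), then weight decay rule (11).
            w = [(1 - eps) * w_j - eta * g * x_j for w_j, x_j in zip(w, x)]
            b = (1 - eps) * b + eta * g
    return w, b

def fit_sigmoid(raw, targets, steps=2000, rate=0.1):
    a, b = -1.0, 0.0  # start from the typical fixed values of (3)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for y, t in zip(raw, targets):
            p = 1.0 / (1.0 + math.exp(a * y + b))
            grad_a += (t - p) * y  # dE/dA for (3) combined with (7)
            grad_b += (t - p)      # dE/dB for (3) combined with (7)
        a -= rate * grad_a / len(raw)
        b -= rate * grad_b / len(raw)
    return a, b

inputs = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
targets = [0, 0, 0, 1, 1, 1]
w, b = train_classifier(inputs, targets)
raw = [raw_output(w, b, x) for x in inputs]
a, b_sig = fit_sigmoid(raw, targets)
print([1.0 / (1.0 + math.exp(a * y + b_sig)) for y in raw])
```

Because the fitted sigmoid has only two parameters and a negative `a`, the resulting probability is monotonic in the raw output.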
The training system minimizes a bound on expected testing error, which means that it avoids overfitting so that training points which are outside normal ranges do not adversely affect the parameters. It provides a compact function with a minimal number of parameters for a given accuracy. In addition, the system provides simultaneous training of all the parameters for all the classes, as opposed to prior systems which required individual training of parameters. The result is a simple, fast-to-evaluate probabilistic output that is monotonic.