Machine learning is the study of computer algorithms that improve automatically through experience. Applications range from data-mining programs that discover general rules in large data sets to information-filtering systems that automatically learn users' interests. With the advent of global communications networks such as the Internet, electronic messaging, and particularly electronic mail (“email”), is becoming increasingly pervasive as a means for disseminating unwanted advertisements and promotions (also denoted as “spam”) to network users. As such, junk email, or spam, is now or soon will become a major threat to trustworthy computing.
One proven filtering technique to combat spam is based upon a machine learning approach—machine learning filters assign to an incoming message a probability that the message is junk. In this approach, features typically are extracted from two classes of example messages (e.g., spam and non-spam (good) messages), and a learning filter is applied to discriminate probabilistically between the two classes.
In general, several types of learning algorithms can be employed for machine learning. In particular, conditional maximum entropy (maxent) models have been widely used for a variety of tasks, including language modeling, part-of-speech tagging, prepositional phrase attachment, parsing, word selection for machine translation, and finding sentence boundaries. They are also sometimes called logistic regression models, maximum likelihood exponential models, or log-linear models, and can be equivalent to a form of perceptrons, or single-layer neural networks. In particular, perceptrons that use the standard sigmoid function and optimize for log-loss are equivalent to maxent.
Conditional maxent models have traditionally either been unregularized or regularized by using a Gaussian prior on the parameters. However, when employing a Gaussian prior, higher error rates can result. For example, a filter trained at least in part with a Gaussian prior may yield an increased incidence of inaccurate filtering with respect to catching spam. Regularization is needed to prevent overfitting, a phenomenon in which a learning algorithm adapts so well to a training set that random disturbances in the training set are included in the model as being meaningful. Because these disturbances do not reflect the underlying distribution of the data, performance on a test set or another set of data (with its own, but definitely different, disturbances) can suffer when a technique learns the training set too well.
Conditional maxent models are of the form

$$P_\Lambda(y \mid \bar{x}) = \frac{\exp \sum_{i=1}^{F} \lambda_i f_i(\bar{x}, y)}{\sum_{y'} \exp \sum_{i=1}^{F} \lambda_i f_i(\bar{x}, y')}$$

where $\bar{x}$ is an input vector, $y$ is an output, the $f_i$ are so-called indicator functions or feature values that are true if a particular property of $\bar{x}, y$ is true, $\Lambda$ represents the parameter set $\lambda_1 \ldots \lambda_F$, and $\lambda_i$ is a weight for the indicator $f_i$.

Consider an example such as word sense disambiguation. In this example, the goal is to determine if a particular word, e.g., “bank”, has a particular sense, e.g., financial bank or river bank. Here, $\bar{x}$ would be the context around an occurrence of the word bank; $y$ would be a particular sense, e.g., financial or river; $f_i(\bar{x}, y)$ could be 1 if the context includes the word “money” and $y$ is the financial sense; and $\lambda_i$ would be a large positive number. Other $f_i$ would represent other properties, e.g., the nearby presence of other words.
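As a minimal sketch of the model form above, the following Python computes $P_\Lambda(y \mid \bar{x})$ for the “bank” example. The function name `maxent_probs`, the two indicator features, the weight values, and the representation of the context as a set of words are all assumptions made for illustration; none of them come from the source.

```python
import math

def maxent_probs(weights, features, x, labels):
    """Return {y: P(y | x)} for a conditional maxent model."""
    # Score each candidate output y: sum_i lambda_i * f_i(x, y).
    scores = {
        y: sum(w * f(x, y) for w, f in zip(weights, features))
        for y in labels
    }
    # Subtract the largest score before exponentiating; this leaves the
    # ratio unchanged and avoids overflow.
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())  # normalizer: the sum over y'
    return {y: e / z for y, e in exps.items()}

# Hypothetical indicator features for the "bank" example.
def f_money_financial(x, y):
    return 1.0 if "money" in x and y == "financial" else 0.0

def f_water_river(x, y):
    return 1.0 if "water" in x and y == "river" else 0.0

features = [f_money_financial, f_water_river]
weights = [2.0, 1.5]              # illustrative values, not trained
labels = ["financial", "river"]

context = {"deposit", "money"}    # words near an occurrence of "bank"
probs = maxent_probs(weights, features, context, labels)
print(probs)
```

Because only the “money” feature fires for this context, the financial sense receives the larger share of the probability mass, and the two probabilities sum to one.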
Maxent models have several valuable properties. The most important is constraint satisfaction. For a given $f_i$, we can count how many times $f_i$ was observed in the training data with value $y$: $\text{observed}[i] = \sum_j f_i(\bar{x}_j, y_j)$. For a model $P_\Lambda$ with parameters $\Lambda$, we can see how many times the model predicts that $f_i$ would be expected to occur: $\text{expected}[i] = \sum_{j,y} P_\Lambda(y \mid \bar{x}_j) f_i(\bar{x}_j, y)$. Maxent models have the property that $\text{expected}[i] = \text{observed}[i]$ for all $i$ and $y$. These equalities are called constraints. The next important property is that the likelihood of the training data is maximized (thus the name maximum likelihood exponential model). Third, the model is as similar as possible to a uniform distribution (i.e., it minimizes the Kullback-Leibler divergence to the uniform distribution), given the constraints, which is why these models are called maximum entropy models.
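The observed and expected counts defined above can be sketched in a few lines of Python. The toy data set, the two indicator features, and all function names here are hypothetical and chosen only to make the two sums concrete; the weights are left at zero, so the model is uniform and the constraints are not yet satisfied.

```python
import math

def predict(weights, features, x, labels):
    """Return {y: P(y | x)} under the conditional maxent form."""
    scores = {y: sum(w * f(x, y) for w, f in zip(weights, features))
              for y in labels}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def observed_counts(features, data):
    # observed[i] = sum_j f_i(x_j, y_j)
    return [sum(f(x, y) for x, y in data) for f in features]

def expected_counts(weights, features, data, labels):
    # expected[i] = sum_{j,y} P(y | x_j) * f_i(x_j, y)
    totals = [0.0] * len(features)
    for x, _ in data:
        p = predict(weights, features, x, labels)
        for i, f in enumerate(features):
            totals[i] += sum(p[y] * f(x, y) for y in labels)
    return totals

# Hypothetical features and training data (context words, sense).
features = [
    lambda x, y: 1.0 if "money" in x and y == "financial" else 0.0,
    lambda x, y: 1.0 if "water" in x and y == "river" else 0.0,
]
labels = ["financial", "river"]
data = [
    ({"money"}, "financial"),
    ({"money", "loan"}, "financial"),
    ({"water"}, "river"),
    ({"money", "water"}, "river"),
]

observed = observed_counts(features, data)
expected = expected_counts([0.0, 0.0], features, data, labels)
print(observed, expected)
```

Training an unregularized model amounts to adjusting the weights until the two lists agree for every feature.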
The third property is a form of regularization. However, it turns out to be an extremely weak one: it is not uncommon for models, especially those that use all or most possible features, to assign near-zero probabilities (or, if the $\lambda$s may be infinite, even actual zero probabilities) and to exhibit other symptoms of severe overfitting. There have been a number of approaches to this problem. The most relevant conventional approach employs a Gaussian prior for maxent models. A Gaussian prior with mean 0 and variance $\sigma_i^2$ is placed on the model parameters (the $\lambda_i$s), and then a model that maximizes the posterior probability of the data and the model is found.
Maxent models without priors use the parameters $\Lambda$ given by

$$\arg\max_\Lambda \prod_{j=1}^{n} P_\Lambda(y_j \mid \bar{x}_j)$$

where $\bar{x}_j, y_j$ are training data instances. With a Gaussian prior we find

$$\arg\max_\Lambda \prod_{j=1}^{n} P_\Lambda(y_j \mid \bar{x}_j) \times \prod_{i=1}^{F} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{\lambda_i^2}{2\sigma_i^2}\right)$$
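In practice, products like the two above are usually evaluated in log space, where they become sums. The following sketch, under that assumption, computes the log-likelihood, the log of the Gaussian prior, and their sum (the log posterior up to a constant). The function names, features, and data are hypothetical.

```python
import math

def log_prob(weights, features, x, y, labels):
    """log P(y | x) for a conditional maxent model."""
    scores = [sum(w * f(x, c) for w, f in zip(weights, features))
              for c in labels]
    top = max(scores)
    # log of the normalizer, computed stably (log-sum-exp)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[labels.index(y)] - log_z

def log_likelihood(weights, features, data, labels):
    # log of prod_j P(y_j | x_j)
    return sum(log_prob(weights, features, x, y, labels) for x, y in data)

def log_gaussian_prior(weights, variances):
    # log of prod_i 1/sqrt(2*pi*var_i) * exp(-lambda_i^2 / (2*var_i))
    return sum(-0.5 * math.log(2.0 * math.pi * v) - w * w / (2.0 * v)
               for w, v in zip(weights, variances))

def log_posterior(weights, features, data, labels, variances):
    return (log_likelihood(weights, features, data, labels)
            + log_gaussian_prior(weights, variances))

# Hypothetical features and training data (context words, sense).
features = [
    lambda x, y: 1.0 if "money" in x and y == "financial" else 0.0,
    lambda x, y: 1.0 if "water" in x and y == "river" else 0.0,
]
labels = ["financial", "river"]
data = [
    ({"money"}, "financial"),
    ({"money", "loan"}, "financial"),
    ({"water"}, "river"),
    ({"money", "water"}, "river"),
]
variances = [1.0, 1.0]  # sigma_i^2, illustrative values

at_zero = log_posterior([0.0, 0.0], features, data, labels, variances)
at_large = log_posterior([50.0, 50.0], features, data, labels, variances)
print(at_zero, at_large)
```

With these illustrative settings, very large weights score worse than zero weights because the quadratic penalty from the prior outweighs any gain in likelihood.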
In this case, a trained model does not satisfy the constraints $\text{expected}[i] = \text{observed}[i]$, but, as was shown, instead satisfies the constraints

$$\text{expected}[i] = \text{observed}[i] - \frac{\lambda_i}{\sigma_i^2} \qquad (1)$$

That is, instead of a model that matches the observed count, a model that matches the observed count minus the value $\lambda_i / \sigma_i^2$ is obtained. In language modeling terms, this is referred to as “discounting.”
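Constraint (1) can be checked numerically. The gradient of the log posterior with respect to $\lambda_i$ is $\text{observed}[i] - \text{expected}[i] - \lambda_i / \sigma_i^2$, so it vanishes exactly when (1) holds. The sketch below uses plain gradient ascent on a toy problem; the optimizer, step size, iteration count, features, and data are all assumptions chosen for illustration, not the training procedure of any particular system.

```python
import math

def predict(weights, features, x, labels):
    """Return {y: P(y | x)} under the conditional maxent form."""
    scores = {y: sum(w * f(x, y) for w, f in zip(weights, features))
              for y in labels}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def observed_counts(features, data):
    return [sum(f(x, y) for x, y in data) for f in features]

def expected_counts(weights, features, data, labels):
    totals = [0.0] * len(features)
    for x, _ in data:
        p = predict(weights, features, x, labels)
        for i, f in enumerate(features):
            totals[i] += sum(p[y] * f(x, y) for y in labels)
    return totals

def train(features, data, labels, variances, step=0.2, iters=1000):
    """Gradient ascent on the log posterior with a Gaussian prior."""
    weights = [0.0] * len(features)
    obs = observed_counts(features, data)
    for _ in range(iters):
        exp_ = expected_counts(weights, features, data, labels)
        # gradient_i = observed[i] - expected[i] - lambda_i / sigma_i^2
        weights = [w + step * (o - e - w / v)
                   for w, o, e, v in zip(weights, obs, exp_, variances)]
    return weights

# Hypothetical features and training data (context words, sense).
features = [
    lambda x, y: 1.0 if "money" in x and y == "financial" else 0.0,
    lambda x, y: 1.0 if "water" in x and y == "river" else 0.0,
]
labels = ["financial", "river"]
data = [
    ({"money"}, "financial"),
    ({"money", "loan"}, "financial"),
    ({"water"}, "river"),
    ({"money", "water"}, "river"),
]
variances = [1.0, 1.0]  # sigma_i^2, illustrative values

weights = train(features, data, labels, variances)
observed = observed_counts(features, data)
expected = expected_counts(weights, features, data, labels)
# Residual of constraint (1); it should be close to zero after training.
residuals = [o - w / v - e
             for o, w, v, e in zip(observed, weights, variances, expected)]
print(weights, residuals)
```

On this toy problem both trained weights are positive, so each expected count ends up below its observed count by $\lambda_i / \sigma_i^2$, which is the discount described above.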
However, not all models can be generated by the same process, and thus a single prior may not work best for all problem types.