There are many applications for automatic classification of items such as e-mail, documents, images, and recordings. To address this need, a plethora of classifiers have been developed based on probabilistic dependency models learned from training data. Some examples of classifiers based on probabilistic dependency models include logistic regression models, decision tree models, support vector machines, Naive Bayes models, and neural networks.
Logistic regression models are also called maximum entropy models and are equivalent to a certain kind of single layer neural network. In particular, logistic regression models are of the form:
$$P_{\underline{w}}(Y = 1 \mid \underline{x}) = \frac{\exp(\underline{w} \cdot \underline{x})}{1 + \exp(\underline{w} \cdot \underline{x})} \qquad \text{Equation (1)}$$
In this equation, Y is the variable being predicted (in this case, Y takes the value 0 or 1, with 1 meaning that a message is positive, for example classified as spam), w represents a set of weights, and x represents an instance of input data, such as the words in an e-mail message. For instance, x can be a vector of 1's and 0's, with a 1 indicating that a particular word is present in the message. Each weight indicates the relative importance of the corresponding word. These weights are typically learned in such a way as to maximize the probability of the training data as follows:
$$\arg\max_{\underline{w}} \prod_{i=1}^{n} P_{\underline{w}}(Y = y_i \mid \underline{x}_i)$$
That is, a set of weights w is identified that makes the training data as likely as possible; for example, the weights do as good a job as possible of predicting that each spam message is spam and each good message is good. In practice, these weights are almost always regularized, for example with a Gaussian prior, under which it is assumed that the average weight should be 0 and that very large weights are unlikely. Letting N(w; 0, σ²) represent the probability of a variable w being generated by a Gaussian distribution with mean 0 and variance σ², the actual formula maximized is:
$$\arg\max_{\underline{w}} \prod_{i=1}^{n} P_{\underline{w}}(Y = y_i \mid \underline{x}_i) \cdot \prod_{j=1}^{k} N(w_j; 0, \sigma^2)$$
This means that weights w are identified that jointly maximize the probability of the training data and the prior probability of the weights themselves.
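As a minimal illustration of the model and objective described above, the following Python sketch evaluates Equation (1) and the logarithm of the regularized objective. The function names `predict_proba` and `log_posterior` are hypothetical, and the logarithm is taken only for numerical convenience; it does not change which weights attain the maximum.

```python
import math

def predict_proba(w, x):
    """Equation (1): P_w(Y=1 | x) = exp(w.x) / (1 + exp(w.x))."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    # Algebraically equal to exp(z) / (1 + exp(z)), arranged to avoid overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_posterior(w, X, y, sigma2):
    """Log of the regularized objective: data log-likelihood plus log Gaussian prior."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, xi)
        # P_w(Y = y_i | x_i) is p when y_i is 1 and (1 - p) when y_i is 0.
        total += math.log(p if yi == 1 else 1.0 - p)
    for wj in w:
        # log N(w_j; 0, sigma^2) for a zero-mean Gaussian with variance sigma2.
        total += -0.5 * math.log(2.0 * math.pi * sigma2) - wj * wj / (2.0 * sigma2)
    return total
```

With all weights at zero, `predict_proba` returns 0.5 for any input, which corresponds to a model that has not yet learned anything about the words.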
A training algorithm that can be used is Sequential Conditional Generalized Iterative Scaling (SCGIS), although because logistic regression models have a global optimum, the choice of learning algorithm is typically of little importance, except for training speed considerations.
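Because the regularized objective has a single global optimum, even a very simple learner can reach it. The sketch below uses plain batch gradient ascent on the log of the regularized objective rather than SCGIS; it is offered only as an illustration of that point. The function name `train` and the parameters `lr` and `steps` are assumptions introduced here.

```python
import math

def sigmoid(z):
    """Overflow-safe form of exp(z) / (1 + exp(z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train(X, y, sigma2=1.0, lr=0.1, steps=500):
    """Batch gradient ascent on the log of the Gaussian-regularized objective.

    This is NOT SCGIS; it is a simpler stand-in that targets the same optimum.
    """
    k = len(X[0])
    w = [0.0] * k
    for _ in range(steps):
        # Gradient of the log Gaussian prior: -w_j / sigma^2.
        grad = [-wj / sigma2 for wj in w]
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            # Gradient of the log-likelihood term: (y_i - p_i) * x_i.
            for j in range(k):
                grad[j] += (yi - p) * xi[j]
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w

# Toy data (hypothetical): word 0 appears only in spam, word 1 only in good mail.
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]
weights = train(X, y)
```

On this toy data the learned weight for the first word is positive and the weight for the second word is negative, with the Gaussian prior keeping both from growing without bound.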
Naive Bayes is a well-known algorithm, especially for e-mail spam filtering. Naive Bayes computes the probability of an e-mail message as a whole: given all possible good messages, what is the probability that this particular message was generated, and given all possible spam messages, what is the probability that this particular message was generated? There is an assumption of conditional independence, namely that each word in the message was generated independently of the others, given the label (i.e., spam or good) of the message. Naive Bayes is concerned with accurately estimating the probability of all messages, and there is no focus on any particular region of a Receiver Operating Characteristic (ROC) curve.
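The conditional independence assumption can be made concrete with a short Python sketch. It assumes a Bernoulli (word present or absent) model over 0/1 word vectors and add-alpha smoothing, neither of which is specified in the description above; the function names are likewise hypothetical. The sketch also assumes that both labels occur in the training data.

```python
import math

def train_naive_bayes(X, y, alpha=1.0):
    """Estimate a class prior and per-word presence probabilities for each label."""
    k = len(X[0])
    model = {}
    for label in (0, 1):
        rows = [xi for xi, yi in zip(X, y) if yi == label]
        n = len(rows)
        # Add-alpha smoothing (an assumption) keeps every probability strictly
        # between 0 and 1, so the logarithms below are always defined.
        word_prob = [(sum(r[j] for r in rows) + alpha) / (n + 2.0 * alpha)
                     for j in range(k)]
        model[label] = (n / len(X), word_prob)
    return model

def log_joint(model, x, label):
    """log P(label) + sum over words of log P(word state | label)."""
    prior, word_prob = model[label]
    total = math.log(prior)
    for xj, pj in zip(x, word_prob):
        # Independence given the label: each word contributes its own factor.
        total += math.log(pj if xj == 1 else 1.0 - pj)
    return total

def prob_spam(model, x):
    """Posterior probability that message x is spam (label 1)."""
    a = log_joint(model, x, 1)
    b = log_joint(model, x, 0)
    m = max(a, b)  # subtract the maximum before exponentiating, for stability
    return math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))

# Toy data (hypothetical): word 0 appears only in spam, word 1 only in good mail.
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]
model = train_naive_bayes(X, y)
p = prob_spam(model, [1, 0])
```

Note that the model is fit to describe how messages of each class are generated; nothing in the training step favors one region of the ROC curve over another.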
Traditionally, classifiers have been trained using a sample set of data representative of the items being classified and are optimized to meet certain accuracy and/or entropy requirements equally across the entire region of classification. Accuracy can be optimized for equal costs for false positives and false negatives, and entropy can be optimized for estimating probabilities correctly across the entire range of probabilities. However, there are many cases where the classifier must meet higher requirements for one or more particular regions of the classification, such as the low false-positive region or the low false-negative region. An application threshold would normally be set to achieve the desired requirement, such as a low false positive rate for good mail being classified as spam in an e-mail application. A classifier that has been optimized for equal performance across the entire region of classification can have reduced effectiveness when a threshold is used to classify data that falls within a particular region of interest.
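The way an application threshold is set for a region of interest can be sketched as follows. This Python example assumes a held-out set of classifier scores with known labels and picks the lowest threshold whose false positive rate does not exceed a target; the function names, the "score > threshold" decision rule, and the sample scores are all hypothetical.

```python
def threshold_for_fpr(scores, labels, max_fpr):
    """Lowest threshold such that the false positive rate is at most max_fpr.

    An item is classified as positive when its score is strictly greater
    than the threshold.
    """
    negatives = sorted((s for s, l in zip(scores, labels) if l == 0), reverse=True)
    # Number of negative items that may be (wrongly) classified as positive.
    allowed = int(max_fpr * len(negatives))
    if allowed >= len(negatives):
        return float("-inf")  # every item may be classified as positive
    return negatives[allowed]

def rates(scores, labels, threshold):
    """Return (false positive rate, false negative rate) at the given threshold."""
    fp = sum(1 for s, l in zip(scores, labels) if l == 0 and s > threshold)
    fn = sum(1 for s, l in zip(scores, labels) if l == 1 and s <= threshold)
    return fp / labels.count(0), fn / labels.count(1)

# Hypothetical held-out scores: label 1 is spam, label 0 is good mail.
scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

strict = threshold_for_fpr(scores, labels, 0.0)   # no good mail marked as spam
relaxed = threshold_for_fpr(scores, labels, 0.2)  # up to 20% of good mail
```

On these sample scores, the strict threshold admits no false positives but misses two of the five spam messages, whereas the relaxed threshold catches all spam at the cost of one good message in five. The threshold only selects an operating point; it does not change how well the underlying classifier separates the classes in that region.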
There is a need to provide a classifier that has been optimized for a specific area of the classification region that is of interest. For example, a user who has enabled a spam filter in an e-mail application does not want good messages to end up in a spam folder, even at the expense of receiving a few spam messages in the inbox. Therefore, the spam filter must be optimized to have a low false positive rate for identifying good messages as spam. In another example, a lab that employs classifiers to identify cancerous cells in a sample is more concerned about missing a positive identification of a cancerous sample than about allowing a few negative samples to be identified as positive. The cost of missing a positive sample can be significant in terms of lost treatment time for the patient, whereas a negative sample that is incorrectly identified as positive will likely be caught in further testing. In this case, the classifier must be optimized to have a low false negative rate for classifying positive samples as negative.