Logistic regression is a statistical classification method that may be used to classify or filter documents, for example to identify “spam” or “junk” emails. In this application, logistic regression uses previous classifications of documents, together with the features in those documents, to generate models (parameters), and then uses those models in a logistic regression function to predict the classification of new documents. For example, an email filtering system may develop logistic regression parameters based upon previous classifications (“spam” or “non-spam”) of documents (e.g., test emails or test data) input to the email filtering system, and use those parameters with a logistic regression algorithm to predict whether a new email input to the email filtering system is “spam” or “non-spam.”
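By way of illustration only, the train-then-predict flow described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular filtering system: the vocabulary, the training emails, the binary word-presence features, the stochastic gradient ascent training loop, and all function names are hypothetical choices made for the example.

```python
import math

# Hypothetical vocabulary and previously classified emails (1 = spam, 0 = non-spam).
VOCABULARY = ["free", "winner", "prize", "meeting", "report", "lunch"]
TRAINING = [
    ("free prize winner", 1),
    ("claim your free prize", 1),
    ("winner winner free money", 1),
    ("meeting report attached", 0),
    ("lunch meeting tomorrow", 0),
    ("quarterly report draft", 0),
]


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def featurize(text, vocabulary):
    """Binary features: 1.0 if the vocabulary word occurs in the text, else 0.0."""
    words = set(text.lower().split())
    return [1.0 if word in words else 0.0 for word in vocabulary]


def train(examples, vocabulary, learning_rate=0.5, epochs=500):
    """Fit weights and a bias by stochastic gradient ascent on the log-likelihood."""
    weights = [0.0] * len(vocabulary)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = featurize(text, vocabulary)
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = label - p  # gradient of the log-likelihood w.r.t. the logit
            bias += learning_rate * error
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
    return weights, bias


def predict_spam_probability(text, vocabulary, weights, bias):
    """Estimated probability that a new email is spam under the learned parameters."""
    x = featurize(text, vocabulary)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Learn parameters from the previous classifications, then score new emails.
WEIGHTS, BIAS = train(TRAINING, VOCABULARY)
```

In this sketch, a new email would be labeled “spam” when `predict_spam_probability` returns a value above a chosen threshold such as 0.5, and “non-spam” otherwise. The threshold is likewise an assumption of the example.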
The problem of identifying spam email is unlike other classification problems, in which the features of the classes are generally constant and need to be learned only once. The characteristics of spam emails continually evolve as spammers attempt to defeat filtering systems. Thus, any given set of features learned by a filtering system, including a logistic regression model, will eventually fail to usefully identify spam email. Accordingly, it is desirable for a logistic regression model to be updated to reflect the changing nature of the data as new data becomes available over time. However, conventional logistic regression algorithms do not ensure that an update to the logistic regression models actually enhances the accuracy of classification. Specifically, conventional logistic regression algorithms are not able to determine whether the updates to the logistic regression parameters suggested by the new classification data are coherent with the logistic regression models already in place. These existing logistic regression models were generated based upon the classifications of previous documents that still have significance to the filtering system, and thus should not be completely disregarded when updating the logistic regression parameters.
Therefore, there is a need for a method of updating logistic regression models based upon new classification data, in a manner that properly preserves the characteristics of the existing logistic regression models. There is also a need to ensure that the updated logistic regression models enhance the accuracy of classification when used in email or document filtering systems. In addition, there is a need to determine whether the updates to the logistic regression parameters suggested by the new classification data are coherent with the logistic regression models already in place, for example, in an email filtering system.