Machine learning based spam filters require training data in order to be effective. A common problem with this class of filters is that it is difficult to gather training data that is representative of the environment of the user of the computing device, especially without manual user feedback.
Current machine learning based spam filters are trained by a third party, by the user, or by both a third party and the user. The third party may be a software publisher. For example, the spam filter may be Norton Spam Alert, which is trained by its software publisher, SYMANTEC® Corporation of Cupertino, Calif. Machine learning based spam filters trained by third parties tend to produce many false positives (clean messages classified as spam), because the training corpus for the filter normally does not contain many clean electronic messages of the kind actually received by an individual user or enterprise. However, because such a third party corpus contains a good representation of the overall spam received by users and enterprises, the false negative rate (spam classified as clean) is usually low. On the other hand, spam filters trained exclusively by an individual user or enterprise typically have a low false positive rate, because the user or enterprise has available a relatively large volume of clean messages that precisely represent what is typical for that user or enterprise. Such filters typically have a moderate false negative rate, because the user or enterprise trains on a relatively small sample of spam messages compared with a third party.
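The two error rates discussed above can be made concrete with a short sketch. This is an illustration only, not part of any described filter: the function name `error_rates`, the toy labels, and the conventional definitions used (false positive rate as the fraction of clean messages flagged as spam, false negative rate as the fraction of spam messages passed as clean) are assumptions introduced here.

```python
def error_rates(actual_is_spam, predicted_is_spam):
    """Return (false_positive_rate, false_negative_rate) for a filter.

    Assumed definitions (not taken from the source text):
      false positive rate = clean messages flagged as spam / all clean messages
      false negative rate = spam messages passed as clean / all spam messages
    """
    false_positives = false_negatives = clean_total = spam_total = 0
    for actual, predicted in zip(actual_is_spam, predicted_is_spam):
        if actual:
            spam_total += 1
            if not predicted:
                false_negatives += 1  # spam that slipped through
        else:
            clean_total += 1
            if predicted:
                false_positives += 1  # clean message wrongly blocked
    fpr = false_positives / clean_total if clean_total else 0.0
    fnr = false_negatives / spam_total if spam_total else 0.0
    return fpr, fnr


# Hypothetical toy data: True means spam, False means clean.
actual = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, True, True, False, False]

fpr, fnr = error_rates(actual, predicted)
print(f"false positive rate: {fpr:.2f}")
print(f"false negative rate: {fnr:.2f}")
```

In this toy data the filter behaves like the third party trained filter described above would on clean mail: it flags two of the four clean messages while missing only one of the four spam messages.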
Filters are available that are initially trained by a third party and then retrained manually over time by the user or enterprise. While this technique is feasible for an individual user, it presents problems for enterprises, because the enterprise must process a very large volume of messages (all the messages of all the individual computing devices within the enterprise), which makes manual retraining impractical.
The present invention improves the training of machine learning based spam filters, so that such filters can enjoy a low false positive rate and a low false negative rate, and can be used effectively by both individual users and enterprises.