Electronic mail (e-mail) has become an indispensable tool for business and personal communications. Unfortunately, a large percentage of received e-mail is unsolicited and unwanted, and is commonly referred to as spam. In addition, malicious originators hide viruses and other types of malicious software code in e-mail messages in an attempt to get unsuspecting users to launch or spread the code. Dealing with spam and malicious code wastes users' time and costs money in lost productivity and downtime. Typical systems for handling and filtering spam and malicious software are often difficult to manage and use. In addition, many systems are either overly restrictive, blocking too many legitimate message originators, or overly permissive, allowing too many spam messages to pass to a user's inbox.
Some conventional filtering systems use statistical classifiers to determine whether a received message is spam. These statistical classifiers develop a spam score for a received message using information regarding the status of previously received messages. This information is stored in an associated classifier database. To operate effectively, these statistical classifier systems require a user to initialize the classifier database through a manual bulk training process. In the bulk training process, the user identifies a set of “good messages” (i.e., non-spam/non-malicious) and a set of “bad messages” (i.e., spam/malicious). In addition, these systems recommend that users manually retrain the classifier periodically to adapt to the changing techniques of spammers and/or malicious message originators. Retraining is also performed in bulk. Without this periodic retraining, the classifier database is not kept up to date and, as a result, the quality of the statistical classifier is reduced.
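The bulk training and scoring described above can be illustrated with a minimal sketch. It is not taken from any particular conventional system: the class name `TokenClassifier`, its methods, and the choice of a naive Bayes score with add-one smoothing are all assumptions made for illustration, and the in-memory counters merely stand in for the classifier database.

```python
import math
from collections import Counter


class TokenClassifier:
    """Toy statistical classifier (hypothetical, for illustration only)."""

    def __init__(self):
        # Stand-in for the classifier database: per-class token counts
        # and the number of messages trained in each class.
        self.counts = {"good": Counter(), "bad": Counter()}
        self.messages = {"good": 0, "bad": 0}

    def train(self, text, label):
        # Record which tokens appeared in a message of known status.
        self.counts[label].update(set(text.lower().split()))
        self.messages[label] += 1

    def bulk_train(self, good_messages, bad_messages):
        # Manual bulk training: the user supplies both sets up front.
        for text in good_messages:
            self.train(text, "good")
        for text in bad_messages:
            self.train(text, "bad")

    def spam_score(self, text):
        # Naive Bayes log-odds with add-one smoothing, mapped to (0, 1).
        log_odds = math.log((self.messages["bad"] + 1) / (self.messages["good"] + 1))
        for token in set(text.lower().split()):
            p_bad = (self.counts["bad"][token] + 1) / (self.messages["bad"] + 2)
            p_good = (self.counts["good"][token] + 1) / (self.messages["good"] + 2)
            log_odds += math.log(p_bad / p_good)
        return 1 / (1 + math.exp(-log_odds))


classifier = TokenClassifier()
classifier.bulk_train(
    good_messages=["meeting agenda for monday", "lunch on friday"],
    bad_messages=["cheap pills online", "win cheap prizes now"],
)
spam_like = classifier.spam_score("cheap pills")
ham_like = classifier.spam_score("monday meeting")
```

In this sketch a score above 0.5 leans toward spam and a score below 0.5 leans toward a legitimate message. The score depends entirely on what was supplied to `bulk_train`, which is why stale training data degrades the results.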
In addition, in these conventional statistical classifier systems, training is performed on every message. As a result, these systems tend to have a large classifier database containing a great deal of redundant information. This unnecessary redundancy degrades both the performance of the database and the quality of the scores.
Therefore, what is needed is a system, method, and computer program product that automatically trains the classifier, without manual intervention, when an error in categorizing a message is detected.
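The difference between training on every message and training only when a categorization error is detected can be sketched as follows. This is a hedged illustration, not the claimed implementation: `CountingClassifier`, `train_on_error`, and the simple hit-count prediction rule are hypothetical, and the sketch assumes the true status of each message becomes known by some means (for example, a user correcting a misfiled message).

```python
from collections import Counter


class CountingClassifier:
    """Hypothetical toy classifier used only to illustrate the idea."""

    def __init__(self):
        self.bad = Counter()
        self.good = Counter()
        self.trained = 0  # number of messages folded into the database

    def train(self, text, is_spam):
        target = self.bad if is_spam else self.good
        target.update(text.lower().split())
        self.trained += 1

    def predict(self, text):
        # Call a message spam if its tokens were seen more often in bad
        # messages than in good ones (an assumed, simplistic rule).
        tokens = text.lower().split()
        bad_hits = sum(self.bad[t] for t in tokens)
        good_hits = sum(self.good[t] for t in tokens)
        return bad_hits > good_hits


def train_on_error(classifier, text, actual_is_spam):
    """Update the database only when the prediction was wrong."""
    if classifier.predict(text) != actual_is_spam:
        classifier.train(text, actual_is_spam)
        return True
    return False


# Assumed message stream: (text, true status once it becomes known).
stream = [
    ("cheap pills", True),
    ("cheap pills", True),
    ("team meeting", False),
    ("cheap pills now", True),
]

clf = CountingClassifier()
updates = [train_on_error(clf, text, is_spam) for text, is_spam in stream]
```

With this stream, only the first message triggers an update, whereas training on every message would add all four to the database, including the repeated ones.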