1. Field of the Invention
This invention relates to electronic message analysis and filtering. More particularly, the invention relates to a system and method for improving a spam filtering feature set.
2. Description of the Related Art
“Spam” is commonly defined as unsolicited bulk e-mail, i.e., email that was not requested (unsolicited) and sent to multiple recipients (bulk). Although spam has been in existence for quite some time, the amount of spam transmitted over the Internet and corporate local area networks (LANs) has increased significantly in recent years. In addition, the techniques used by “spammers” (those who generate spam) have become more advanced in order to circumvent existing spam filtering products.
Spam represents more than a nuisance to corporate America. Significant costs are associated with spam including, for example, lost productivity and the additional hardware, software, and personnel required to combat the problem. In addition, many users are bothered by spam because it interferes with the amount of time they spend reading legitimate e-mail. Moreover, because spammers send spam indiscriminately, pornographic messages may show up in e-mail inboxes of workplaces and children—the latter being a crime in some jurisdictions. Recently, there has been a noticeable increase in spam advertising websites which contain child pornography. “Phishing” emails are another type of spam that request account numbers, credit card numbers and other personal information from the recipient.
1. Real-Time Spam Filtering
Various techniques currently exist for filtering spam. Specifically, FIG. 1 illustrates an exemplary spam filtering architecture which includes an email analyzer module 101, a mathematical model module 102 and a message processing module 103.
The email analyzer module 101 analyzes each incoming email message to determine whether the email message contains one or more spam-like “features.” Features used in content-based spam filters can be divided into three basic categories:
(1) Header information: Features that describe the information path followed by a message from its origin to its destinations, as well as meta information such as date, subject, Mail Transfer Agents (MTA), Mail User Agents (MUA), content types, etc.
(2) Message body contents: Features that describe the text contained in the body of an email, such as words, phrases, obfuscations, URLs, etc.
(3) Meta features: Boolean combinations of other features used to improve accuracy.
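The three categories above can be illustrated with a minimal Python sketch. It is not the analyzer of FIG. 1: the function name `extract_features`, the feature names, and the specific checks are all hypothetical and chosen only to show one header feature, two body features, and one meta feature built as a Boolean combination of the others.

```python
import re
from email import message_from_string


def extract_features(raw_message: str) -> set:
    """Return the names of the (hypothetical) features that fire in a message."""
    msg = message_from_string(raw_message)
    fired = set()

    # (1) Header information: here, a missing or empty Subject header.
    subject = msg["Subject"]
    if subject is None or not subject.strip():
        fired.add("header:missing_subject")

    # (2) Message body contents: a phrase and the presence of a URL.
    body = msg.get_payload()
    if isinstance(body, str):  # multipart bodies are ignored in this sketch
        text = body.lower()
        if "buy viagra" in text:
            fired.add("body:buy_viagra")
        if re.search(r"https?://", text):
            fired.add("body:has_url")

    # (3) Meta features: a Boolean combination of features found above.
    if "header:missing_subject" in fired and "body:has_url" in fired:
        fired.add("meta:no_subject_with_url")

    return fired


sample = "From: a@example.com\nTo: b@example.com\n\nBuy Viagra at http://example.com\n"
print(sorted(extract_features(sample)))
```

A production analyzer would evaluate a far larger feature set and handle multipart and encoded bodies; the sketch only shows how the three categories relate to one another.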
Once the features of an email message have been identified, a mathematical model 102 is used to apply “weights” to each of the features. Features which are known to be a relatively better indicator of spam are given a relatively higher weight than other features. The feature weights are determined via “training” of classification algorithms such as Naïve Bayes, Logistic Regression, Neural Networks, etc. Exemplary training techniques are described below with respect to FIG. 2.
The combined weights are then used to arrive at a spam “score.” If the score is above a specified threshold value, then the email is classified as spam and filtered out by message processing module 103. By contrast, if the score is below the specified threshold value, then the message processing module 103 forwards the email on to the user's account on the email server 104.
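The scoring and thresholding step can be sketched as follows. The text does not state how weights are combined, so this example assumes a simple sum over the features that fired; the weights, the threshold, and the feature names are invented for illustration.

```python
from typing import Dict, Set


def spam_score(fired: Set[str], weights: Dict[str, float]) -> float:
    """Combine weights by summing them (an assumption; other schemes exist)."""
    return sum(weights.get(name, 0.0) for name in fired)


def classify(fired: Set[str], weights: Dict[str, float], threshold: float) -> str:
    """Label a message 'spam' when its score exceeds the threshold, else 'ham'."""
    return "spam" if spam_score(fired, weights) > threshold else "ham"


# Hypothetical weights and threshold, for illustration only.
weights = {"body:buy_viagra": 4.0, "header:forged_date": 2.0, "body:visit_online": 0.5}
threshold = 5.0

print(classify({"body:buy_viagra", "header:forged_date"}, weights, threshold))  # spam
print(classify({"body:visit_online"}, weights, threshold))  # ham
```

In the architecture of FIG. 1, a "spam" result would correspond to the message being filtered out, and a "ham" result to the message being forwarded to the email server.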
2. Training
As mentioned above, the weights applied to features within the feature set are determined through a process known as “training.” Different algorithms use different methods of weight calculation including maximum entropy, error backtracking, etc. The spam model is regularly trained in order to assign weights to newly extracted features and update the weights associated with older features. Regular training helps to keep the weights of features updated according to the latest spam techniques in use.
FIG. 2 illustrates an exemplary training scenario which employs machine learning, a training technique developed by the assignee of the present patent application. See, e.g., Proofpoint MLX Whitepaper (2005), currently available at www.proofpoint.com. In this scenario, an email training corpus 200 containing known spam and ham messages is provided as a data source. A feature detection module 201 identifies features from the feature set within each email and provides this information to a machine learning module 202. The machine learning module 202 is also told whether each message is spam or ham. Using this information, the machine learning module 202 calculates a correlation between the features and spam messages, i.e., it determines how accurately certain features identify spam/ham. As mentioned above, various machine learning algorithms may be used such as Naïve Bayes, Logistic Regression, Neural Networks, etc.
The calculations performed by the machine learning module 202 are expressed in the form of a weight file 203 which associates a weight with each of the features in the feature set. For example, features which identify spam with relatively greater accuracy (e.g., “buy Viagra”) are provided with relatively larger weights than other features (e.g., “visit online”). The weight file is subsequently used to perform spam filtering operations as described above.
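As a rough illustration of how a labeled corpus might be turned into per-feature weights, the sketch below uses a smoothed log-odds ratio in the spirit of Naïve Bayes. This is one possible choice and is not claimed to be the method behind the weight file 203; the function name `train_weights`, the smoothing constants, and the tiny corpus are assumptions.

```python
import math
from typing import Dict, Iterable, Set, Tuple


def train_weights(corpus: Iterable[Tuple[Set[str], bool]]) -> Dict[str, float]:
    """Map each feature to log(P(fires | spam) / P(fires | ham)).

    Each corpus item is (features that fired, is_spam). Add-one smoothing
    keeps the ratio finite when a feature never fires in one class.
    """
    spam_total = ham_total = 0
    spam_hits: Dict[str, int] = {}
    ham_hits: Dict[str, int] = {}

    for fired, is_spam in corpus:
        if is_spam:
            spam_total += 1
        else:
            ham_total += 1
        hits = spam_hits if is_spam else ham_hits
        for name in fired:
            hits[name] = hits.get(name, 0) + 1

    weights: Dict[str, float] = {}
    for name in set(spam_hits) | set(ham_hits):
        p_spam = (spam_hits.get(name, 0) + 1) / (spam_total + 2)
        p_ham = (ham_hits.get(name, 0) + 1) / (ham_total + 2)
        weights[name] = math.log(p_spam / p_ham)
    return weights


# Hypothetical six-message corpus: three spam, three ham.
corpus = [
    ({"buy_viagra", "visit_online"}, True),
    ({"buy_viagra"}, True),
    ({"buy_viagra"}, True),
    ({"visit_online"}, False),
    ({"visit_online"}, False),
    (set(), False),
]
for name, weight in sorted(train_weights(corpus).items()):
    print(f"{name}\t{weight:.3f}")
```

With this toy corpus, the feature that fires only in spam receives a larger weight than the feature that fires in both classes, which matches the relationship between “buy Viagra” and “visit online” described above.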
3. Feature Selection
To efficiently handle the continuous introduction of new types of spam emails, it becomes vitally important to continually add new features or attributes to the model (the terms “attributes” and “features” are used interchangeably herein). One very important step to keep classifiers “healthy” and efficient is to keep track of these attributes and monitor their discriminative ability. It is essential to keep “good” (highly discriminative) attributes to ensure ongoing classification accuracy. But it is also important to discard “bad” (irrelevant or ineffective) attributes for at least the following reasons:
(1) Bad attributes increase the error in classification, bringing down overall effectiveness.
(2) As an increasingly large number of attributes are added, the complexity of the model grows, resulting in increased classification times, memory usage and CPU utilization.
(3) There is a risk of over-fitting the model. Redundant or useless attributes force the model to over-train itself in order to produce high accuracy on the training data. This over-training results in a drop in accuracy on the test data, an effect known as over-fitting.
Being able to distinguish between good and bad features is essential for ensuring the long-term effectiveness of the model. The logic behind any feature extraction in spam filtering is that the feature should occur frequently in spam messages and infrequently in ham messages and vice versa. An ideal feature would “fire” only in spam or only in ham messages. As used herein, a feature “fires” when that feature is present in an email message.
Consequently, the methods used to evaluate the quality of extracted features are extremely important to ensure both high effectiveness in identifying spam and a low false positive rate. One well known example is the open source spam filter SpamAssassin (“SA”), which calculates the effectiveness of a feature using the “S/O metric.” S/O calculates feature quality by measuring the Hit Frequency, which is defined as the proportion of the spam messages in which a feature fired. For example, if a feature is present in 800 out of 1000 spam messages, then its S/O value is 0.8.
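The worked example above (800 out of 1000 spam messages) can be reproduced with a few lines of Python. The sketch follows the Hit Frequency definition given in this description; the function name `hit_frequency` and the list-of-booleans input are illustrative assumptions, not SpamAssassin's own code.

```python
from typing import Iterable


def hit_frequency(fired_in_spam: Iterable[bool]) -> float:
    """Proportion of spam messages in which a feature fired.

    `fired_in_spam` holds one boolean per spam message in the corpus.
    """
    flags = list(fired_in_spam)
    if not flags:
        raise ValueError("the spam corpus is empty")
    return sum(flags) / len(flags)


# The feature fires in 800 of 1000 spam messages.
spam_flags = [True] * 800 + [False] * 200
print(hit_frequency(spam_flags))  # 0.8
```

Note that this quantity is computed from spam messages only, so it carries no information about how often the same feature fires in ham.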
Measuring the quality of features based on their S/O value biases the feature set towards “all spam” features. This method of feature selection works satisfactorily for individual spam filters where a 2-3% false positive rate is tolerable. However, enterprise-class spam filters have more stringent performance requirements. In enterprise spam solutions, designed to protect the messaging systems of large organizations with thousands of end users, even false positive rates over 0.1% result in a large amount of customer dissatisfaction.
It can be seen from the foregoing description that the effectiveness of enterprise-class spam e-mail filters relies on the quality of the feature set used by the filter's classification model. Highly effective filters may employ an extremely large number of such features (e.g., 350,000 features), which can consume a significant amount of storage space and classification time. Due to the “adversarial” nature of spam, the quality of individual features keeps changing as spam email campaigns evolve or as new campaigns emerge. Regularly discarding features which have become ineffective (“bad features”) benefits the spam filter through reduced processing time (both model training time and email delivery time), reduced storage requirements, increased spam detection accuracy and less risk of over-fitting the model.
Accordingly, improved techniques for selecting beneficial features and removing inefficient features are desirable.