1. Field of the Invention
This invention relates to electronic message analysis and filtering. More particularly, the invention relates to a system and method for improving a spam filtering feature set.
2. Description of the Related Art
“Spam” is commonly defined as unsolicited bulk e-mail, i.e., email that was not requested (unsolicited) and sent to multiple recipients (bulk). Although spam has been in existence for quite some time, the amount of spam transmitted over the Internet and corporate local area networks (LANs) has increased significantly in recent years. In addition, the techniques used by “spammers” (those who generate spam) have become more advanced in order to circumvent existing spam filtering products.
Spam represents more than a nuisance to corporate America. Significant costs are associated with spam including, for example, lost productivity and the additional hardware, software, and personnel required to combat the problem. In addition, many users are bothered by spam because it interferes with the amount of time they spend reading legitimate e-mail. Moreover, because spammers send spam indiscriminately, pornographic messages may show up in e-mail inboxes of workplaces and children—the latter being a crime in some jurisdictions. Recently, there has been a noticeable increase in spam advertising websites which contain child pornography. “Phishing” emails are another type of spam that request account numbers, credit card numbers and other personal information from the recipient.
1. Real-Time Spam Filtering
Various techniques currently exist for filtering spam. Specifically, FIG. 1 illustrates an exemplary spam filtering architecture which includes an email analyzer module 101, a mathematical model module 102 and a message processing module 103.
The email analyzer module 101 analyzes each incoming email message to determine whether the email message contains one or more spam-like “features.” Features used in content-based spam filters can be divided into three basic categories:
(1) Header information: Features that describe the information path followed by a message from its origin to its destinations, as well as meta information such as date, subject, Mail Transfer Agents (MTA), Mail User Agents (MUA), content types, etc.
(2) Message body contents: Features that describe the text contained in the body of an email, such as words, phrases, obfuscations, URLs, etc.
(3) Meta features: Boolean combinations of other features used to improve accuracy.
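The three feature categories above can be illustrated with a minimal Python sketch. The feature names, the specific checks, and the function `extract_features` are hypothetical and chosen only for illustration; the sketch also assumes a simple, non-multipart message.

```python
from email import message_from_string


def extract_features(raw_message):
    """Return the set of (hypothetical) feature names found in one message."""
    msg = message_from_string(raw_message)
    features = set()

    # (1) Header information: here, a subject line written entirely in capitals.
    subject = msg.get("Subject", "")
    if subject.isupper():
        features.add("header:subject_all_caps")

    # (2) Message body contents: a phrase and the presence of a URL.
    # Assumes a non-multipart message, so get_payload() returns a string.
    body = msg.get_payload().lower()
    if "buy viagra" in body:
        features.add("body:buy_viagra")
    if "http://" in body or "https://" in body:
        features.add("body:has_url")

    # (3) Meta features: a Boolean combination (AND) of two other features.
    if {"header:subject_all_caps", "body:has_url"} <= features:
        features.add("meta:caps_subject_and_url")

    return features
```

For a message with the subject “BUY NOW” and a body containing “Buy Viagra” and a link, this sketch reports all four features; for an ordinary message it reports none.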
Once the features of an email message have been identified, a mathematical model 102 is used to apply “weights” to each of the features. Features which are known to be a relatively better indicator of spam are given a relatively higher weight than other features. The feature weights are determined via “training” of classification algorithms such as Naïve Bayes, Logistic Regression, Neural Networks, etc. Exemplary training techniques are described below with respect to FIG. 2.
The combined weights are then used to arrive at a spam “score.” If the score is above a specified threshold value, then the email is classified as spam and filtered out by message processing module 103. By contrast, if the score is below the specified threshold value, then the message processing module 103 forwards the email on to a user's account on the email server 104.
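The scoring step can be sketched as follows, assuming the simplest possible combination rule: the score is the sum of the weights of the detected features. The weights, the threshold, and the function names are illustrative assumptions, and the treatment of a score exactly equal to the threshold (here: not spam) is likewise an assumption, since only the above and below cases are specified.

```python
def spam_score(features, weights):
    """Sum the weights of the detected features; unknown features count as 0."""
    return sum(weights.get(feature, 0.0) for feature in features)


def classify(features, weights, threshold):
    """Return 'spam' if the score is above the threshold, otherwise 'ham'."""
    if spam_score(features, weights) > threshold:
        return "spam"
    # Assumption: a score equal to the threshold is treated as ham.
    return "ham"


# Illustrative weights only; real weights come from training.
example_weights = {"buy Viagra": 3.0, "visit online": 0.5}
```

With these example weights and a threshold of 2.0, a message containing both features scores 3.5 and is filtered, while a message containing only “visit online” scores 0.5 and is forwarded.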
2. Training
As mentioned above, the weights applied to features within the feature set are determined through a process known as “training.” Different algorithms use different methods of weight calculation including maximum entropy, error backtracking, etc. The spam model is regularly trained in order to assign weights to newly extracted features and update the weights associated with older features. Regular training helps to keep the weights of features updated according to the latest spam techniques in use.
FIG. 2 illustrates an exemplary training scenario which employs machine learning, a training technique developed by the assignee of the present patent application. See, e.g., Proofpoint MLX Whitepaper (2005), currently available at www.proofpoint.com. In this scenario, an email training corpus 200 containing known spam and ham messages is provided as a data source. A feature detection module 201 identifies features from the feature set within each email and provides this information to a machine learning module 202. The machine learning module 202 is also told whether each message is spam or ham. Using this information, the machine learning module 202 calculates a correlation between the features and spam messages, i.e., it determines how accurately certain features identify spam/ham. As mentioned above, various machine learning algorithms may be used such as Naïve Bayes, Logistic Regression, Neural Networks, etc.
The calculations performed by the machine learning module 202 are expressed in the form of a weight file 203 which associates a weight with each of the features in the feature set. For example, features which identify spam with relatively greater accuracy (e.g., “buy Viagra”) are provided with relatively larger weights than other features (e.g., “visit online”). The weight file is subsequently used to perform spam filtering operations as described above.
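As a rough illustration of how a weight file might be derived from a labeled corpus, the sketch below computes a smoothed log-odds weight per feature, in the spirit of Naïve Bayes. This is an assumed, simplified stand-in and not the training technique of the machine learning module 202; the function name, the corpus format, and the smoothing constant are hypothetical.

```python
import math
from collections import Counter


def train_weights(corpus, smoothing=1.0):
    """Compute a log-odds weight for each feature.

    corpus: iterable of (features, is_spam) pairs, where features is a set of
    feature names and is_spam is True for spam and False for ham.
    """
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for features, is_spam in corpus:
        if is_spam:
            n_spam += 1
            spam_counts.update(features)
        else:
            n_ham += 1
            ham_counts.update(features)

    weights = {}
    for feature in set(spam_counts) | set(ham_counts):
        # Smoothed estimate of how often the feature occurs in each class.
        p_spam = (spam_counts[feature] + smoothing) / (n_spam + 2 * smoothing)
        p_ham = (ham_counts[feature] + smoothing) / (n_ham + 2 * smoothing)
        # Positive weight: more typical of spam; negative: more typical of ham.
        weights[feature] = math.log(p_spam / p_ham)
    return weights
```

On a toy corpus in which “buy Viagra” appears only in spam and “visit online” appears in both classes, the first feature receives the larger weight, matching the ordering described above.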
3. Obfuscation Techniques
One well-known trick for fooling spam filters that rely on machine learning is to introduce random text or noise into the email text. For example, “Viagra” is spelled “V|@gr@” and “mortgage” is spelled “m_o_r_t_g-a-g-e.” The problem of obfuscation becomes quite cumbersome because there are virtually endless ways to obfuscate a given word using various combinations of tricks and characters.
These common tricks include, for example:
1) Substitution: Viagra→V|@gra
2) Addition: Viagra→Viaagraa
3) Deletion: Viagra→Vigra
4) Shuffling: Viagra→Vgiara
5) Segmenting: Viagra→V I A G R A
6) Combination: Viagra→V !@ gra
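The basic tricks listed above are simple string transformations, which is why so many variants of a single word are possible. A minimal sketch follows; the helper names and the substitution table are hypothetical, and the substitution helper replaces every occurrence of a mapped character.

```python
def substitute(word, table):
    """Substitution: replace characters using a look-alike table."""
    return "".join(table.get(ch, ch) for ch in word)


def add_char(word, index):
    """Addition: duplicate the character at the given index."""
    return word[:index] + word[index] + word[index:]


def delete_char(word, index):
    """Deletion: drop the character at the given index."""
    return word[:index] + word[index + 1:]


def move_char(word, src, dst):
    """Shuffling: move one character from position src to position dst."""
    chars = list(word)
    ch = chars.pop(src)
    chars.insert(dst, ch)
    return "".join(chars)


def segment(word, separator=" "):
    """Segmenting: insert a separator between all characters."""
    return separator.join(word)


# Assumed look-alike table for illustration.
LOOKALIKES = {"i": "|", "a": "@"}
```

For example, `substitute("Viagra", LOOKALIKES)` gives `V|@gr@`, `delete_char("Viagra", 2)` gives `Vigra`, and `move_char("Viagra", 3, 1)` gives `Vgiara`. Combination corresponds to chaining several of these helpers.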
There are at least two methods currently employed to counter the text obfuscation problem. The first method is to de-obfuscate the spam message as a preprocessing step of classification, that is, to convert an obfuscated word like “v|@graa” back to its original form “Viagra” so that the email filter can recognize the true words. The second method is to identify the obfuscated words in an email and use them as an indication of spam. So, if the word “Viagra” was intentionally written as “v|@graa,” this knowledge can be used by the spam classifier as a feature.
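Both methods can be illustrated with a deliberately naive sketch: look-alike characters are mapped back to letters, separators are stripped, and the result is matched against a small list of target words using the standard-library `difflib.get_close_matches`. The word list, the reverse substitution table, the similarity cutoff, and the function names are assumptions made for illustration; this is not the de-obfuscation technique evaluated in the prior research.

```python
from difflib import get_close_matches

# Assumed reverse look-alike table and target vocabulary.
REVERSE_LOOKALIKES = {"|": "i", "!": "i", "1": "l", "@": "a", "0": "o"}
TARGET_WORDS = ["viagra", "mortgage"]


def deobfuscate(token, cutoff=0.8):
    """Method 1: return the target word the token most likely hides, or None."""
    mapped = "".join(REVERSE_LOOKALIKES.get(ch, ch) for ch in token.lower())
    # Drop separators such as '_' and '-' that remain after mapping.
    letters = "".join(ch for ch in mapped if ch.isalpha())
    matches = get_close_matches(letters, TARGET_WORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def obfuscation_feature(token):
    """Method 2: emit a feature name if the token is an obfuscated target word."""
    word = deobfuscate(token)
    if word is not None and token.lower() != word:
        return "obfuscated:" + word
    return None
```

Here `deobfuscate("v|@graa")` yields `viagra`, and `obfuscation_feature("v|@graa")` yields the feature `obfuscated:viagra`, whereas the plainly spelled word produces no obfuscation feature.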
Converting obfuscated words to their true form seems like an excellent way of handling the problem, and previous research has reported a de-obfuscation accuracy of 94%. However, there are certain drawbacks that make this solution impractical for larger spam filters. First, this technique is extremely expensive. The previous study reports a de-obfuscation rate of 240 characters/sec using 70 characters, including the 26 letters of the alphabet, space, and all other standard ASCII characters, but excluding control characters. This rate of de-obfuscation is very slow for a preprocessing stage of a large-scale spam classifier, which may receive millions of emails daily, each of which may contain thousands of characters. In addition, in practice, significantly more than 70 characters, such as foreign language characters, are used in obfuscation, further exacerbating the problem.
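The scale of the problem can be made concrete with a short calculation. The rate of 240 characters/sec is the reported figure; the volume of one million emails per day and the length of 1,000 characters per email are assumed values, chosen as conservative readings of “millions” and “thousands.”

```python
SECONDS_PER_DAY = 24 * 60 * 60


def deobfuscation_cpu_days(emails_per_day, chars_per_email, chars_per_second=240):
    """Processing time, in days, needed to de-obfuscate one day's email."""
    total_chars = emails_per_day * chars_per_email
    return total_chars / chars_per_second / SECONDS_PER_DAY


# Assumed workload: 1,000,000 emails per day at 1,000 characters each.
cpu_days = deobfuscation_cpu_days(1_000_000, 1_000)
```

Under these assumptions, a single day of mail would need roughly 48 days of sequential processing, so the preprocessing stage alone would require on the order of 48 parallel workers just to keep up.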
Using a slow and computationally expensive preprocessing technique will increase both email delivery time and hardware requirements. This not only makes the solution more expensive for the end user but it also creates severe performance issues for service providers.
Taking the above constraints into consideration, another technique to counter obfuscation is to identify the obfuscated words in an email and use them as an indicator of spam. The idea here is simple: include all of the obfuscated words in the feature set of the spam classifier. Thus, the correct classification of the above email example uses t0night, R01ex, Viissit On!ine and Cl!!ck here in the feature set of the spam filter. Manually adding these words or using regular expressions to catch them is not only expensive to maintain but is also only a short-term solution, because the life of each obfuscated word is very short: spammers frequently change the obfuscation of a word.
A better solution would be an intelligent system driven by machine learning that can identify such words. Such a classifier has previously been used but has a low success rate of around 70%-75%. With respect to computational performance, detecting obfuscation provides better results than de-obfuscation as discussed below.
The foregoing discussion shows that there is a tradeoff between accuracy and computational performance in current solutions for obfuscation. Accordingly, improved techniques for detecting obfuscation are needed. Taking this tradeoff into consideration, the embodiments of the invention described below employ a model with high obfuscation detection accuracy and low computational complexity. Only such a model will fit the needs of a real-world, enterprise-class spam solution. In addition to the obfuscation detection model, a general architecture is described below for integrating auxiliary spam detection models within the context of a base spam detection model.