1. Field of the Invention
This invention relates to electronic message analysis and filtering. More particularly, the invention relates to a system and method for performing real-time message stream analysis on a series of email messages.
2. Description of the Related Art
“Spam” is commonly defined as unsolicited bulk e-mail, i.e., email that was not requested (unsolicited) and sent to multiple recipients (bulk). Although spam has been in existence for quite some time, the amount of spam transmitted over the Internet and corporate local area networks (LANs) has increased significantly in recent years. In addition, the techniques used by “spammers” (those who generate spam) have become more advanced in order to circumvent existing spam filtering products.
Spam represents more than a nuisance to corporate America. Significant costs are associated with spam including, for example, lost productivity and the additional hardware, software, and personnel required to combat the problem. In addition, many users are bothered by spam because it cuts into the time they have for reading legitimate email. Moreover, because spammers send spam indiscriminately, pornographic messages may show up in the email inboxes of workplaces and children, the latter being a crime in some jurisdictions. Recently, there has been a noticeable increase in spam advertising websites which contain child pornography. “Phishing” emails, another type of spam, request account numbers, credit card numbers and other personal information from the recipient.
1. Real-Time Spam Filtering
Various techniques currently exist for filtering spam. Specifically, FIG. 1 illustrates an exemplary spam filtering architecture which includes an email analyzer module 101, a mathematical model module 102 and a message processing module 103.
The email analyzer module 101 analyzes each incoming email message to determine whether the email message contains one or more spam-like “features.” Features used in content-based spam filters can be divided into three basic categories:
(1) Header information: Features that describe the information path followed by a message from its origin to its destinations as well as Meta information such as date, subject, Mail Transfer Agents (MTA), Mail User Agents (MUA), content types, etc.
(2) Message body contents: Features that describe the text contained in the body of an email, such as words, phrases, obfuscations, URLs, etc.
(3) Meta features: Boolean combinations of other features used to improve accuracy.
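The three feature categories above can be illustrated with a minimal sketch. The function name extract_features, the individual feature names, and the specific checks are hypothetical and chosen only for illustration; they are not taken from any particular filtering product. The sketch assumes a simple single-part text message.

```python
import re
from email import message_from_string
from typing import Dict


def extract_features(raw_message: str) -> Dict[str, bool]:
    """Return a map of hypothetical feature names to presence flags."""
    msg = message_from_string(raw_message)
    payload = msg.get_payload()
    # Assumption: only single-part text bodies are inspected in this sketch.
    body = payload if isinstance(payload, str) else ""
    subject = msg.get("Subject", "")

    features = {
        # (1) Header information
        "hdr_missing_date": msg.get("Date") is None,
        "hdr_subject_all_caps": subject.isupper(),
        # (2) Message body contents
        "body_contains_url": re.search(r"https?://", body) is not None,
        "body_phrase_buy_viagra": "buy viagra" in body.lower(),
    }
    # (3) Meta feature: a Boolean combination of other features
    features["meta_url_and_missing_date"] = (
        features["body_contains_url"] and features["hdr_missing_date"]
    )
    return features
```

In such a sketch, each feature is a simple Boolean flag, so a meta feature is just a logical expression over flags that have already been computed.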
Once the features of an email message have been identified, a mathematical model 102 is used to apply “weights” to each of the features. Features which are known to be a relatively better indicator of spam are given a relatively higher weight than other features. The feature weights are determined via “training” of classification algorithms such as Naïve Bayes, Logistic Regression, Neural Networks, etc. Exemplary training techniques are described below with respect to FIG. 2.
The combined weights are then used to arrive at a spam “score.” If the score is above a specified threshold value, then the email is classified as spam and filtered out by message processing module 103. By contrast, if the score is below the specified threshold value, then the message processing module 103 forwards the email to the user's account on the email server 104.
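The weighting and thresholding steps can be sketched as follows. This sketch assumes that the weights of the detected features are combined by simple summation, which is only one possible way of combining them; the names spam_score and classify, and the treatment of a score exactly equal to the threshold as non-spam, are likewise assumptions made for illustration.

```python
from typing import Dict


def spam_score(features: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Sum the weights of the features detected in a message.

    Assumption: features without a known weight contribute 0.0.
    """
    return sum(
        weights.get(name, 0.0) for name, present in features.items() if present
    )


def classify(
    features: Dict[str, bool], weights: Dict[str, float], threshold: float
) -> str:
    """Return "spam" if the score exceeds the threshold, else "deliver"."""
    # Assumption: a score exactly equal to the threshold is not treated as spam.
    return "spam" if spam_score(features, weights) > threshold else "deliver"
```

A message classified as "deliver" in this sketch corresponds to an email that is forwarded to the email server, while "spam" corresponds to an email that is filtered out.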
2. Training
As mentioned above, the weights applied to features within the feature set are determined through a process known as “training.” Different algorithms use different methods of weight calculation including maximum entropy, error backpropagation, etc. The spam model is regularly trained in order to assign weights to newly extracted features and update the weights associated with older features. Regular training helps to keep the weights of features updated according to the latest spam techniques in use.
FIG. 2 illustrates an exemplary training scenario which employs machine learning, a training technique developed by the assignee of the present patent application. See, e.g., Proofpoint MLX Whitepaper (2005), currently available at www.proofpoint.com. In this scenario, an email training corpus 200 containing known spam and ham (i.e., legitimate) messages is provided as a data source. A feature detection module 201 identifies features from the feature set within each email and provides this information to a machine learning module 202. The machine learning module 202 is also told whether each message is spam or ham. Using this information, the machine learning module 202 calculates a correlation between the features and spam messages, i.e., it determines how accurately certain features identify spam/ham. As mentioned above, various machine learning algorithms may be used such as Naïve Bayes, Logistic Regression, Neural Networks, etc.
The calculations performed by the machine learning module 202 are expressed in the form of a weight file 203 which associates a weight with each of the features in the feature set. For example, features which identify spam with relatively greater accuracy (e.g., “buy Viagra”) are provided with relatively larger weights than other features (e.g., “visit online”). The weight file is subsequently used to perform spam filtering operations as described above.
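By way of illustration only, the derivation of feature weights from a labeled corpus can be sketched as follows. The sketch uses a smoothed log-odds ratio in the style of Naïve Bayes as a stand-in for whatever algorithm a given system employs; the name train_weights, the corpus representation, and the add-one smoothing are assumptions and are not drawn from the referenced whitepaper.

```python
import math
from typing import Dict, Iterable, List, Set, Tuple


def train_weights(
    corpus: Iterable[Tuple[Set[str], bool]], feature_set: List[str]
) -> Dict[str, float]:
    """Compute one weight per feature from (detected_features, is_spam) pairs."""
    spam_total = 0
    ham_total = 0
    spam_hits = {f: 0 for f in feature_set}
    ham_hits = {f: 0 for f in feature_set}

    for detected, is_spam in corpus:
        if is_spam:
            spam_total += 1
        else:
            ham_total += 1
        hits = spam_hits if is_spam else ham_hits
        for f in detected:
            if f in hits:  # ignore features outside the feature set
                hits[f] += 1

    weights = {}
    for f in feature_set:
        # Add-one smoothing avoids division by zero and log(0).
        p_spam = (spam_hits[f] + 1) / (spam_total + 2)
        p_ham = (ham_hits[f] + 1) / (ham_total + 2)
        # Positive weight: the feature is seen more often in spam than in ham.
        weights[f] = math.log(p_spam / p_ham)
    return weights
```

In this sketch, a feature that appears only in spam messages receives a larger weight than a feature that appears equally often in spam and ham, and the returned mapping plays the role of the weight file.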
Typically, the training process described above is performed periodically (e.g., once a day) at a central spam analysis facility and the results of the training process are pushed out to customer sites (i.e., sites where the spam engine shown in FIG. 1 is executed). Consequently, a delay may exist between the time a new spam campaign is initiated and the time the new definitions needed to identify the spam campaign are sent to the customer site. As such, new, more dynamic techniques for identifying spam campaigns in real-time (or near real-time) are needed.