This invention pertains to detecting spam e-mail.
Spam e-mail is a significant and growing nuisance. As used herein, “spam” is any e-mail that is sent to a computer user without the user's consent.
One known method of detecting spam is signature-based analysis of e-mail, since spam e-mail can often be identified with a signature. A signature is some feature that occurs in a specific sample but is unlikely to occur in other samples in a given population. Signature analysis can consider a whole e-mail message at once, specific ranges of bytes within the message, the presence or absence of certain fields anywhere within the message or within certain byte ranges, etc. A signature can be represented as a hash of an entire message or of certain portions of it. Statistical methods such as information gain can be used to extract significant features that distinguish one sample population from another.
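As a minimal sketch of a hash-based signature, the following computes a digest over a whole message or over a chosen byte range and checks it against a set of known spam signatures. The function names and the choice of SHA-256 are assumptions made for illustration; the text above does not prescribe a particular hash function.

```python
import hashlib
from typing import Optional, Set


def signature(message: bytes, start: int = 0, end: Optional[int] = None) -> str:
    """Hash the whole message, or only the byte range [start, end)."""
    # SHA-256 is an illustrative choice, not one named in the text.
    return hashlib.sha256(message[start:end]).hexdigest()


def matches_known_signature(
    message: bytes, known: Set[str], start: int = 0, end: Optional[int] = None
) -> bool:
    """Return True if the message's signature is in the known spam set."""
    return signature(message, start, end) in known
```

Hashing only a byte range lets two messages that differ elsewhere share a signature, at the cost of depending on the hashed portion staying at a fixed position.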
An example signature might be a specific subject line, such as “Lose Weight While You Sleep!” Another example signature might attempt to catch a whole class of spam messages by looking for the presence of a phone number or URL used in a set of spam messages. Spam senders often change the content or layout of a message, but they usually want the recipient to do something that involves calling a phone number or visiting a particular web page; these items can make for good signature targets. Many other types and examples of signature targets are known to those of ordinary skill in the relevant art. Likewise, various possible signature analysis optimizations are known.
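The phone-number and URL targets described above might be checked along the following lines. This is only a sketch: the regular expressions are deliberately simplified, and the function names and the normalization steps are hypothetical rather than taken from any particular detection system.

```python
import re
from typing import Set

# Simplified patterns, assumed for illustration only.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def extract_targets(body: str) -> Set[str]:
    """Collect normalized URLs and phone numbers found in a message body."""
    # Lowercase URLs and drop trailing punctuation so layout changes do not matter.
    urls = {u.lower().rstrip(".,;!?)") for u in URL_RE.findall(body)}
    # Keep only the digits of each phone number.
    phones = {re.sub(r"\D", "", p) for p in PHONE_RE.findall(body)}
    return urls | phones


def contains_known_target(body: str, known_targets: Set[str]) -> bool:
    """Return True if the body mentions any known spam URL or phone number."""
    return not extract_targets(body).isdisjoint(known_targets)
```

Because the targets are normalized before comparison, a sender who changes the wording or layout of a message but keeps the same phone number or web page is still caught.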
Because of the sheer volume of spam and the use of obfuscation techniques, producing antispam signatures can be complex. Sophisticated back-end infrastructures called spam traps, which gather spam and automatically analyze it to extract good signatures, are often used to create a viable, accurate, and up-to-date signature set. Once a spam trap identifies new signatures, the spam trap can provide them to a signature-analysis-based spam detection system. Because it takes time for a trap manager to identify new signatures and make them available to a spam detection system, there is often a delay before detection systems can effectively detect spam containing these new signatures. Thus, detection systems can either process e-mail upon receipt, without access to the latest signatures, or hold received e-mail for a period of time in order to wait for the receipt of new signatures, thereby causing a processing delay.
Another known methodology for spam detection involves the use of neural networks and similar machine learning techniques (e.g., Bayesian networks, support vector machines). Machine learning systems are sometimes trained to recognize spam and legitimate e-mail. These approaches require that the network be trained against both spam and non-spam. The training steps involve analyzing the words that occur in the samples and their frequency of occurrence. Words that are common to both sample sets are eliminated, leaving a list of words common to spam and a list of words common to non-spam. Using the word list common to spam and the spam sample set, the network is trained to recognize spam. Similarly, using the word list common to non-spam and the non-spam sample set, the network is trained to recognize non-spam. The result is a network which, given a sample, outputs a number that represents the likelihood of the message being spam. The detection system is then configured to treat samples with a value over a given threshold as spam.
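The word-list preparation and thresholding steps described above can be sketched as follows. Note the assumption: no neural network is trained here. A simple word-overlap ratio stands in for the trained network's output so that the example stays self-contained, and all function names are hypothetical.

```python
import re
from typing import Iterable, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Split text into a set of lowercase words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def build_word_lists(
    spam_samples: Iterable[str], ham_samples: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Eliminate words common to both sample sets; return (spam_only, ham_only)."""
    spam_words = set().union(*(tokenize(s) for s in spam_samples))
    ham_words = set().union(*(tokenize(s) for s in ham_samples))
    common = spam_words & ham_words
    return spam_words - common, ham_words - common


def spam_score(message: str, spam_only: Set[str], ham_only: Set[str]) -> float:
    """Stand-in for the network output: share of distinctive words that are spam words."""
    words = tokenize(message)
    s, h = len(words & spam_only), len(words & ham_only)
    return 0.5 if s + h == 0 else s / (s + h)


def classify(message: str, spam_only: Set[str], ham_only: Set[str],
             threshold: float = 0.5) -> bool:
    """Treat a message as spam when its score exceeds the threshold."""
    return spam_score(message, spam_only, ham_only) > threshold
```

A real system would replace `spam_score` with the output of a model trained on the two word lists and sample sets; only the list construction and the final threshold comparison follow the description directly.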
When using neural networks to detect spam, there is always a tradeoff between detection and false positive rates. If the neural network is configured to identify spam aggressively, the neural network will generally incorrectly identify a certain number of legitimate e-mails as spam. On the other hand, if the neural network is configured to minimize false positives, it will typically fail to identify a certain number of spam messages, resulting in false negatives.
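The tradeoff can be illustrated by measuring detection and false positive rates at two different thresholds. The scores below are invented purely for illustration and do not come from any real classifier or data set; the function name is likewise hypothetical.

```python
from typing import List, Tuple

# (score, is_spam) pairs; hypothetical values chosen only to show the effect.
SCORED: List[Tuple[float, bool]] = [
    (0.95, True), (0.80, True), (0.55, True), (0.30, True),
    (0.60, False), (0.40, False), (0.10, False), (0.05, False),
]


def rates(scored: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float]:
    """Return (detection rate, false positive rate) at the given threshold."""
    spam_scores = [s for s, is_spam in scored if is_spam]
    ham_scores = [s for s, is_spam in scored if not is_spam]
    # Detection rate: fraction of spam scored above the threshold.
    detection = sum(s > threshold for s in spam_scores) / len(spam_scores)
    # False positive rate: fraction of legitimate mail scored above the threshold.
    false_positive = sum(s > threshold for s in ham_scores) / len(ham_scores)
    return detection, false_positive


aggressive = rates(SCORED, 0.2)    # low threshold: catches more spam
conservative = rates(SCORED, 0.7)  # high threshold: fewer false positives
```

With these made-up scores, the low threshold detects every spam sample but flags half of the legitimate mail, while the high threshold flags no legitimate mail but misses half of the spam.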
What is needed are methods, computer-readable media and systems that allow spam detection with the accuracy of signature-based analysis, without the latency inherent therein, and without the tradeoff between false positives and detection rate inherent in machine learning system detection.