Spam and email carrying malicious attachments (e.g., viruses, worms, etc.) are a serious computer security problem. Batches of spam are often sent out en masse, frequently with slight variations, either in order to defeat spam filters or as a result of production or transmission particulars and the like. Once a specific spam email message has been identified, it would be useful to be able to detect similar messages that are not identical, but are part of the same spam attack.
A known method for determining general document similarity involves extracting n-grams from the documents in question, comparing the n-grams, and determining the percentage of n-grams that the documents have in common.
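As an illustration of the general n-gram approach described above, the following is a minimal sketch and not any particular known implementation. It assumes word-level n-grams, whitespace tokenization, and a percentage computed as shared n-grams over all distinct n-grams in either document. The function names word_ngrams and ngram_similarity are hypothetical.

```python
def word_ngrams(text, n=3):
    """Return the set of word-level n-grams in text (assumed tokenization: whitespace)."""
    tokens = text.lower().split()
    # A document shorter than n tokens yields an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(doc_a, doc_b, n=3):
    """Percentage of distinct n-grams that the two documents have in common.

    Assumption: "in common" is measured against the union of both n-gram sets.
    """
    grams_a = word_ngrams(doc_a, n)
    grams_b = word_ngrams(doc_b, n)
    if not grams_a and not grams_b:
        return 0.0
    shared = grams_a & grams_b
    total = grams_a | grams_b
    return 100.0 * len(shared) / len(total)


if __name__ == "__main__":
    original = "click here to claim your free prize today"
    variant = "click here to claim your free reward today"
    print(ngram_similarity(original, variant))
```

Other choices are equally plausible, for example character-level n-grams or measuring the shared n-grams against only one of the two documents.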
Feature selection is one way to improve the similarity calculation. One approach to feature selection is to eliminate parts of the document that are not considered to be useful for the purpose of comparing messages. A common form of feature selection is to use a list of “stop words,” such as “the,” “and,” “or,” and similar very common words that are found across documents. By eliminating such words from the comparison, a more useful measure of document similarity can be made.
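Stop-word feature selection of this kind can be sketched as follows. This is an illustrative example only: the three-word stop list is taken from the words named above, while the whitespace tokenization, the use of word bigrams, and the function names remove_stop_words and filtered_similarity are assumptions.

```python
# Tiny illustrative stop list; a real list would be far longer.
STOP_WORDS = frozenset({"the", "and", "or"})


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop any token in the stop list."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


def filtered_ngrams(text, n=2, stop_words=STOP_WORDS):
    """Build word n-grams only from the tokens that survive stop-word removal."""
    tokens = remove_stop_words(text, stop_words)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def filtered_similarity(doc_a, doc_b, n=2, stop_words=STOP_WORDS):
    """Percentage of distinct n-grams shared after stop-word removal."""
    grams_a = filtered_ngrams(doc_a, n, stop_words)
    grams_b = filtered_ngrams(doc_b, n, stop_words)
    if not grams_a and not grams_b:
        return 0.0
    return 100.0 * len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    # The two texts differ only in stop words, so they compare as identical.
    print(filtered_similarity("buy the pills now", "buy pills and now"))
```

Because the stop words are removed before the n-grams are formed, two documents that differ only in such words yield the same feature set.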
However, in the special case of spam email messages, the features that are desirable to eliminate are unlikely to be a simple list of common words. They are more likely to be artifacts of how the message was produced or transmitted, including both textual and graphical artifacts. To the extent that such artifacts are present in email messages and become part of the set of features compared, they make the similarity measure less useful and increase the likelihood of false positives.
What is needed are methods, systems and computer readable media for determining email message similarity, taking into account the specialized feature selection appropriate to email messages.