Current statistical spam detection techniques rely heavily on their ability to find known words during classification of electronic messages. The authors of spam emails have become aware of this, and often include nonsense words in their messages. The use of nonsense words to spoof spam detection takes two primary forms. The first is the insertion of a small number (e.g., one or two) of nonsense words into emails. This is used to thwart simple hash detection of duplicate copies of a single message being sent to many users at one internet service provider. By inserting different nonsense words into each copy of the message, simple hash detection routines are not able to determine that the messages are duplicates. This form of nonsense word insertion is referred to as a “hash buster.” The second form consists of inserting a larger number of nonsense words into emails, where the words as a group cause misclassification of the overall message.
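The hash buster effect described above can be illustrated with a short sketch. The choice of SHA-256, the helper name `message_digest`, and the sample message text are assumptions made for illustration only; the text does not say which hash function a duplicate detector would use.

```python
import hashlib


def message_digest(body: str) -> str:
    """Return a hex digest of the message body (SHA-256 is an assumed choice)."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


# Hypothetical spam body sent to many users at one provider.
base = "Refinance your mortgage today at a low rate."

# Each copy carries a different nonsense word (the "hash buster").
copy_a = base + " xqzvkt"
copy_b = base + " plomwuf"

# Identical copies hash to the same value, so simple duplicate detection works.
identical_match = message_digest(base) == message_digest(base)

# Copies with different nonsense words hash to different values, so a
# detector that compares whole-message digests no longer sees duplicates.
digests = {message_digest(copy_a), message_digest(copy_b)}
print(identical_match, len(digests))
```

Because the digest covers the whole body, a single differing word is enough to make each copy look unique to this kind of detector.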
Spam classification engines analyze the content of email messages and attempt to determine which emails are spam based on various statistical techniques, such as Bayesian analysis. Bayesian spam filtering is based on established probabilities of specific words appearing in spam or legitimate email. For example, the nonsense words described above, as well as certain words such as “Viagra”, “Refinance”, “Mortgage”, etc., frequently appear in spam, and yet rarely or less frequently appear in legitimate email. Thus, the presence of such terms increases the probability of an email being spam. A Bayesian spam classification engine has no inherent knowledge of these probabilities, but instead establishes them by being trained on a set of email messages.
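As a rough illustration of how such per-word probabilities might be established from training data, the following is a minimal sketch, not the method of any particular engine. The function name `train`, the add-one smoothing, the equal weighting of the two classes, and the tiny training set are all assumptions introduced here.

```python
from collections import Counter


def train(messages):
    """Estimate, for each word, how strongly it indicates spam.

    messages: iterable of (text, is_spam) pairs.
    Returns a dict mapping word -> value in (0, 1); values above 0.5 mean
    the word was seen relatively more often in spam than in legitimate mail.
    """
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for text, is_spam in messages:
        words = set(text.lower().split())  # count each word once per message
        if is_spam:
            spam_counts.update(words)
            n_spam += 1
        else:
            ham_counts.update(words)
            n_ham += 1

    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        # Add-one smoothing (an assumed choice) avoids zero probabilities.
        p_word_given_spam = (spam_counts[word] + 1) / (n_spam + 2)
        p_word_given_ham = (ham_counts[word] + 1) / (n_ham + 2)
        probs[word] = p_word_given_spam / (p_word_given_spam + p_word_given_ham)
    return probs


# Hypothetical training set: two spam and two legitimate messages.
training = [
    ("refinance your mortgage now", True),
    ("cheap viagra now", True),
    ("meeting notes attached", False),
    ("lunch at noon", False),
]
probs = train(training)
print(probs["mortgage"] > 0.5, probs["meeting"] < 0.5)
```

The classifier starts with no knowledge of any word; every value in `probs` comes from the training messages alone, which is why the composition of the training set matters so much.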
When classifying documents using a statistical method, such as the Bayesian method, the classification output is only as good as the input. This leads to a problem when a statistical classifier is presented with a message in a language on which the classifier was not trained (for example, when a classifier trained on English is attempting to classify a German document). More specifically, it has become common for spammers to insert words or phrases in foreign languages in spam emails, instead of or in addition to nonsense words. This often results in certain common foreign language words (e.g., “el”, “los”, “der”, “die”, “und”, etc.) becoming associated with spam by classification engines. Because these words appear in many spam emails but virtually no legitimate emails written in English, a Bayesian classification engine trained on an English language data set will interpret their presence in an email message as a strong indication of the message comprising spam.
In the past, the issue of content in a non-trained language has been addressed in two different ways. One solution is to use a secondary classifier that is capable of determining the language of a document. The input to the Bayesian spam filter is then limited to content in languages on which it has been trained. The second solution is for the Bayesian filter to attempt to classify every document, regardless of language.
The first solution is expensive, both in terms of dollars and computing efficiency. In order to classify each document by language, expensive language classification engines must be licensed or built simply to determine whether a spam engine should inspect an incoming message. Furthermore, classifying each incoming email with an additional engine is time-consuming and slows down the spam filtering process.
In the context of spam, the latter solution typically leads to extremely high false positive rates when filtering emails in languages on which the Bayesian filter has not been trained. As noted above, very common words in non-trained foreign languages likely appeared in the training data only in spam. For example, when training on an English email set, words like “und” and “der” appear frequently in spam and almost never in legitimate email. However, when processing German email, these words appear in almost every message, whether spam or legitimate. Thus, a classifier trained on English but not German would classify all or most German email messages as spam.
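The false positive effect can be reproduced on a toy scale. The sketch below assumes a simple naive Bayes scorer with add-one smoothing, equal class priors, and unseen words ignored; the function names and all sample messages are hypothetical and chosen only to mirror the “und” and “der” example.

```python
import math
from collections import Counter


def train(messages):
    """Count, per class, the number of messages containing each word."""
    spam, ham = Counter(), Counter()
    n_spam = n_ham = 0
    for text, is_spam in messages:
        words = set(text.lower().split())
        if is_spam:
            spam.update(words)
            n_spam += 1
        else:
            ham.update(words)
            n_ham += 1
    return spam, ham, n_spam, n_ham


def spam_probability(text, model):
    """Combine per-word evidence as log-odds (equal priors assumed)."""
    spam, ham, n_spam, n_ham = model
    log_odds = 0.0
    for word in set(text.lower().split()):
        if word not in spam and word not in ham:
            continue  # words never seen in training carry no evidence
        p_spam = (spam[word] + 1) / (n_spam + 2)
        p_ham = (ham[word] + 1) / (n_ham + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1.0 / (1.0 + math.exp(-log_odds))


# English-only training set; the spam contains inserted German words.
model = train([
    ("cheap pills und der offer", True),
    ("refinance now der und die", True),
    ("win money und die der", True),
    ("meeting notes for the project", False),
    ("lunch at noon on friday", False),
    ("please review the attached report", False),
])

german_legit = spam_probability("der bericht und die notizen", model)
english_legit = spam_probability("meeting notes on friday", model)
print(german_legit > 0.9, english_legit < 0.5)
```

In this sketch the legitimate German message scores as spam purely because of “der”, “und”, and “die”; its content words were never seen in training and so contribute nothing to offset them.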
It would be desirable to be able to avoid such an excessive false positive rate when processing content in a language on which the Bayesian filter has not been trained, without having to use an expensive secondary classifier that is capable of determining the language of a document.