Current statistical spam detection techniques rely heavily on their ability to find known words during classification of electronic messages. The authors of spam emails have become aware of this and have started to include nonsense words in their messages. The use of nonsense words to spoof spam detection takes two primary forms. The first is the insertion of a small number (e.g., one or two) of nonsense words into emails. This technique is used to thwart simple hash detection of duplicate copies of a single message being sent to many users at one internet service provider. When different nonsense words are inserted into each copy of the message, each copy produces a different hash value, so the hash detection fails to recognize that the messages are duplicates. This form of nonsense word insertion is referred to as a “hash buster.” The second form consists of inserting a large number of nonsense words into emails, where the words as a group cause misclassification of the overall message.
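By way of illustration only, the effect of a hash buster on simple duplicate detection may be sketched as follows. The function name `message_digest`, the use of SHA-256, and the sample message text are assumptions chosen for this sketch; they are not taken from any particular anti-spam product.

```python
import hashlib


def message_digest(body: str) -> str:
    """Return a hex digest of the exact message text (hypothetical helper)."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


# Hypothetical spam message sent to many users at one provider.
base = "Lowest rates available now. Click here."

# Two copies of the same message, each with a different nonsense word appended.
copy_a = base + " xqzvt"
copy_b = base + " plorfh"

# Exact duplicates produce the same digest, so simple hash detection catches them.
duplicates_match = message_digest(base) == message_digest(base)

# The inserted nonsense words change the digest, so the copies no longer match.
busted_copies_match = message_digest(copy_a) == message_digest(copy_b)

print(duplicates_match, busted_copies_match)
```

In this sketch the unmodified copies compare equal, while the two copies carrying different nonsense words do not, even though a human reader would regard them as the same message.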
Statistical spam detection relies heavily not only on the ability to find known words, but also on the ability to classify them. For example, specific words, combinations of words and frequency of occurrence of words tend to be associated with spam emails. Anti-spam engines can be tricked into misclassifying spam messages as legitimate when the author replaces characters in these “spammy” words with other characters that have a similar visual appearance. For example, the characters ‘â Ô ç β Ǧ Ǿ Ý Z Θ X K III’ look like ‘a O C B G O Y Z O X K W’ respectively. Since the substituted characters are similar in visual appearance, a human reader can still discern the content of the message.
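As a minimal sketch of this weakness, consider a naive detector that counts exact matches against a list of known spammy words. The word list `SPAMMY_WORDS`, the function `count_spammy_tokens`, and the sample messages are hypothetical and are used here only to show how a single substituted character can hide a word from such a detector.

```python
# Hypothetical list of words a detector has learned to associate with spam.
SPAMMY_WORDS = {"mortgage", "refinance"}


def count_spammy_tokens(text: str) -> int:
    """Count whitespace-separated tokens that exactly match a known spammy word."""
    tokens = text.lower().split()
    return sum(1 for token in tokens if token.strip(".,!?") in SPAMMY_WORDS)


plain = "Refinance your mortgage today!"
# Same message with 'a' replaced by the visually similar 'â' in both spammy words.
disguised = "Refinânce your mortgâge today!"

plain_hits = count_spammy_tokens(plain)
disguised_hits = count_spammy_tokens(disguised)

print(plain_hits, disguised_hits)
```

Under these assumptions the plain message yields two matches and the disguised message yields none, although both read the same to a human recipient.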
What is needed are methods, computer readable media and computer systems for allowing detection of undesirable emails, even where nonsense words and visually similar characters have been inserted.