The advent of global communications networks such as the Internet has presented commercial opportunities for reaching vast numbers of potential customers. Electronic messaging, and particularly electronic mail (“e-mail”), is becoming increasingly pervasive as a means of disseminating unwanted advertisements and promotions (also denoted as “spam”) to network users.
The Radicati Group, Inc., a consulting and market research firm, estimates that as of August 2002, two billion junk (or spam) e-mail messages are being sent every day. This number is expected to triple every two years. More and more people are becoming inconvenienced and offended by the junk e-mail that they receive. As such, junk e-mail is now, or soon will become, the principal perceived threat to trustworthy computing.
A key technique utilized for thwarting junk e-mail is content filtering. A proven technique for filtering is based upon a machine learning approach. Machine learning filters assign to an incoming message a probability that the message content is junk. In this approach, content features are extracted from two classes of example e-mail (i.e., junk and non-junk e-mails), and a learning filter is applied probabilistically to discriminate between the two classes. Since many of the features of e-mail are related to content (e.g., words and phrases in the subject and body), these filters are also commonly referred to as “content-based filters”.
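By way of illustration only, a content-based filter of this general kind might be sketched as follows. The sketch assumes a naive Bayes classifier with add-one smoothing over word tokens; the text above does not name a particular learning method, so that choice, the function names (tokenize, train, junk_probability), and the tiny training set in the usage note are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Extract lowercase word features from a message (illustrative tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(junk_examples, good_examples):
    """Count word features in the two classes of example e-mail."""
    junk_counts = Counter(t for m in junk_examples for t in tokenize(m))
    good_counts = Counter(t for m in good_examples for t in tokenize(m))
    total = len(junk_examples) + len(good_examples)
    return {
        "junk": junk_counts,
        "good": good_counts,
        "vocab": set(junk_counts) | set(good_counts),
        "prior_junk": len(junk_examples) / total,
    }


def junk_probability(model, message):
    """Return the estimated probability that the message is junk.

    Assumption: naive Bayes with add-one smoothing; words never seen in
    training are ignored.
    """
    vocab_size = len(model["vocab"])
    n_junk = sum(model["junk"].values())
    n_good = sum(model["good"].values())
    log_junk = math.log(model["prior_junk"])
    log_good = math.log(1.0 - model["prior_junk"])
    for tok in tokenize(message):
        if tok not in model["vocab"]:
            continue
        log_junk += math.log((model["junk"][tok] + 1) / (n_junk + vocab_size))
        log_good += math.log((model["good"][tok] + 1) / (n_good + vocab_size))
    # Convert the two log scores into a probability for the junk class.
    return 1.0 / (1.0 + math.exp(log_good - log_junk))
```

For example, after model = train(["free money now", "win free prize"], ["meeting at noon", "lunch tomorrow at noon"]), the call junk_probability(model, "free prize") yields a value above 0.5, whereas junk_probability(model, "meeting tomorrow") yields a value below 0.5. Because the score depends entirely on the extracted word features, any change that prevents those words from being extracted also changes the score.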
The goal of a spammer is to make changes in (or “cloak”) their message content so that junk filters are unable to detect that the e-mail is spam. This is often done to prevent the detection of phrases or words commonly associated with spam content. Spammers also frequently make small changes to individual e-mail messages when executing mass mailings on the order of, for example, 100,000 messages or more. Making subtle changes to individual messages in a mass mailing significantly reduces the probability that junk filters will detect that the same message is being sent to large groups of users.
The following are examples of techniques used by spammers. Their purpose is not necessarily to mislead the recipient, since the tricks are removed or resolved before the recipient perceives the message, but to prevent junk filters from successfully matching words, phrases, or even the entire e-mail message: HTML comments, which are added to the HTML version of the message body, cause problems for the spam filter, and are removed before the e-mail message is viewed by the reader; declarative decoration content, which is content (e.g., HTML tags) that has little or no effect on the e-mail text as displayed, yet changes the message; encoding, where the message text is changed by using special types of encoding, e.g., foreign language characters; and HTML positioning, where the e-mail message is created such that the order of the text in the message differs from the order ultimately perceived by the user, since HTML can be used to change the text position.
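As a rough illustration of three of these tricks, the sketch below shows a phrase that a literal text match fails to find in the raw message body, and that becomes visible once HTML comments, empty tags, and character references are resolved. The sample message, the regular expressions, and the function name resolve_markup are hypothetical. Stripping tags with a regular expression is a simplification rather than a full HTML parser, and HTML positioning is not handled.

```python
import html
import re

# Assumed, simplified patterns; a production system would use a real HTML parser.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def resolve_markup(body):
    """Return the text roughly as a reader would perceive it."""
    text = COMMENT_RE.sub("", body)   # HTML comments are never displayed
    text = TAG_RE.sub("", text)       # decorative tags do not alter the words
    return html.unescape(text)        # decode character references such as &#99;


# Hypothetical obfuscated body: a comment splits "FREE", an empty tag pair
# splits "prize", and a numeric character reference hides the "c" in "click".
raw = "Claim your FR<!-- x1 -->EE pr<b></b>ize, &#99;lick now"

print("free prize" in raw.lower())                  # False
print("free prize" in resolve_markup(raw).lower())  # True
```

A filter that matches on the raw body sees none of the tell-tale words, while the reader's mail client displays them intact.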
What is needed is a technique that solves the aforementioned problem by resolving obfuscating content of messages prior to filtering.