Email spam, also known as bulk or junk email, usually involves sending nearly identical unsolicited email messages to numerous recipients. It has been estimated to cost US businesses over a billion dollars per year. Because the cost of sending spam to a large number of recipients is small, spammers have no incentive to limit their mailings, either in size or to people who might be interested in receiving the email message. It is the recipient who bears the cost of this junk mail. Some estimates indicate that 80-85% of incoming email messages are spam.
Spammers go to great lengths to remain undetected. They constantly change the very things traditional spam filtering systems look at to determine whether an email message is spam: their sender name, IP address, network address and so on. They can set up temporary, disposable accounts at numerous Internet service providers. Once an account has become stale or has been detected, they quickly move to a new account from which to send email messages, thereby changing the sender, IP address and network associated with the email message. They can make an email message appear to originate from any email address by spoofing either the email address or the IP address. They can forge delivery headers so that the email message seems to come from a legitimate email server and network. They can take over networks of infected PCs, creating zombies or botnets that send out the email messages, so that the spammers and their own IP addresses are no longer the ones sending them. They can disguise the email message content by misspelling words that a program would otherwise detect easily, while leaving the text understandable to a human being.
There are currently at least seven different approaches to filtering spam: IP blacklists (e.g., RBLs), rule-based (e.g., SpamAssassin), Bayesian (e.g., Netscape, PopFile), reputation (e.g., Ironport), decoy (e.g., Symantec), collaborative checksum (e.g., Hotmail, Cloudmark) and a cocktail combination of some or all of the above (e.g., Ciphertrust).
Most existing email anti-spam filtering systems focus on the sender or the content of the incoming email message. These systems use spam algorithms to calculate a message score for an email message, and the message score is then used to determine whether the email message is ham or spam. The algorithms are trained by one of two methods. They can be trained with a training set of email messages that have been pre-categorized by a human or by some other system. Alternatively, they can be self-trained once they have been bootstrapped, i.e., they feed back their own message score and ham or spam determination for training. This feedback is then used to update the list of senders and various message attributes, such as the reputation of URLs in the email message, and it can cause positive feedback errors in which an early misclassification reinforces itself. These systems are not very accurate.
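The self-training loop described above can be illustrated with a minimal Python sketch. It is not the algorithm of any particular product: the class `SelfTrainedSenderFilter`, the constant `SPAM_THRESHOLD`, the smoothed score formula and the address `news@example.com` are all assumptions made for illustration, and the filter looks only at the sender. The sketch shows how a single bootstrap label, once fed back, pushes the sender's score further in the same direction on every later message.

```python
from dataclasses import dataclass, field

SPAM_THRESHOLD = 0.5  # hypothetical cutoff, not taken from any real product


@dataclass
class SelfTrainedSenderFilter:
    """Toy filter that scores a message using only its sender's history."""

    # Maps sender -> [spam_count, ham_count].
    counts: dict = field(default_factory=dict)

    def score(self, sender: str) -> float:
        # Smoothed fraction of this sender's past messages labeled spam;
        # an unknown sender scores exactly 0.5.
        spam, ham = self.counts.get(sender, (0, 0))
        return (spam + 1) / (spam + ham + 2)

    def classify(self, sender: str) -> str:
        return "spam" if self.score(sender) > SPAM_THRESHOLD else "ham"

    def train(self, sender: str, label: str) -> None:
        entry = self.counts.setdefault(sender, [0, 0])
        entry[0 if label == "spam" else 1] += 1

    def classify_and_feed_back(self, sender: str) -> str:
        # Self-training: the filter's own verdict becomes training data.
        label = self.classify(sender)
        self.train(sender, label)
        return label


flt = SelfTrainedSenderFilter()
flt.train("news@example.com", "spam")  # bootstrap label, possibly mistaken
before = flt.score("news@example.com")
labels = [flt.classify_and_feed_back("news@example.com") for _ in range(5)]
after = flt.score("news@example.com")
print(labels, before, after)
```

In this sketch no new evidence arrives after the bootstrap label, yet the score keeps rising because each verdict is counted as if it were an independent observation.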
The best systems seldom achieve better than 98% effectiveness at filtering out spam. Some rely on user input to classify spam, and some require extensive training to be accurate. These systems cannot get much better because they focus on the sender or the content of the email message. The criteria these systems check when monitoring incoming email messages are the very things that spammers change, hide and constantly try to defeat.
For example, suppose a spammer sends out a so-called micro spam of exactly 1,000 email messages. The spammer ensures that the content passes filtering programs such as SpamAssassin and sends it from a fresh IP address to a relatively clean list of real users. Such an email message passes all of these systems.
Consider another example in which a legitimate company sends email messages from its own email system. In this hypothetical example, assume an email message from a reputable company is sent out using the company's domain name. None of the sending IP addresses is on any blacklist. The content is clean and does not violate any SpamAssassin rules. This company does two mailings at the same time, both with identical content and both from the same IP addresses. The first is to one million people who registered on the company's website. The second is to a list of one million people, bought from a spammer, that has been cleansed of any email addresses found in a decoy database. This means the addresses all belong to real, actual users. The company sends out the email messages from multiple IP addresses so that the entire two-million-piece mailing takes place in less than one minute. Typically, none of the filtering systems will catch the email messages sent in the second, spam mailing. That is because the only difference between the two mailings is the recipient list that was used, and none of the filtering systems takes that into account.
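The point of this example can be made concrete with a short Python sketch. Everything in it is hypothetical: the `Mailing` record, the `sender_content_score` function, the blacklist and phrase list, and the example addresses stand in for whatever a real sender-and-content filter would use. The sketch shows only that a score computed from sender and content alone is necessarily identical for two mailings that differ in nothing but their recipient lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mailing:
    sender_domain: str
    sender_ips: tuple
    content: str
    recipients: tuple


# Hypothetical filter data; real systems use far larger lists and rule sets.
IP_BLACKLIST = frozenset({"203.0.113.66"})
BANNED_PHRASES = ("free money", "act now")


def sender_content_score(mailing: Mailing) -> int:
    """Count blacklist and phrase hits; the recipient list is never read."""
    score = sum(ip in IP_BLACKLIST for ip in mailing.sender_ips)
    text = mailing.content.lower()
    score += sum(phrase in text for phrase in BANNED_PHRASES)
    return score


common = dict(
    sender_domain="example.com",
    sender_ips=("192.0.2.10", "192.0.2.11"),
    content="Our spring catalogue is now available.",
)
# Same sender, same content; only the recipient list differs.
opt_in = Mailing(recipients=("alice@example.org",), **common)
purchased = Mailing(recipients=("bob@example.net",), **common)

print(sender_content_score(opt_in), sender_content_score(purchased))
```

Because `sender_content_score` never reads `recipients`, no tuning of its blacklist or phrase list can separate the opt-in mailing from the purchased-list mailing.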
A Bayesian system uses recipient reputations for email filtering only to a small extent. With Bayesian systems, message token reputations are established from training email messages that are labeled as either spam or ham (good email messages that are wanted or solicited by the recipient). Recipients are treated like any other token in the email message when determining a score for the email message, and a combination of the scores of all the tokens is used. Bayesian systems typically are “per user” systems, so the filter for a specific user's email messages will only encounter recipients of email messages that are also sent to that specific user. What is needed is a more reliable email filtering system.
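As a rough illustration of how a recipient can act as just another token, the following Python sketch implements a very small naive-Bayes-style filter. It is an assumption-laden simplification and not the method of any named product: the class `TokenBayesFilter`, the `rcpt:` token prefix, the add-one smoothing, the equal-prior assumption and the log-odds combination are all choices made here for illustration.

```python
import math
from collections import Counter


class TokenBayesFilter:
    """Toy per-user filter; recipients are tokens like any other word."""

    def __init__(self) -> None:
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.spam_msgs = 0
        self.ham_msgs = 0

    @staticmethod
    def tokens(recipients, body):
        # Prefix recipient addresses so they cannot collide with body words.
        return {f"rcpt:{r.lower()}" for r in recipients} | set(body.lower().split())

    def train(self, recipients, body, is_spam):
        toks = self.tokens(recipients, body)
        if is_spam:
            self.spam_tokens.update(toks)
            self.spam_msgs += 1
        else:
            self.ham_tokens.update(toks)
            self.ham_msgs += 1

    def token_spam_prob(self, token):
        # Add-one smoothed estimate of P(spam | token), assuming equal priors.
        in_spam = (self.spam_tokens[token] + 1) / (self.spam_msgs + 2)
        in_ham = (self.ham_tokens[token] + 1) / (self.ham_msgs + 2)
        return in_spam / (in_spam + in_ham)

    def score(self, recipients, body):
        # Combine the per-token probabilities in log-odds space.
        log_odds = 0.0
        for tok in self.tokens(recipients, body):
            p = self.token_spam_prob(tok)
            log_odds += math.log(p) - math.log(1 - p)
        return 1 / (1 + math.exp(-log_odds))


flt = TokenBayesFilter()
flt.train(["alice@example.com"], "meeting agenda attached", is_spam=False)
flt.train(["alice@example.com", "bob@example.net"], "cheap pills now", is_spam=True)

alone = flt.score(["alice@example.com"], "hello")
with_bob = flt.score(["alice@example.com", "bob@example.net"], "hello")
print(alone, with_bob)
```

In this sketch the filter belongs to the user at `alice@example.com`, so the only other recipient it ever learns about is one that was copied on her own mail, which reflects the limited view of recipients described above.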