Unsolicited and undesired e-mail, or spam, is becoming a significant problem as the amount of spam sent over computer networks increases, causing a substantial nuisance to e-mail users. Spam wastes network bandwidth, messaging system resources, and both human and computer processing time. As used herein, “spam” is any e-mail that is unwanted or unsolicited by the recipient or to which the recipient did not consent.
To reduce this problem, anti-spam filtering software and/or hardware may be deployed to analyze message streams to discriminate between spam and non-spam e-mail messages. Spam filters, however, typically focus on analysis of the content of e-mail messages, which can require substantial analysis time due to the size of the e-mail body and due to the fact that e-mail systems are frequently inundated with a flood of spam messages.
Considerable spam-filtering efficiency could be gained by avoiding extensive content examination of each and every e-mail message. Instead, bulk spam e-mails can be readily identified by simple header field examination, such as analysis of From/To fields and/or Subject header fields. In addition, such a header analysis could be an effective form of spam discrimination for newer forms of spam, such as cellular phone message spam, blog comment thread spam, or instant messaging spam (or “spim”). Each of these spam types lacks a significant content body, making traditional content-based spam-filtering less effective. In these spam types, the entire spam message can be encapsulated in a brief line of text. However, current anti-spam technology does not provide any effective mechanism for a quick and statistically accurate analysis of e-mail headers and the characters of which the headers are composed. Thus, it would be useful to have an efficient mechanism for conducting quick header analyses of e-mail either alongside standard e-mail content scans, as a preliminary analysis to full content scans, or even as a standalone method for analysis of the brief content of some newer forms of short message spam.
Spam senders are also becoming more aware of the potential for spam-filtering software to recognize words frequently used in spam e-mails and to recognize bulk e-mailings of spam. Spam-filtering software analyses often focus on detection of individual spam e-mail messages or on the association of e-mail messages with a similar group of bulk e-mailings based upon identifiable spam features. To elude automated spam-filter detection, spam senders sometimes disguise spam e-mail by adding randomization to e-mails and/or disguising typical spam e-mail terms. Thus, it would be useful to employ statistical methods that turn these evasion techniques against spam senders by detecting spam through the presence of randomization and/or feature disguise in e-mail.
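By way of illustration only, the following Python sketch shows one way character-level statistics might be computed over a Subject header so that randomization or feature disguise becomes a measurable signal. It is not the method of any particular filter. The function names (char_entropy, disguise_ratio, header_features) and the two chosen statistics are assumptions made for this example, and no decision threshold is implied.

```python
import math
import string
from collections import Counter
from email import message_from_string


def char_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the characters in text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def disguise_ratio(text: str) -> float:
    """Fraction of tokens that mix letters with digits or symbols.

    Hypothetical heuristic: a token such as "V1agra" contains both
    alphabetic and non-alphabetic characters after surrounding
    punctuation is stripped.
    """
    tokens = [t.strip(string.punctuation) for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0

    def is_mixed(token: str) -> bool:
        has_alpha = any(c.isalpha() for c in token)
        has_other = any(not c.isalpha() for c in token)
        return has_alpha and has_other

    return sum(is_mixed(t) for t in tokens) / len(tokens)


def header_features(raw_message: str) -> dict:
    """Compute the example statistics from the Subject header only."""
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "")
    return {
        "subject_entropy": char_entropy(subject),
        "subject_disguise_ratio": disguise_ratio(subject),
    }


if __name__ == "__main__":
    raw = "From: sender@example.com\nSubject: Ch3ap V1agra now\n\nbody text"
    print(header_features(raw))
```

Only the header is read here, so the cost of computing the statistics does not depend on the size of the message body. How such statistics would be combined or thresholded to classify a message is left open in this sketch.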
What is needed are methods, computer readable media, and systems that employ statistical methods to allow for accurate spam detection with a quick scan of incoming e-mail headers or e-mail content, and which will also recognize and use as an advantage the randomization and feature disguise methods typically used by spam senders.