The invention relates to methods and systems for classifying electronic communications, and in particular to systems and methods for filtering unsolicited commercial electronic mail (spam).
Unsolicited commercial electronic communications have been placing an increasing burden on the users and infrastructure of electronic mail (email), computer messaging, and phone messaging systems. Unsolicited commercial communications, also known as spam, form a significant percentage of all email traffic worldwide. Spam takes up valuable network resources, affects office productivity, and is considered annoying, intrusive, and even offensive by many computer users.
Software running on an email user's or email service provider's system may be used to classify email messages as spam or non-spam (also called ham). Current methods of spam identification include matching the message's originating address to lists of known offending or trusted addresses (techniques termed black- and white-listing, respectively), and searching for certain words or word patterns (e.g. refinancing, Viagra®, weight loss).
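By way of illustration only, the two conventional approaches described above (address lists and word-pattern matching) may be sketched as follows. The addresses, patterns, and the function name classify are hypothetical, and the choice to consult the white list before the black list is an assumption of this sketch rather than a feature of any particular filter.

```python
import re

# Hypothetical example lists; a deployed filter would load these
# from maintained sources rather than hard-coding them.
WHITELIST = {"colleague@company.example"}
BLACKLIST = {"offers@bulk-mailer.example"}

# Hypothetical word patterns of the kind mentioned above.
SPAM_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"\brefinancing\b", r"\bweight\s+loss\b")
]


def classify(sender: str, body: str) -> str:
    """Return 'spam' or 'ham' using address lists first, then word patterns."""
    address = sender.strip().lower()
    # Assumption: a trusted sender overrides every other check.
    if address in WHITELIST:
        return "ham"
    if address in BLACKLIST:
        return "spam"
    # Fall back to searching the message body for known word patterns.
    if any(pattern.search(body) for pattern in SPAM_PATTERNS):
        return "spam"
    return "ham"
```

In this sketch, a message from an unlisted sender is classified solely by its wording, so the result depends entirely on how well the pattern list matches the text actually used in the message.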
Spammers constantly develop countermeasures to such anti-spam methods. These countermeasures include misspelling certain words (e.g. Vlagra), using digital images instead of words, and inserting unrelated text in spam messages (also called Bayes poison). Spam identification may be further complicated by frequent changes in the form and content of spam messages.
To address the ever-changing nature of spam, a message classification system may include components configured to extract characteristic features from newly arrived spam waves, and anti-spam filters configured to classify incoming messages according to these characteristic features. In a common approach, human supervision is employed to define spam identification signatures to be used for classifying incoming messages. Human supervision may allow identifying relatively accurate and effective signatures. At the same time, since spam waves often appear and change rapidly, sometimes within hours or minutes, a responsive human-supervised system may require a significant amount of human labor.