The invention relates to systems and methods for classifying electronic communications, and in particular to systems and methods for filtering unsolicited commercial electronic mail (spam).
Unsolicited commercial electronic communications have been placing an increasing burden on the users and infrastructure of electronic mail (email), instant messaging, and phone text messaging systems. Unsolicited commercial email, commonly termed spam or junk email, forms a significant percentage of all email traffic worldwide. Spam takes up valuable network resources, affects office productivity, and is considered annoying and intrusive by many computer users.
Software running on an email user's or email service provider's system may be used to classify email messages as spam or non-spam. Spam messages can then be directed to a special folder or deleted. Several approaches have been proposed for identifying spam messages, including matching the message's originating address to lists of known offending or trusted addresses (techniques termed black- and white-listing, respectively), searching for certain words or word patterns (e.g. refinancing, Viagra®, weight loss), and analyzing message headers. Experienced spammers have developed countermeasures to such classification tools, such as misspelling certain words (e.g. Vlagra), using digital images instead of words, and inserting unrelated text in spam messages. Such countermeasures have made the identification of spam more difficult.
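As a minimal illustration of why simple keyword matching is fragile against such countermeasures, the following sketch checks a message against a small keyword list. The function name and the keyword list are hypothetical and chosen only for illustration; they are not taken from any particular filtering product.

```python
import re

# Hypothetical keyword list, for illustration only.
SPAM_KEYWORDS = {"refinancing", "viagra", "weight loss"}

def contains_spam_keyword(text, keywords=SPAM_KEYWORDS):
    """Return True if any keyword occurs in the text as a whole word or phrase.

    Matching is case-insensitive and exact, so a deliberately misspelled
    word does not match.
    """
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(keyword) + r"\b", lowered)
        for keyword in keywords
    )
```

Under these assumptions, a message containing "Viagra" is flagged, while the same message with the misspelling "Vlagra" passes the check unflagged, which is the weakness exploited by the countermeasures described above.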
Conventional anti-spam filtering approaches include black- and white-listing email addresses, as well as filtering based on keywords. For example, in U.S. Pat. No. 6,421,709, McCormick et al. describe a system and method of filtering junk emails using a first filter and a second filter. The first filter includes a list of disallowed sender email addresses and character strings. The second filter includes allowed sender names and character strings which the user wishes to receive. The first filter eliminates matching emails from the system, while the second filter directs matching emails to the user's inbox.
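A two-filter arrangement of this general kind can be sketched as follows. This is an illustrative sketch only, not the implementation described in the cited patent: the class name, the field names, the substring-based matching, the order in which the filters are applied, and the "unsorted" outcome for messages matching neither filter are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class TwoFilterSketch:
    """Hypothetical two-stage filter: a block list, then an allow list."""
    blocked_senders: set = field(default_factory=set)
    blocked_strings: set = field(default_factory=set)
    allowed_senders: set = field(default_factory=set)
    allowed_strings: set = field(default_factory=set)

    @staticmethod
    def _matches(sender, body, senders, strings):
        # A message matches a filter if its sender is listed or its body
        # contains any listed character string (case-insensitive).
        sender = sender.lower()
        body = body.lower()
        return (
            sender in {s.lower() for s in senders}
            or any(s.lower() in body for s in strings)
        )

    def route(self, sender, body):
        # First filter: eliminate messages from disallowed senders or
        # containing disallowed strings.
        if self._matches(sender, body, self.blocked_senders, self.blocked_strings):
            return "eliminated"
        # Second filter: deliver messages from allowed senders or
        # containing allowed strings.
        if self._matches(sender, body, self.allowed_senders, self.allowed_strings):
            return "inbox"
        # Assumption: messages matching neither filter are left unsorted.
        return "unsorted"
```

Such a filter depends entirely on the lists being kept current, since a message from a new address with unlisted wording matches neither filter.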
In U.S. Pat. No. 6,161,130, Horvitz et al. describe a system which scans messages for a number of predefined features such as word-based features, capitalization patterns, and specific punctuation. The scan yields an N-dimensional feature vector, which is fed into a probabilistic classifier. The classifier is trained on previously classified content and effectively bins the message into one of several classes, e.g., spam or non-spam, according to a computed confidence level. A user trains the classifier by manually labeling a set of messages as legitimate or spam.
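The general idea of mapping a message to a feature vector and classifying it probabilistically can be sketched as follows. The sketch uses a Bernoulli naive Bayes model with Laplace smoothing, which is an assumption chosen for brevity and is not asserted to be the classifier of the cited patent. The three features, the function names, and the 0.5 threshold are likewise hypothetical.

```python
import math
import re

# Hypothetical binary features, one of each kind mentioned above.
FEATURES = [
    lambda m: bool(re.search(r"\bfree\b", m, re.IGNORECASE)),  # word-based
    lambda m: bool(re.search(r"\b[A-Z]{4,}\b", m)),            # capitalization
    lambda m: "!!" in m,                                       # punctuation
]

def feature_vector(message):
    """Map a message to an N-dimensional vector of 0/1 feature values."""
    return [int(feature(message)) for feature in FEATURES]

def train(labeled_messages):
    """Estimate model parameters from (message, is_spam) pairs.

    is_spam must be a bool. Laplace smoothing keeps every probability
    strictly between 0 and 1.
    """
    counts = {True: [0] * len(FEATURES), False: [0] * len(FEATURES)}
    totals = {True: 0, False: 0}
    for message, is_spam in labeled_messages:
        totals[is_spam] += 1
        for i, value in enumerate(feature_vector(message)):
            counts[is_spam][i] += value
    n = totals[True] + totals[False]
    return {
        "prior": {c: (totals[c] + 1) / (n + 2) for c in (True, False)},
        "p_feature": {
            c: [(k + 1) / (totals[c] + 2) for k in counts[c]]
            for c in (True, False)
        },
    }

def spam_probability(model, message):
    """Return the estimated probability that the message is spam."""
    x = feature_vector(message)
    log_score = {}
    for c in (True, False):
        score = math.log(model["prior"][c])
        for value, p in zip(x, model["p_feature"][c]):
            score += math.log(p if value else 1.0 - p)
        log_score[c] = score
    # Normalize in a numerically stable way.
    top = max(log_score.values())
    weight = {c: math.exp(s - top) for c, s in log_score.items()}
    return weight[True] / (weight[True] + weight[False])

def classify(model, message, threshold=0.5):
    """Bin the message as spam or non-spam by comparing to a threshold."""
    return "spam" if spam_probability(model, message) >= threshold else "non-spam"
```

In this sketch, the manual labeling step corresponds to the list of (message, is_spam) pairs passed to train, and the computed confidence level corresponds to the value returned by spam_probability.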