The invention relates to systems and methods for classifying electronic communications, and in particular to systems and methods for filtering unsolicited commercial electronic mail (spam).
Unsolicited commercial electronic communications have been placing an increasing burden on the users and infrastructure of electronic mail (email), instant messaging, and phone text messaging systems. Unsolicited commercial email, commonly termed spam or junk email, forms a significant percentage of all email traffic worldwide. Spam takes up valuable network resources, affects office productivity, and is considered annoying and intrusive by many computer users.
Software running on an email user's or email service provider's system may be used to classify email messages as spam or non-spam. Spam messages can then be directed to a special folder or deleted. Several approaches have been proposed for identifying spam messages, including matching the message's originating address to lists of known offending or trusted addresses (techniques termed black- and white-listing, respectively), searching for certain words or word patterns (e.g., refinancing, Viagra®, weight loss), and analyzing message headers. Experienced spammers have developed countermeasures to these classification tools, including misspelling certain words (e.g., Vlagra) and inserting unrelated text in spam messages.
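By way of illustration only, the address-list and word-pattern approaches described above may be sketched as follows. The addresses, the keyword list, and the function name `classify` are hypothetical and chosen for this example; they are not taken from any particular filter. The sketch also shows why a simple misspelling can defeat literal word matching.

```python
import re

# Hypothetical address lists, assumed for illustration only.
BLACKLIST = {"offers@spam.example"}
WHITELIST = {"alice@corp.example"}

# Hypothetical keyword pattern built from the example words in the text.
KEYWORD_PATTERN = re.compile(r"\b(refinancing|viagra|weight loss)\b", re.IGNORECASE)


def classify(sender: str, body: str) -> str:
    """Return "spam" or "non-spam" using address lists, then keyword matching."""
    sender = sender.strip().lower()
    if sender in WHITELIST:
        # Trusted senders bypass content checks.
        return "non-spam"
    if sender in BLACKLIST:
        # Known offending senders are rejected outright.
        return "spam"
    if KEYWORD_PATTERN.search(body):
        # Literal word-pattern match on the message text.
        return "spam"
    return "non-spam"


if __name__ == "__main__":
    print(classify("unknown@host.example", "Cheap Viagra now"))
    # The misspelled variant is not covered by the literal pattern.
    print(classify("unknown@host.example", "Cheap Vlagra now"))
```

In this sketch the misspelled message is not matched by the keyword pattern, which is the kind of countermeasure noted above.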
A particular case of junk communication involves the transmission of digital images. These images may be offensive (e.g., adult content), or may serve as a means of conveying unsolicited information. Spammers try to avoid text-based detection by using digital images of words instead of actual text. One potential method of detecting such spam employs Optical Character Recognition (OCR) technology to convert images to text. Some OCR-based methods may be inaccurate and computationally expensive. To further complicate character recognition, spammers are known to use a range of so-called image obfuscation techniques, such as the addition of noise (random pixels), image distortion, interspersion of spam among sequences of animated image frames, and splitting individual images into multiple parts.