1. Field
Embodiments of the present invention generally relate to the field of spam filtering and anti-spam techniques. In particular, various embodiments relate to image analysis and methods for combating spam in which spammers use images to carry the advertising text.
2. Description of the Related Art
Image spam was originally created in order to get past heuristic filters, which block messages containing words and phrases commonly found in spam. Since image files have different formats than the text found in the message body of an electronic mail (email) message, conventional heuristic filters, which analyze only such text, do not detect the content of the message, which may be partly or wholly conveyed by embedded text within the image. As a result, heuristic filters were easily defeated by image spam techniques.
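By way of illustration only, the following minimal sketch shows why a text-based heuristic filter does not see a message whose content is carried by an image. The phrase list, the function name heuristic_score and the sample messages are hypothetical and are not taken from any particular filter.

```python
# Hypothetical phrase list; a real heuristic filter would use a much larger rule set.
SPAM_PHRASES = ("buy now", "cheap meds")


def heuristic_score(body_text: str) -> int:
    """Count how many known spam phrases appear in the message body text."""
    text = body_text.lower()
    return sum(1 for phrase in SPAM_PHRASES if phrase in text)


# A conventional text spam message: the advertising text is in the body.
text_spam_body = "Buy now! Cheap meds shipped overnight."

# An image spam message: the same advertising text is rendered inside an
# attached image, so the body text the filter inspects is empty.
image_spam_body = ""

text_score = heuristic_score(text_spam_body)
image_score = heuristic_score(image_spam_body)
print(text_score, image_score)
```

In this sketch the filter inspects only the body text, so the image spam message receives the same score as an empty message, regardless of what the attached image shows.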
To address this spamming technique, fuzzy signature technologies, which flag both known and similar messages as spam, were deployed by anti-spam vendors. Such fuzzy signature technologies allowed message attachments to be targeted, thereby recognizing as spam messages with different content but the same attachment.
Spammers now alter the images to make each email message appear different to signature-based filtering approaches while maintaining readability of the embedded text message to human viewers. The content of an image exists at two levels: (i) the pixel matrix and (ii) the text or graphics that the pixel matrix represents. At present, the notion of pixel-based matching does not make sense, as the same text could be represented by countless pixel matrices simply by changing various attributes, such as the font, size or color, or by adding noise. Therefore, hash matching and other signature-based approaches have essentially been rendered useless for blocking image spam, as they fail as a result of even minor changes to the background of the image.
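The fragility of hash matching described above can be sketched as follows. This is an illustrative example only: the 8x8 grayscale pixel matrix, the use of SHA-256 as the signature, and the function name signature are assumptions chosen for brevity, not a description of any vendor's implementation.

```python
import hashlib


def signature(pixels: bytes) -> str:
    """Exact-match signature: a SHA-256 digest of the raw pixel bytes."""
    return hashlib.sha256(pixels).hexdigest()


WIDTH, HEIGHT = 8, 8

# A tiny grayscale "image": white background (255) with a dark stroke (0)
# standing in for the embedded text.
original = bytearray([255] * (WIDTH * HEIGHT))
for x in range(2, 6):
    original[3 * WIDTH + x] = 0

# The spammer changes a single background pixel by an imperceptible amount.
altered = bytearray(original)
altered[0] = 254

sig_original = signature(bytes(original))
sig_altered = signature(bytes(altered))

changed_pixels = sum(a != b for a, b in zip(original, altered))
matches = sig_original == sig_altered
print(changed_pixels, matches)
```

Although only one background pixel differs and the stroke representing the text is untouched, the two digests no longer match, so a signature recorded for the first image does not identify the second.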
Some vendors have attempted to catch image spam by employing Optical Character Recognition (OCR) techniques; however, such approaches have had only limited success in view of spammers' use of techniques to obscure the embedded text messages with a variety of noise. FIGS. 1A and 1B illustrate sample images and obfuscation techniques used by spammers to defeat OCR image spam detection techniques. As shown in FIGS. 1A and 1B, polygons, lines, random colors, jagged text, random dots, varying borders and the like may be inserted into image spam in an attempt to defeat signature detection techniques and obscure the embedded text from OCR techniques. There is an almost unlimited number of ways in which spammers can randomize images. In addition to the foregoing obfuscation techniques, spammers have recently used techniques such as varying the colors used in an image, changing the width and/or pattern of the border, altering the font style, and slicing images into smaller pieces (which are then reassembled to appear as a single image to the recipient). Moreover, OCR is very computationally expensive. Depending upon the implementation, fully rendering a message and then looking for word matches against different character set libraries may take as long as several seconds per message, which is unacceptable in many contexts.