1. Field of the Invention
The present invention relates to scanning images included in emails to determine whether those images include undesired textual content.
2. Description of the Related Art
The prevalence of unsolicited commercial email, commonly known as spam, has grown rapidly and is still growing. Corporations and individual home users spend millions of dollars to combat spam. Internet Service Providers (ISPs) must cope with steadily increasing network traffic caused by spam emails. If spam traffic continues to grow, it may become unmanageable in the near future.
Typically, spam has been fought with software that scans incoming email messages to determine whether each message is spam. Common methods for detecting spam include email filtering based on the content of the email, DNS-based blackhole lists (DNSBL), greylisting, spamtraps, enforcement of technical requirements, checksumming systems that detect bulk email, and imposing a cost on the sender through a proof-of-work system or a micropayment.
Detecting spam based on the content of the email, either by detecting keywords or by statistical means, is very popular. Such methods can be very accurate when they are correctly tuned. As a result, spammers have resorted to other techniques for sending spam. One such technique is termed “image spam”. In image spam, the text of the message is stored as an image, such as a GIF or JPEG image, and displayed in the email or attached to the email. This prevents text-based spam scanners from detecting and blocking spam messages.
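The evasion described above can be illustrated with a minimal sketch in Python, using only the standard `email` package. The keyword list, the helper names, and the placeholder image bytes are assumptions made for illustration only; they are not part of any particular spam scanner. The sketch shows that a scanner which inspects only the text parts of a message finds nothing when the spam text exists solely as image data.

```python
from email.message import EmailMessage

# Hypothetical keyword list, chosen only for illustration.
SPAM_KEYWORDS = {"lottery", "winner", "free offer"}


def text_keyword_hits(msg):
    """Return the spam keywords found in the text parts of msg."""
    hits = set()
    for part in msg.walk():
        # Only text/* parts are inspected; image parts are skipped entirely.
        if part.get_content_maintype() != "text":
            continue
        body = part.get_content().lower()
        hits.update(k for k in SPAM_KEYWORDS if k in body)
    return hits


def build_text_spam():
    """A message whose spam content is ordinary text."""
    msg = EmailMessage()
    msg["Subject"] = "Hello"
    msg.set_content("You are a lottery winner!")
    return msg


def build_image_spam():
    """A message whose spam content would be rendered inside an image."""
    msg = EmailMessage()
    msg["Subject"] = "Hello"
    msg.set_content("Hi")  # harmless text body
    # Placeholder bytes standing in for a GIF that depicts the spam text.
    msg.add_attachment(b"GIF89a-placeholder", maintype="image",
                       subtype="gif", filename="offer.gif")
    return msg


if __name__ == "__main__":
    print(text_keyword_hits(build_text_spam()))
    print(text_keyword_hits(build_image_spam()))
```

In this sketch the text-only message triggers keyword hits, while the image-bearing message yields none, because the scanner never looks at the image data.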
Often, image spam contains nonsensical, computer-generated text that simply annoys the reader. However, a significant percentage of spam email contains images that provide the core meaning of the message. Such an image is frequently referenced from the HTML part of the MIME message while actually being carried as a MIME attachment, so that it appears to be an integral part of the content. In some cases, the images are the only attachments in otherwise blank messages. Either way, these images pose a serious challenge for spam blocking software based on content analysis. Some spam filters currently block any message containing embedded images. While such filters eliminate image spam, they also block legitimate email having embedded images, such as signatures and logos. Other spam filters use optical character recognition (OCR) technology to attempt to find the text in images attached to email messages. However, OCR techniques are time consuming and inaccurate, missing some spam and blocking some legitimate messages.
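As an illustration of how an image can be carried as a MIME part while appearing inline in the HTML content, the following Python sketch builds such a message and then enumerates its image parts. It uses only the standard `email` package; the helper names, the `example.invalid` domain, and the placeholder image bytes are assumptions for illustration. The sketch stops at locating the image data, which is the input any subsequent image analysis would need, and does not attempt the analysis itself.

```python
from email.message import EmailMessage
from email.utils import make_msgid


def build_inline_image_message(image_bytes):
    """Build an HTML message that displays image_bytes via a cid: reference."""
    msg = EmailMessage()
    msg["Subject"] = "Hello"
    # A fixed domain avoids a hostname lookup; the value is a placeholder.
    cid = make_msgid(domain="example.invalid")
    html = '<html><body><img src="cid:%s"></body></html>' % cid[1:-1]
    msg.set_content(html, subtype="html")
    # add_related converts the message to multipart/related and attaches
    # the image as a separate MIME part with an inline disposition.
    msg.add_related(image_bytes, maintype="image", subtype="gif", cid=cid)
    return msg


def image_parts(msg):
    """Return (content_type, disposition, payload_bytes) for each image part."""
    found = []
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            found.append((part.get_content_type(),
                          part.get_content_disposition(),
                          part.get_content()))
    return found


if __name__ == "__main__":
    # Placeholder bytes standing in for real GIF data.
    message = build_inline_image_message(b"GIF89a-placeholder")
    for content_type, disposition, payload in image_parts(message):
        print(content_type, disposition, len(payload))
```

A filter that rejects every message for which `image_parts` returns a non-empty list would also reject legitimate messages carrying logos or signature images, which is the drawback noted above.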
A need arises for a technique for analyzing image attachments to email messages and reliably determining whether the image includes spam, so that the message can be blocked.