Unsolicited and/or undesired email is a significant problem for email administrators and users. A common category of undesired email is SPAM, which is generally defined as bulk unsolicited email, typically sent for commercial purposes. Other categories of undesired email include bulk email containing viruses and/or malware and the like, or “phishing” messages which attempt to fool recipients into visiting misleading websites and/or revealing private information.
At best, undesired email utilizes resources on email systems, occupies email recipients' time to review and delete the undesired emails, and is generally frustrating and troublesome. At worst, undesired email can be malicious and can damage software, systems and/or stored data, and/or can promote or cause identity theft, financial loss, etc.
Much work has been undertaken in recent years to combat the growing problem of undesired email. One of the more common methods used to date to reduce undesired email is the use of filters, such as Bayesian-based filters, to remove, flag or otherwise identify possible undesired email. With many filter systems, the content of received emails is examined for specified text, or patterns of text, to form a statistical decision as to whether the email is likely an undesired email.
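The statistical decision made by such filters can be illustrated with a minimal sketch of naive-Bayes scoring. The token probabilities below are purely illustrative stand-ins for values a real filter would learn from previously classified mail:

```python
import math

# Hypothetical per-token spam probabilities; in a deployed filter these
# would be estimated from a corpus of classified messages.
SPAM_PROB = {"free": 0.90, "offer": 0.85, "meeting": 0.10, "report": 0.15}

def spam_score(text, prior=0.5, default=0.4):
    """Combine per-token probabilities into an overall estimate that the
    message is undesired, using the naive-Bayes combination in log-odds
    form (unknown tokens get a neutral-ish default probability)."""
    log_odds = math.log(prior / (1.0 - prior))
    for token in text.lower().split():
        p = SPAM_PROB.get(token, default)
        log_odds += math.log(p / (1.0 - p))
    # Convert log-odds back to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Under this sketch, a message dominated by spam-associated tokens such as "free offer" scores well above 0.5, while "meeting report" scores well below it; a threshold on the score yields the filter's decision.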
However, as each new technical solution to detecting undesired email is introduced and deployed, the originators of undesired email alter their messages and/or sending techniques in attempts to circumvent the undesired email detection systems.
In particular, as the originators of undesired email identify the characteristics used by filter-based systems, such as Bayesian filters, to identify undesired email, the originators alter the content of their undesired emails in attempts to convince the filters that their emails are not undesired. For example, originators of undesired emails intentionally misspell keywords in the text of the email and/or insert additional sentences of innocuous words or phrases to defeat statistical analysis filters.
As the implementers of the filters are exposed to more and more examples of undesired emails which employ a variety of attempts at circumventing the filters, the filters are updated and retrained to become more robust and effective.
As the originators of undesired emails are limited in the techniques they can employ in their undesired messages because their messages must ultimately be readable and/or otherwise acceptable to the recipients of the messages, it has become increasingly difficult for the originators to succeed in getting their emails through the filters.
Most recently, originators of undesired emails have begun employing image-based messages in their undesired emails. In such image-based messages, an image, or picture, of the message text is sent in the undesired email rather than sending the text as a conventional character representation. For example, the undesired email may contain an image of a sexual nature with a URL for an associated pornographic web site also shown in the image. As most filter systems rely on an analysis of the text contents of undesired emails, along with header and other envelope information, to identify an undesired email, replacing the message text with an image, such as a PNG, GIF or JPG file, containing the text can deny the filter the necessary information to make a decision about the email while still permitting the recipient to read the message. In the above-mentioned example, conventional filter systems may fail to identify the email as undesired because the URL is not available to the filter as a text representation.
In an attempt to detect undesired email with image-based messages, some vendors of anti-spam software and systems have now added image hashing functions. These functions produce a hash value for an image in a suspected email and compare that hash value to hash values for previously processed images, allowing the system to recognize an image which has been previously identified as undesired. Unfortunately, most hash based systems are relatively easy to fool as originators of undesired messages need only randomly modify some number of image pixels so that the resulting image will no longer match a previously determined hash value.
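The fragility of exact hash matching can be sketched as follows. Here the raw image bytes are hashed directly (a real system may hash decoded pixel data instead), and the "image" payload is an illustrative placeholder:

```python
import hashlib

def image_hash(image_bytes):
    """Exact-match hash of the image bytes, standing in for the image
    hashing step described above."""
    return hashlib.sha256(image_bytes).hexdigest()

# Illustrative stand-in for a previously identified undesired image.
original = bytes(range(256)) * 4
known_undesired = {image_hash(original)}

# An unmodified copy of the image is recognized by its hash...
unmodified_match = image_hash(original) in known_undesired

# ...but changing even a single byte (one pixel) yields a completely
# different hash, so the altered copy escapes detection.
altered = bytearray(original)
altered[10] ^= 0x01
altered_match = image_hash(bytes(altered)) in known_undesired
```

Because any one-pixel change defeats the match, originators can randomize a handful of pixels per message at negligible cost, which is why hash-based detection alone is easily circumvented.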
In a more sophisticated attempt to detect undesired email with image-based messages, some vendors have added optical character recognition (OCR) functions. In these systems, the OCR functions are used to extract message text from the image in the email and then the message text is analyzed by a conventional filter system.
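The OCR-then-filter pipeline amounts to a simple composition of the two stages. In the sketch below the OCR stage is a stub so the example is self-contained; in practice it would call a real OCR engine (for example, `pytesseract.image_to_string` on the decoded image), and the text filter would be a conventional statistical filter rather than the trivial keyword check shown:

```python
def ocr_extract(image_bytes):
    """Stand-in for a real OCR engine: returns the text rendered in the
    image. A deployed system would invoke an OCR library here."""
    return "free offer visit example.com"

def is_undesired(image_bytes, text_filter):
    """Extract the message text from the image, then hand that text to a
    conventional text-based filter for the statistical decision."""
    extracted = ocr_extract(image_bytes)
    return text_filter(extracted)

# Trivial illustrative filter: flag messages containing a spam keyword.
flagged = is_undesired(b"...image bytes...", lambda text: "free" in text)
```

The computational expense noted below arises almost entirely in the `ocr_extract` stage, which must decode and analyze every image in every suspected message.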
While such systems employing OCR functions can assist in identifying undesired email employing image-based messages, they suffer from serious disadvantages. In particular, OCR functions are computationally expensive to operate, and the cost of hardware systems to implement OCR-enhanced anti-spam services is significantly higher than that of non-OCR systems. Further, a variety of obfuscation techniques which can inhibit or prevent OCR systems from recognizing text are well known and can be easily employed by originators of undesired emails to defeat OCR-enhanced anti-spam systems. An example of a known obfuscation technique is the use of CAPTCHAs, such as that described in U.S. Pat. No. 6,195,698 to Lillibridge et al.
It is desired to have a system and method for making a statistical analysis of an email, in which at least a part of the message is image-based, to determine if the email is undesired.