Electronic mail (email) security must be established and maintained in a hostile environment. Spammers constantly attempt to “beat the system” by devising new ways to keep their messages from being identified as spam. Spam detection is often based on the content of a message, for example, offers for a Rolex® watch or sales of Viagra® pills. By detecting messages that repeatedly use the terms Rolex or Viagra, or that use those terms in the context of other information such as a phone number or email address, a determination might be made that a message is spam.
This determination is often made by parsing the text of the message and identifying key words or other words suggestive of spam content. These methods are not always effective and may produce a false positive, in which a legitimate message is identified as spam. For example, a message might be an exchange between medical professionals concerning Viagra or a communication between a buyer and a seller concerning the purchase of a Rolex® watch via an online auction or shopping website such as eBay.com or Amazon.com.
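The keyword-based determination described above, and the false positive it can produce, can be sketched as follows. This is a minimal illustration only: the keyword list, the phone and email patterns, the threshold `min_hits`, and the function name `looks_like_spam` are all hypothetical and are not taken from any particular filter.

```python
import re

# Hypothetical keyword list and context patterns, chosen only for illustration.
SPAM_KEYWORDS = {"rolex", "viagra"}
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def looks_like_spam(text: str, min_hits: int = 2) -> bool:
    """Flag text that repeats a keyword, or pairs a keyword with contact details."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = sum(1 for word in words if word in SPAM_KEYWORDS)
    has_contact = bool(PHONE_RE.search(text) or EMAIL_RE.search(text))
    return hits >= min_hits or (hits >= 1 and has_contact)


spam = "Cheap Rolex! Genuine Rolex watches, call 555-123-4567 today."
legit = ("Per the trial data, Viagra dosing should be reviewed. "
         "Reach me at dr.lee@example.org.")

print(looks_like_spam(spam))   # the spam message is flagged
print(looks_like_spam(legit))  # the legitimate message is flagged as well
```

In this sketch the second message, an exchange between medical professionals, is flagged because a single keyword appears alongside an email address, which is the kind of false positive discussed above.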
Spammers have now begun to embed their spam messages (e.g., disruptive, unwanted, or unsolicited messages) in an image (image based spam). Image based spam is a message in which the text is embedded in an image, which makes the message more difficult to parse than text or ASCII based messages. In some instances, a spammer will prepare a message, take a screen shot of the message, and embed the resulting image in an email without any surrounding ASCII or text. Every word the spammer intends to convey remains visible in the image, yet there is no text to parse in the image based spam message. Traditional spam detection techniques or filters therefore cannot effectively detect or stop these image based spam messages.
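The reason a text-based filter has nothing to work with can be illustrated with Python's standard `email` package. This is a sketch under stated assumptions: the helper `extract_text` is hypothetical, and the image payload is a run of placeholder bytes standing in for a screen shot rather than a real image.

```python
from email.message import EmailMessage


def extract_text(msg: EmailMessage) -> str:
    """Collect the content of every text/* part; other parts are skipped."""
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            chunks.append(part.get_content())
    return "\n".join(chunks)


# A conventional text message: the words are available to a parser.
text_msg = EmailMessage()
text_msg["Subject"] = "Hello"
text_msg.set_content("Cheap Rolex watches, call now")

# An image-only message: placeholder bytes stand in for a screen shot.
image_msg = EmailMessage()
image_msg["Subject"] = "Hello"
image_msg.set_content(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16,
                      maintype="image", subtype="png")

print("rolex" in extract_text(text_msg).lower())  # keyword is found in the text part
print(repr(extract_text(image_msg)))              # no text part, so nothing to parse
```

A keyword search over the second message sees an empty string, regardless of what words a human reader would see in the rendered image.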
Optical character recognition (OCR) may be used to identify words in an image based message. As with pure text-based parsing techniques, the recognized words are then parsed and a determination is made as to whether the message is spam. OCR is, however, slow, computationally intensive, and easily fooled. For example, when an image is rotated by only a few degrees so that a line of text appears on a slant, the OCR software may require additional computational cycles, thereby delaying processing, and/or may recognize characters incorrectly, which may cause the message to go unrecognized as spam altogether.
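One way to see why a slight slant is disruptive is to look at a naive line-finding step of the kind an OCR pipeline might use before recognizing characters. The sketch below is an assumption made for illustration, not a description of any particular OCR product: it represents two lines of text as solid bars in a small binary grid, approximates a small rotation with a vertical shear, and counts text lines as runs of rows that contain ink. All function names are hypothetical.

```python
import math


def make_page(width=200, height=40):
    """Binary page with two solid 'text lines' (1 = ink, 0 = background)."""
    page = [[0] * width for _ in range(height)]
    for top in (3, 12):                      # each line is 5 pixel rows tall
        for y in range(top, top + 5):
            for x in range(width):
                page[y][x] = 1
    return page


def skew(page, degrees):
    """Approximate a small rotation by shifting column x down by x * tan(angle)."""
    height, width = len(page), len(page[0])
    slope = math.tan(math.radians(degrees))
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_y = y + round(x * slope)
            if page[y][x] and new_y < height:
                out[new_y][x] = 1
    return out


def count_text_lines(page):
    """Count runs of consecutive rows that contain any ink."""
    lines, in_line = 0, False
    for row in page:
        has_ink = any(row)
        if has_ink and not in_line:
            lines += 1
        in_line = has_ink
    return lines


page = make_page()
print(count_text_lines(page))           # 2
print(count_text_lines(skew(page, 3)))  # 1
```

With a slant of only three degrees, the blank rows that separated the two lines disappear and the naive step reports a single line. A more robust system would need an additional deskewing pass, which is consistent with the extra computation and the risk of misrecognition described above.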
Spammers may also insert random noise, such as dots or other background artifacts, into an image. Noise and artifacts make it difficult for OCR techniques to identify the lines of text that make up a spam message. The human eye may process the content of the message without any issue, but computer imaging and processing techniques may not operate as well in the OCR context with such noise present.
Traditional spam filters or spam detection methods that assess the content of a message are thus proving to be ineffective against image based messages. There is a need for a context insensitive message detection technique that effectively detects and blocks image based spam messages.