1. Field of the Invention
The present invention relates generally to computer security, and more particularly but not exclusively to methods and apparatus for identifying text content in images.
2. Description of the Background Art
Electronic mail (“email”) has become a relatively common means of communication among individuals with access to a computer network, such as the Internet. Among its advantages, email is relatively convenient, fast, and cost-effective compared to traditional mail. It is thus no surprise that many businesses and home computer users have some form of email access. Unfortunately, the features that make email popular also lead to its abuse. Specifically, unscrupulous advertisers, also known as “spammers,” have resorted to mass emailings of advertisements over the Internet. These mass emails, which are also referred to as “spam emails” or simply “spam,” are sent to computer users regardless of whether the users asked for them. Spam includes any unsolicited email, not just advertisements. Spam is not only a nuisance, but also poses an economic burden.
Previously, the majority of spam consisted of text and images that are linked to websites. More recently, spammers have been sending spam with an image containing the inappropriate content (i.e., the unsolicited message). Spammers embed inappropriate content in an image because spam messages whose content appears as text can be distinguished from normal or legitimate messages in at least two ways. First, the inappropriate content (e.g., words such as “Viagra”, “free”, “online prescriptions,” etc.) can be readily detected by keyword and statistical filters (e.g., see Sahami M., Dumais S., Heckerman D., and Horvitz E., “A Bayesian Approach to Filtering Junk E-mail,” AAAI'98 Workshop on Learning for Text Categorization, 27 Jul. 1998, Madison, Wis.). Second, the domain in URLs (uniform resource locators) in the spam can be compared to databases of known bad domains and links (e.g., see Internet URL <www dot surbl dot org>).
In contrast, a spam email where the inappropriate content and URLs are embedded in an image may be harder to classify because the email itself does not contain obvious spammy textual content and does not have a link/domain that can be looked up in a database of bad links/domains.
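By way of illustration only, the two text-based checks described above may be sketched as follows. This is a minimal sketch and not an implementation of any particular spam filter; the keyword list, the domain list (including the placeholder domain bad-example.test), and all function names are hypothetical and do not come from the references cited above.

```python
import re

# Hypothetical keyword list and bad-domain list, for illustration only.
SPAM_KEYWORDS = ("viagra", "free", "online prescriptions")
BAD_DOMAINS = {"bad-example.test"}

# Captures the host part of an http(s) URL.
URL_DOMAIN_RE = re.compile(r"https?://([^/\s:]+)", re.IGNORECASE)


def keyword_hits(body):
    """Return the keywords that appear as whole words or phrases in the body."""
    text = body.lower()
    return [k for k in SPAM_KEYWORDS
            if re.search(r"\b" + re.escape(k) + r"\b", text)]


def bad_domain_hits(body):
    """Return the URL domains in the body that are on the bad-domain list."""
    domains = {d.lower() for d in URL_DOMAIN_RE.findall(body)}
    return sorted(d for d in domains if d in BAD_DOMAINS)


def looks_like_text_spam(body):
    """Flag the body if either text-based check finds a match."""
    return bool(keyword_hits(body) or bad_domain_hits(body))
```

An email body that contains only a reference to an attached image yields no keyword or domain match under either check, which is the difficulty noted above.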
The use of OCR (optical character recognition) techniques to identify spam images (i.e., images having embedded spammy content) has been proposed because OCR can be used to identify text in images. In general, use of OCR for anti-spam applications would involve performing OCR on an image to extract text from the image, scoring the extracted text, and comparing the score to a threshold to determine if the image contains spammy content. Examples of anti-spam applications that may incorporate OCR functionality include the SpamAssassin and Barracuda Networks spam filters. Spammers responded to OCR solutions in spam filters with images deliberately designed with anti-OCR features. Other approaches to combat spam images include flesh-tone analysis and use of regular expressions.
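For illustration only, the scoring and threshold steps of such an OCR-based approach may be sketched as follows. The sketch assumes the text has already been extracted from the image by a separate OCR step that is not shown; the term weights, the threshold value, and the function names are hypothetical and are not taken from any of the named spam filters.

```python
import re

# Hypothetical per-term weights and threshold, for illustration only.
TERM_WEIGHTS = {"viagra": 5.0, "prescriptions": 3.0, "free": 1.0}
SPAM_THRESHOLD = 5.0


def score_text(extracted_text):
    """Sum the weights of known terms found in the OCR-extracted text."""
    tokens = re.findall(r"[a-z]+", extracted_text.lower())
    return sum(TERM_WEIGHTS.get(token, 0.0) for token in tokens)


def image_is_spam(extracted_text, threshold=SPAM_THRESHOLD):
    """Compare the score of the extracted text to the threshold."""
    return score_text(extracted_text) >= threshold
```

In this sketch, extracted text such as "Viagra free" scores above the threshold, whereas a misrecognized string such as "V1agra fr3e" matches no term and scores zero, which suggests why images with anti-OCR features can defeat this kind of approach.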
The present invention provides a novel and effective approach for identifying content in an image even when the image has anti-OCR features.