1. Technical Field
The present invention relates generally to optical character recognition and computer security.
2. Description of the Background Art
Electronic mail (“email”) has become a relatively common means of communication among individuals with access to a computer network, such as the Internet. Among its advantages, email is relatively convenient, fast, and cost-effective compared to traditional mail. It is thus no surprise that a lot of businesses and home computer users have some form of email access. Unfortunately, the features that make email popular also lead to its abuse. Specifically, unscrupulous advertisers, also known as “spammers,” have resorted to mass electronic mailings of advertisements over the Internet. These mass emails, which are also referred to as “spam emails” or simply “spam,” are sent to computer users regardless of whether they asked for them or not. Spam includes any unsolicited email, not just advertisements. Spam is not only a nuisance, but also poses an economic burden.
Previously, the majority of spam consisted of text and images that are linked to websites. In the last few years, spammers have been sending spam in which the inappropriate content (i.e., the unsolicited message) is embedded in an image. The reason for embedding inappropriate content in an image is that text-based spam messages can be distinguished from normal or legitimate messages in at least two ways. First, the inappropriate content (e.g., words such as “Viagra”, “free”, “online prescriptions,” etc.) can be readily detected by keyword and statistical filters (e.g., see Sahami M., Dumais S., Heckerman D., and Horvitz E., “A Bayesian Approach to Filtering Junk E-mail,” AAAI'98 Workshop on Learning for Text Categorization, 27 Jul. 1998, Madison, Wis.). Second, the domains in URLs (uniform resource locators) in the spam can be compared to databases of known bad domains and links (e.g., see Internet URL <http://www.surbl.org/>).
In contrast, a spam email where the inappropriate content and URLs are embedded in an image may be harder to classify because the email itself does not contain obvious “spammy” textual content and does not have a link/domain that can be looked up in a database of bad links/domains.
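For illustration only, the two text-based checks described above (keyword matching and lookup of URL domains against a list of known bad domains) can be sketched as follows. This is a minimal sketch under stated assumptions, not a filter described in the cited references: the keyword list, the domain list (including the domain `bad-pills.example`), and all function names are hypothetical, and the statistical (e.g., Bayesian) filtering is not shown.

```python
import re
from urllib.parse import urlparse

# Hypothetical example data; a real filter would rely on much larger,
# curated keyword lists and domain databases.
SPAM_KEYWORDS = {"viagra", "free", "online prescriptions"}
BAD_DOMAINS = {"bad-pills.example"}

# Simple pattern for http/https URLs appearing in plain message text.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def keyword_hits(body):
    """Return the set of spam keywords found as whole words in the text."""
    text = body.lower()
    return {
        kw for kw in SPAM_KEYWORDS
        if re.search(r"\b" + re.escape(kw) + r"\b", text)
    }


def bad_domain_hits(body):
    """Return the set of URL host names in the text that are on the bad list."""
    hits = set()
    for url in URL_PATTERN.findall(body):
        host = (urlparse(url).hostname or "").lower()
        if host in BAD_DOMAINS:
            hits.add(host)
    return hits


def looks_like_spam(body):
    """Flag the message if either text-based check finds a match."""
    return bool(keyword_hits(body) or bad_domain_hits(body))
```

A message whose body text is only, for example, "See attached picture." passes both checks in this sketch, which illustrates why content embedded in an image may evade text-based filtering.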
Similarly, other messages (besides email) may also have embedded images with sensitive text content. It may be desirable to filter the messages for such image-embedded text content, for example, for data leakage or compliance applications.
Extracting text content from images can be a difficult problem, especially for languages with large alphabets (character sets), such as Chinese and Japanese. The character sets for Chinese and Japanese each include over two thousand distinct characters. Such languages cause automatic content filtering software to be less useful when dealing with images. For example, an anti-spam engine may fail to detect a spam email that contains only a picture, where the message that the spam email is intended to convey is represented in image form.
The use of OCR (optical character recognition) techniques to identify spam images (i.e., images having embedded “spammy” content) has been proposed because OCR can be used to identify text in images. In general, use of OCR for anti-spam or other content-sensitive message filtering applications would involve performing OCR on an image to extract text from the image, and comparing the extracted text with pre-defined spammy or other content-sensitive terms to determine if the image contains that content.
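For illustration only, the general flow described above (perform OCR on an image, then compare the extracted text with pre-defined terms) can be sketched as follows. This is a minimal sketch under stated assumptions: the OCR step is represented by a caller-supplied function, `fake_ocr` is a hypothetical stand-in that performs no actual recognition, and the function names and term list are not taken from any particular OCR product.

```python
from typing import Callable, Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so OCR line breaks do not split terms."""
    return " ".join(text.lower().split())


def find_sensitive_terms(
    image: bytes,
    ocr: Callable[[bytes], str],
    terms: Iterable[str],
) -> List[str]:
    """Run OCR on the image and return the pre-defined terms found in its text.

    `ocr` is an assumption: any function that maps image data to extracted text.
    Matching here is a plain substring comparison on normalized text.
    """
    extracted = normalize(ocr(image))
    return [term for term in terms if normalize(term) in extracted]


def fake_ocr(image: bytes) -> str:
    """Hypothetical stand-in for an OCR engine: treats the bytes as the text."""
    return image.decode("utf-8")


# Example: the "image" carries text that an OCR engine would extract.
found = find_sensitive_terms(
    b"Online\nPrescriptions  FREE",
    fake_ocr,
    ["free", "online prescriptions", "viagra"],
)
```

In this sketch, an image would be treated as containing the content-sensitive material when the returned list is non-empty. The accuracy of such a comparison depends on the accuracy of the OCR step.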