1. Field of the Invention
The present invention relates to identification of text in graphical images, particularly images that are suspected of being SPAM, and more particularly to SPAM images with text that has been deliberately altered to defeat OCR and binary signature methods of SPAM detection.
2. Description of the Related Art
SPAM emails have become a veritable scourge of modern email systems. It is estimated that as much as 60-80% of Internet email traffic today is of a SPAM nature. SPAM, in addition to being annoying and wasteful of the time of the recipient, places a considerable burden on large email service providers and on corporate networks. For a regular user, the cost of SPAM that gets through is the few clicks that it takes to delete the offending message. For large-scale email providers, such as Google, Yahoo and Microsoft, as well as for large corporations that have their own server-based solutions for SPAM filtering, handling SPAM is a problem that needs to be solved on an industrial scale. For example, such large mail service providers need to filter millions of SPAM messages every hour.
A phenomenon observed recently is the increasing professionalization of SPAM generators. Many of the techniques used by SPAM generators closely mirror and borrow from the techniques used by many professional virus writers. At any given moment millions of computers connected to the Internet are zombified. In other words, these computers spew out vast numbers of SPAM emails, even though the owners of these computers are unaware of this fact.
Although in the early days of the SPAM epidemic it was possible to filter SPAM by looking for keywords, such as “Viagra,” “Hoodia,” “free offer” and so on, modern SPAM has evolved far beyond such simple and easily filterable examples. Today it is common to find SPAM embedded in images, which are themselves embedded within the body of the email or sometimes included as an attachment to an otherwise innocuous email. The problem of filtering image-based SPAM presents a difficulty to the anti-SPAM software vendor, because filtering images generally requires a dramatically greater investment in hardware resources, compared to filtering simple text-based SPAM. Also, for large email service providers and corporate email servers, such SPAM filtering needs to deal with the SPAM more or less on-the-fly or within at most a few seconds, and it would be unacceptable if the SPAM filters delayed receipt of the email by a significant amount of time.
Image-based SPAM has thus far defied an effective solution primarily for two reasons. One is that SPAM generators have quickly learned to defeat signature-based methods of SPAM detection. Such methods bitwise-compare a suspect image to a known SPAM message (for example, a known SPAM GIF or JPEG). By randomly changing a handful of bits in the image, a spammer leaves the image virtually unchanged to the human observer while giving it a different signature (if only by a handful of bits).
For example, where three bytes are used to represent each color pixel, changing only one bit of one pixel of a large image would be undetectable to the naked eye, but nonetheless such an image would be treated as having a different signature by the signature method of SPAM detection. Any changes in the background of the image, the color of the background or the letters, distributing random “blotches” or “spots” in the background of the image and so on, all combine to defeat the signature-based method for detecting SPAM in images.
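The single-bit example above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the signature is a SHA-256 digest of the raw pixel bytes; the text does not specify how signatures are computed, and the helper names `signature` and `flip_lowest_bit` and the 100×100 grey image are hypothetical.

```python
import hashlib


def signature(pixels: bytes) -> str:
    """Return a hex digest used here as a stand-in for a binary signature."""
    return hashlib.sha256(pixels).hexdigest()


def flip_lowest_bit(pixels: bytes, index: int) -> bytes:
    """Return a copy of the pixel buffer with the lowest bit of one byte flipped."""
    altered = bytearray(pixels)
    altered[index] ^= 0x01
    return bytes(altered)


# Hypothetical 100x100 image, three bytes (R, G, B) per pixel, uniform grey.
original = bytes([128]) * (100 * 100 * 3)

# Change one bit of one color channel of the first pixel (128 becomes 129).
altered = flip_lowest_bit(original, 0)

# Exactly one byte out of 30,000 differs between the two buffers.
changed_bytes = sum(a != b for a, b in zip(original, altered))

# The two digests no longer agree, so an exact-match lookup would miss the image.
signatures_match = signature(original) == signature(altered)
```

Here `changed_bytes` is 1 and `signatures_match` is `False`: the change is far below what a viewer would notice, yet an exact signature comparison treats the two images as unrelated.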
Another way to detect image-based SPAM is through optical character recognition (OCR). The OCR-based methods have two primary drawbacks. First, they are very resource-intensive and are difficult to use in email systems that process large numbers of such emails per unit of time. Second, the accuracy of OCR systems is significantly less than 100%, raising the prospect of a false positive detection of a message that is in fact not SPAM.
In the industry, a false positive is generally regarded as a greater evil than letting through some number of SPAM messages, since very often an email falsely identified as SPAM by the SPAM filter will never be seen by its recipient, or, at best, will be seen much later by the recipient. Additionally, just as spammers have learned to defeat the signature method of SPAM detection, some techniques that attempt to defeat OCR-based SPAM detection have become available. They include writing text not in a straight line but along wavy or curved lines, adding noise to the image, spacing the letters in the words such that the words cannot be recognized by OCR software, writing some letters at an angle, etc.
An example of a multi-stage method for analysis of raster images is described in Russian Patent No. 2234734. In this patent the first stage is a preliminary text identification using a less exact method, followed by a more exact identification of those objects that are left unidentified in the image. In the first stage the image is segmented into regions, tables, text fragments, text lines, words and symbols, and in the second stage the segmentation is further refined, using additional available information.
A similar principle is described in Russian Patent No. 2251151, where different objects in the image are divided into levels, based on the degree of complexity, such as a symbol, word, text line, paragraph, table and region. Each object is then associated with a particular level and the connections between objects of different levels and of the same level are then identified. Then a hypothesis is formulated regarding the properties of the various objects, which is later corrected, based on various image attributes.
Both of these methods essentially lack an identification of text elements in the image, which makes the processing of such images in real time a relatively difficult task.
U.S. Patent Publication No. 2004/0221062 describes a method in which, in order to identify the contents of the image, a preliminary visualization of the image is done in a first format, and afterwards the message is transformed into a purely symbolic format, so as to filter out decorative components of the image prior to text analysis. U.S. Pat. No. 7,171,046 also describes certain aspects of text identification in images.
U.S. Patent Publication No. 2005/0281455 describes a method of processing images and text in the images using a neural network. U.S. Pat. No. 6,470,094 describes a general localization methodology for text in images, where, as a symbol, several adjacent pixels are used, which, in turn, are combined into words.
U.S. Pat. No. 6,608,930 also describes identification of text in video images. In this patent, a first color is separated out, to enhance the contrast, which is further enhanced using a 3×3 map. Random noise is then removed using median filtration. The edge of the image is identified using an adaptive threshold, and then edges are removed from the image, so that portions of the image are deleted where there is no text, or where text cannot be reliably identified. Pixels that are close to each other are then combined into a single symbol, and then adjacent symbols are combined into words and then text lines. However, the method described in this patent is relatively calculation-intensive, and has not found substantial popularity.
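The median filtration step mentioned above can be sketched as follows. This is a generic 3×3 median filter over a grayscale image held as a list of rows, written only to illustrate how the technique removes isolated noise; it is not taken from the cited patent, and the function name `median_filter_3x3`, the border handling and the sample image are assumptions.

```python
from statistics import median


def median_filter_3x3(image):
    """Apply a 3x3 median filter; border pixels are copied unchanged."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Collect the pixel and its eight neighbours from the input image.
            window = [
                image[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            # The median discards isolated outliers such as a single bright spot.
            result[y][x] = median(window)
    return result


# Hypothetical 5x5 grayscale image: dark background with one noisy pixel.
noisy = [
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 255, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
]

cleaned = median_filter_3x3(noisy)
```

Every 3×3 window in this sample contains at most one bright value, so the median of each window is the background value and the noisy pixel at the center is replaced by 10. Visiting every interior pixel and sorting nine values each time also hints at why such per-pixel pipelines are calculation-intensive on large images.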
Accordingly, there is a need in the art for an effective method of identifying text in images and an effective method of SPAM detection in emails that include embedded or attached images.