1. Field of the Invention
The present invention relates to a method for extracting text from an image which is typically displayed on web pages of the World Wide Web.
2. Description of the Related Art
World Wide Web search engines typically search web pages for keywords appearing as text. In addition, visual search engines, such as AV Photo Finder available through AltaVista.com, search for images resembling conceptual keywords such as face or horse. Neither of these methods is suitable for searching for keywords appearing within graphical images on web pages. The present invention overcomes these shortcomings by presenting a method of extracting textual words from graphical images on web pages to allow keyword searching on these images.
Graphical images on web pages are typically logos, clip art, function buttons (typically labeled “Submit”, “Cancel”, “OK”, etc.), or photographs, many of which include text within the image. This image text may not necessarily appear as searchable text (such as in ASCII format) on the web page. These images are typically in either Graphics Interchange Format (GIF) or Joint Photographic Experts Group format (JPEG).
The GIF format allows an image to present up to 256 different composite colors, as defined in the header of the GIF file. Each composite color is represented by three color components (or channels) of red, green and blue (RGB) with intensity levels varying from 0 to 255. Black is represented by (R,G,B)=(0, 0, 0), white by (255, 255, 255) and full intensity red by (255, 0, 0).
The JPEG format allows a much wider range of composite colors to more accurately represent the continuous range of colors present in photographic images. In JPEG 24-bit images, for example, over 16 million composite colors may be used. The JPEG color channels (typically luminance and chrominance to maximize file compressibility) may be readily converted into RGB (red, green, and blue) color components.
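The conversion from luminance and chrominance channels to RGB mentioned above can be sketched as follows. This is a minimal illustration that assumes the full-range YCbCr convention commonly used in JFIF files; the function name `ycbcr_to_rgb` is hypothetical and is not taken from this specification.

```python
def ycbcr_to_rgb(y, cb, cr):
    """Convert one full-range YCbCr pixel (each 0..255) to an (R, G, B) tuple.

    Assumption: JFIF-style coefficients with chrominance centered on 128.
    """
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)

    def clamp(v):
        # Round to the nearest integer and keep the result within 0..255.
        return max(0, min(255, int(round(v))))

    return (clamp(r), clamp(g), clamp(b))
```

With neutral chrominance (Cb = Cr = 128) the three RGB components equal the luminance, so gray pixels stay gray.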
Various methods are known in the general art of image recognition. Several of these methods involve analyzing single images where the images are associated with a scanner of a photographic development system. One such method is taught by the patent to Ikeshoji, et al., U.S. Pat. No. 5,761,339, which discloses separating a background image from image data, processing the background image by a maximum filter, comparing the brightness of each pixel constituting said image data with the brightness of peripheral pixels, and replacing it with the maximum brightness. This method only evaluates the image with respect to the varying degrees of pixel brightness within the image. The patent to Gormish, et al., U.S. Pat. No. 5,659,631, discloses a data compression system that separates input into color planes prior to compressing the image. Additionally, the image data could be coded using pixel information as context. This method utilizes color planes in its evaluation of the underlying image. Furthermore, the patent to Vincent, et al., U.S. Pat. No. 5,010,580, discloses a system for extracting handwritten or typed information from forms that have been printed in colors other than the color of the handwritten or typed information. The information extraction system includes a detector for detecting color values for scanned pixel locations on a printed form; a comparator for comparing the color values with reference color values; an identifier for identifying ones of the scanned pixel locations that have color values that correspond to the reference color values; and an optical character recognition engine for receiving data regarding the identified locations.
The present invention was developed to overcome the drawbacks of the prior art and to provide improved image evaluation and extraction methods characterized by identifying the number of unique intensity levels present in each of the color components of the subject image, reducing the number of intensity levels to a small number (if necessary), converting each of the remaining intensity levels for each color component to a black and white image, reversing all black and white pixels within the image if the number of black pixels exceeds the number of white pixels, performing character recognition on each black and white image, evaluating the text output by the character recognizer to determine success or failure, and stopping the processing when the text is successfully recognized or all intensity levels for all color components have been processed.
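The first two steps named above, identifying the unique intensity levels in each color component and reducing their number, might look like the sketch below. The text does not say how the reduction is performed; uniform bucketing is used here purely as an assumed stand-in, and the names `unique_levels` and `reduce_levels` are hypothetical.

```python
def unique_levels(pixels):
    """Return three sorted lists of the distinct R, G and B intensity levels.

    pixels: iterable of (r, g, b) tuples with values in 0..255.
    """
    seen = (set(), set(), set())
    for pixel in pixels:
        for channel in range(3):
            seen[channel].add(pixel[channel])
    return tuple(sorted(s) for s in seen)


def reduce_levels(pixels, max_levels=4):
    """Map every channel value onto at most max_levels buckets.

    Assumption: uniform quantization. Each value is replaced by the lower
    bound of its bucket, so no channel keeps more than max_levels levels.
    """
    def quantize(v):
        bucket = v * max_levels // 256      # index in 0..max_levels-1
        return bucket * 256 // max_levels   # lower bound of that bucket

    return [tuple(quantize(v) for v in pixel) for pixel in pixels]
```

A GIF image already has at most 256 composite colors, so a reduction of this kind would matter mainly for photographic JPEG images.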
Accordingly, a primary object of the present invention is to provide a method and system for quickly and efficiently extracting text from a graphical image that is typically found on a web page, characterized by the steps of separating the image into its color components, determining the image intensity levels contained within the color planes of each of said color components, respectively, scanning in a pixel-by-pixel manner for at least one first color component the color plane of said first color component having the highest intensity level, comparing the intensities of successive color pixels with said highest color intensity level and generating corresponding black and white pixels for those color pixels having intensities equal to and other than said highest intensity level, respectively, and recognizing the text characters of the black pixels in the event that the number of white pixels exceeds the number of black pixels. In the event that the number of black pixels exceeds the number of white pixels, converter means reverse the colors of the black and white pixels before recognition of the text by the text recognizing means.
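The pixel comparison and color reversal described above can be illustrated with a short sketch. It assumes a color plane is held as a list of rows of integer intensities and uses 1 for black and 0 for white; both function names are hypothetical.

```python
def binarize_at_level(plane, level):
    """Return a black and white image for one intensity level of a plane.

    Pixels whose intensity equals `level` become black (1); all other
    pixels become white (0).
    """
    return [[1 if value == level else 0 for value in row] for row in plane]


def ensure_black_on_white(bw):
    """Reverse the image when black pixels outnumber white pixels."""
    black = sum(sum(row) for row in bw)
    white = sum(len(row) for row in bw) - black
    if black > white:
        return [[1 - value for value in row] for row in bw]
    return bw
```

The reversal rests on the expectation that text occupies fewer pixels than its background, so the minority color is handed to the recognizer as black.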
According to one embodiment of the invention, the color levels of a first color component of the image are scanned by the scanning means. According to a second embodiment, intensity level comparison means are used to determine the highest color plane of any of the color components that is to be scanned by the scanning means.
According to a more specific object of the invention, the method for recognizing text within a colored image includes identifying the intensity of each color plane of the image, and then selecting the color plane with the highest intensity level. The text of the image is then extracted from the determined color plane. The extracted text is converted into a black and white image, whereupon an evaluation takes place to determine if there are more black pixels on the resultant image than there are white pixels. If there are more black pixels, the image is inverted, thus changing every black pixel to a white pixel, and likewise, every white pixel into a black pixel. After the image has been properly formatted with the appropriate white and black pixel arrangement, a character recognition engine evaluates the black and white image to determine the textual characters. If the character recognition engine is successful, the method is complete. However, if the recognition was not successful, the next lowest intensity level of the initial selected color plane is identified, and the text extraction and character recognition process is repeated. If this process does not yield a positive character recognition result, the method evaluates the color plane which has the next smallest intensity level in relationship to the first color plane that was selected, and the same evaluation method is performed on that color plane. If that process does not yield a successful recognition, the third color plane is evaluated. If the third color plane does not yield a successful recognition, the text of the image will not be recognizable by this method.
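The overall control flow described above can be sketched as a pair of nested loops. This is one reading of the text, not a definitive implementation: planes are assumed to be visited in descending order of their highest intensity level, and `recognize` and `is_success` are hypothetical callables standing in for the character recognition engine and the success test.

```python
def extract_text(planes, recognize, is_success):
    """Try each intensity level of each color plane until recognition succeeds.

    planes: dict mapping a plane name to a list of rows of intensities.
    recognize: callable taking a black and white image (1 = black, 0 = white)
        and returning a string. Hypothetical stand-in for an OCR engine.
    is_success: callable taking the recognized string and returning a bool.
    Returns the recognized text, or None if no plane and level succeed.
    """
    def highest(plane):
        return max(max(row) for row in plane)

    # Assumed ordering: the plane with the highest intensity level first.
    order = sorted(planes, key=lambda name: highest(planes[name]), reverse=True)

    for name in order:
        plane = planes[name]
        levels = sorted({v for row in plane for v in row}, reverse=True)
        for level in levels:
            # Pixels at this level become black, all others white.
            bw = [[1 if v == level else 0 for v in row] for row in plane]
            black = sum(sum(row) for row in bw)
            white = sum(len(row) for row in bw) - black
            if black > white:
                # Reverse so the minority (presumed text) pixels are black.
                bw = [[1 - v for v in row] for row in bw]
            text = recognize(bw)
            if is_success(text):
                return text
    return None
```

Passing the recognizer and the success test as parameters keeps the loop independent of any particular character recognition engine.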
The character recognition will be determined to be successful by comparing the extracted text to a lexicon of legitimate words, by computing the likelihood of the sequence of characters (e.g. “Qxv” is highly unlikely in English, whereas “com” is relatively common), or by using character recognition software which can provide a confidence measure for each character based upon how well its pixels matched the nominal template or features.
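The first two success tests above, a lexicon comparison and a character-sequence likelihood, can be sketched as follows. The scoring choices are assumptions made for illustration: the lexicon test reports the fraction of recognized words found in the lexicon, and the likelihood test uses character bigram counts with add-one smoothing. All function names and the sample word list are hypothetical.

```python
import math
from collections import Counter


def lexicon_hit_rate(text, lexicon):
    """Return the fraction of words in `text` that appear in `lexicon`."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in lexicon for w in words) / len(words)


def train_bigrams(sample_words):
    """Count adjacent character pairs in a list of sample words."""
    counts = Counter()
    for word in sample_words:
        word = word.lower()
        counts.update(zip(word, word[1:]))
    return counts


def avg_bigram_logprob(text, counts, alphabet_size=26):
    """Average log probability of the character pairs in `text`.

    Assumption: add-one smoothing over alphabet_size ** 2 possible pairs.
    Higher (less negative) values mean a more plausible sequence.
    """
    s = text.lower()
    pairs = list(zip(s, s[1:]))
    if not pairs:
        return float("-inf")
    denom = sum(counts.values()) + alphabet_size ** 2
    return sum(math.log((counts[p] + 1) / denom) for p in pairs) / len(pairs)
```

A threshold on either score would then decide success or failure; suitable thresholds depend on the lexicon and the training text and are not specified here.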