The invention relates to a system and method of digital image processing; more particularly, to automatic color dropout using luminance-chrominance space for high speed document scanning.
In document image processing there is a need to extract textual information from an image that has color content in the background. Removal of certain color content is useful in specific applications, such as forms processing, where the color content on the form, which is used to facilitate data entry, adds no value to subsequent data processing. Color dropout reduces the image file size, eliminates extraneous information, and simplifies the task of extracting textual information from the image for the reader or processing system.
One application where color dropout is important is optical character recognition (OCR). Electronic color form dropout is a desired feature in forms processing because it eliminates the interference of the form structure with the text of interest, which simplifies the OCR application. In the OCR process, a document is scanned electronically, which converts the data on the form to a digital image. Once the data is captured in electronic form, the information to be read is separated from the background information, such as boxes and text with instructions on how to complete the form. This process results in the elimination of all but the desired information. Once this separation is accomplished, the text fields of the image are extracted and processed by an OCR algorithm.
A scanning system capable of capturing an image in color produces a digital image file with three color components, such as red, green and blue ("RGB"). The number of pixels in the color image depends on the resolution, in dots per inch, of the camera optics and detector. The numerical value at each pixel of a color component represents the amount of the particular primary color detected at that pixel. In cases where all three color components have the same value, the result is said to be a shade of gray. As the intensity of each color component is reduced, the gray appearance turns black.
Business forms are often printed with some background color, for example, a pastel color. One way of eliminating this background color is to use an optical filter in the electronic scanner, matched to the background color to be eliminated. The color filter prevents the scanner detector from discerning information printed in that particular color; therefore, the pastel background appears white to the scanner. The text printed in black or any color other than the filter color is captured by the scanner. This system limits the dropout colors to the particular filter installed on the scanner, which must match the background color on the forms. In other words, this system requires different filters for different color forms and is limited to dropping a single color.
Other available systems and methods that automatically identify the color of the desired data and eliminate background colors do not address certain needs. One such system/method for extracting data from business forms is automatic color dropout using luminance-chrominance space. Typically, the digital image generated is bi-tonal, such as black and white, or two different grayscale values. However, special problems are created by business forms that have been typed on various brands and styles of typewriters. Also, people use different types of pens and inks, such as dark blue ink, to fill in and sign business forms, which can create further problems in character and color recognition. Colors may also vary from form to form. In addition, high resolution is achieved at the expense of document scanning throughput.
Another problem with conventional systems and methods is that they do not address the adverse effects of inherent color noise on the precision and reliability of electronic color dropout. Inherent color noise is frequently induced in a scanning process by chromatic aberration and mis-registration of the red, green, and blue (RGB) signals. A business form normally contains a finite number of uniform colors. Analysis of an electronic version of a business form that has been captured by flatbed scanners or rotary-type scanners reveals thousands of extra colors on the edges of image objects, such as lines and characters. These extra colors are called color fringes. Color fringes do not exist in the original business form documents. The occurrence of these false colors confuses color dropout algorithms based on the minimum distance measures adopted in certain conventional methods, as described in commonly owned U.S. Pat. No. 6,035,058, Savakis et al., issued Mar. 7, 2000. For example, the color of an image pixel near an edge to be retained may be identical to the color of interest to be dropped out. These extra colors generated in a scanning process illustrate the difficulty in attempting to achieve perfect color dropout without losing some edge pixels of image objects. The color dropout technique of the present invention minimizes image information loss while eliminating the color of interest. In addition, the present invention supports dropping multiple colors and is even capable of determining the colors to be dropped.
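By way of illustration only, the following Python sketch shows how mis-registration of a single color signal can create a color fringe at an edge. It is a simplified model and not part of the described method: the function name shift_channel, the one-pixel offset, and the four-pixel scan line are assumptions chosen for the example.

```python
def shift_channel(row, channel, offset):
    """Return a copy of `row` with one color channel displaced by `offset`
    pixels; values at the ends of the row are repeated (hypothetical helper)."""
    n = len(row)
    out = []
    for i, px in enumerate(row):
        src = min(max(i - offset, 0), n - 1)  # clamp the source index to the row
        new_px = list(px)
        new_px[channel] = row[src][channel]
        out.append(tuple(new_px))
    return out


WHITE, BLACK = (255, 255, 255), (0, 0, 0)

# One scan line crossing an edge between white paper and a black character.
row = [WHITE, WHITE, BLACK, BLACK]

# Displace the red signal (channel 0) by one pixel relative to green and blue.
fringed = shift_channel(row, channel=0, offset=1)

print(len(set(row)), "distinct colors before mis-registration")
print(len(set(fringed)), "distinct colors after mis-registration")
```

In this model the original line contains two colors, while the mis-registered line contains a third color at the edge that is present nowhere on the printed form.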
The method of the present invention includes a color dropout technique suitable for high speed document scanning, which minimizes image information loss while completely eliminating the color of interest, even given a wide variety of color business forms. The present invention allows color dropout in two or, if desired, three dimensions. The two dimensional system allows simplification of the hardware required to achieve consistently clear images of data on a variety of business forms. With the present invention, hardware and look-up tables are smaller when compared to other available systems, and system implementation is simpler. In the method of the present invention, a stack of documents is distinguished, colors are selected based on the particular form, colors are detected by original scanning of the form in a color space, and then the image is processed to obtain two-dimensional color maps.
The present invention is an automatic method for processing a color image, comprising the steps of:
a) detecting color in a color form by scanning the color form in a color space, preferably in red, green, and blue (RGB) color space, forming a digital color image, and converting the digital color image to a two-dimensional binary image in chrominance space, and, optionally, a three-dimensional binary image in luminance-chrominance space to determine the color or colors to be dropped; and
b) conducting a color form dropout process.
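By way of illustration only, step a) may be sketched in Python as follows. The text above does not name a particular luminance-chrominance space, so this sketch assumes full-range YCbCr with ITU-R BT.601 (JFIF) coefficients. It also assumes that the two-dimensional binary image is an occupancy map over quantized (Cb, Cr) cells. The function names and the 32-cell quantization are hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr.
    Assumption: BT.601 / JFIF coefficients; the source does not fix the space."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b

    def clamp(v):
        # Round to the nearest integer and keep the result in 0..255.
        return max(0, min(255, int(round(v))))

    return clamp(y), clamp(cb), clamp(cr)


def chrominance_map(pixels, bins=32):
    """Build a bins x bins binary map in which a cell is 1 if any pixel
    falls into that (Cb, Cr) cell. `bins` is assumed to divide 256."""
    step = 256 // bins
    grid = [[0] * bins for _ in range(bins)]
    for r, g, b in pixels:
        _, cb, cr = rgb_to_ycbcr(r, g, b)  # luminance is ignored in the 2-D map
        grid[cb // step][cr // step] = 1
    return grid


# Gray and black pixels carry no chrominance, so both land in the central cell.
neutral = chrominance_map([(128, 128, 128), (0, 0, 0)])
print(sum(map(sum, neutral)), "occupied chrominance cell(s)")
```

Because luminance is discarded, the map in this sketch has two dimensions rather than three, which is consistent with the smaller look-up tables mentioned above.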
An image processing system for automatic color dropout is also included herein. It includes:
1) a color detection system, comprising:
(1a) a color scanner for scanning a color document and providing a digital image;
(1b) a means for converting the color digital image into luminance-chrominance space;
(1c) a means for detecting a background gray level, and assigning it to a Background Value;
(1d) a means for measuring color distribution;
(1e) a means for detecting the number of colors and their distributions;
(1f) a means for generating a color drop table for each color present; and
2) a color dropout system, comprising:
(2a) a color scanner for scanning a color form document and providing a digital image;
(2b) a means for converting the color digital image into luminance-chrominance space;
(2c) a means for storing the color drop table;
(2d) a means for accessing the color drop table;
(2e) a means for applying a color dropout map to the digital image; and
(2f) a means for replacing a pixel value with the Background Value based on the color drop table.
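By way of illustration only, the color dropout system of elements (2b) through (2f) may be sketched in Python as follows. The sketch rests on several assumptions not stated above: full-range YCbCr with BT.601 (JFIF) coefficients as the luminance-chrominance space, a color drop table held as a set of quantized (Cb, Cr) cells, a one-cell tolerance around each form color, and a gray-level output in which retained pixels keep their luminance. All function and parameter names are hypothetical.

```python
def to_ycbcr(r, g, b):
    """8-bit RGB to full-range YCbCr (assumed BT.601 / JFIF coefficients)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return tuple(max(0, min(255, int(round(v)))) for v in (y, cb, cr))


def build_drop_table(form_colors, bins=32, radius=1):
    """Return the set of (Cb, Cr) cells to drop: the cell of each form color
    plus its neighbors within `radius` cells. `bins` is assumed to divide 256."""
    step = 256 // bins
    table = set()
    for r, g, b in form_colors:
        _, cb, cr = to_ycbcr(r, g, b)
        ci, cj = cb // step, cr // step
        for i in range(max(ci - radius, 0), min(ci + radius, bins - 1) + 1):
            for j in range(max(cj - radius, 0), min(cj + radius, bins - 1) + 1):
                table.add((i, j))
    return table


def apply_dropout(pixels, drop_table, background_value, bins=32):
    """Replace each pixel whose chrominance cell is in the drop table with
    the Background Value; otherwise keep the pixel's luminance."""
    step = 256 // bins
    out = []
    for r, g, b in pixels:
        y, cb, cr = to_ycbcr(r, g, b)
        dropped = (cb // step, cr // step) in drop_table
        out.append(background_value if dropped else y)
    return out


# Hypothetical form: pastel pink boxes, black type, dark blue ink, white paper.
form_pink = (250, 180, 180)
pixels = [form_pink, (0, 0, 0), (0, 0, 140), (255, 255, 255)]
table = build_drop_table([form_pink])
print(apply_dropout(pixels, table, background_value=255))
```

In this example the pastel pink pixel is replaced with the Background Value, while the black type and the dark blue ink are retained because their chrominance falls outside the drop table.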