In document image processing there is a need to extract textual information from an image that has color content in the background. The removal of the color content is useful in specific applications, such as forms processing, where the color content on the form, used to facilitate data entry, adds no value to subsequent data processing. Color dropout reduces the image file size, eliminates extraneous information, and simplifies the task of extracting textual information from the image for the reader or processing system.
An example of an application where color dropout is important is in the field of optical character recognition (OCR). In the OCR process, a document 10, an example of which is shown in FIG. 1, is scanned electronically, converting all the information to a digital image. Once the data is captured in electronic form, the information to be read is separated from the background information, such as boxes and text with instructions on how to complete the form. This process results in the elimination of all but essential information, as shown in FIG. 2. Once this separation is accomplished, the text fields of the image are extracted and processed by an OCR algorithm.
A scanning system capable of capturing an image in color produces a digital image file with three color components. The number of pixels in the color image depends on the resolution, in dots per inch, resolved by the camera optics and detector. The numerical value at each pixel of a color component represents the amount of that primary color detected at that pixel. Where all three color components have the same value at a pixel, that pixel appears as a shade of gray. As the intensity of each color component is reduced, the gray approaches black.
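The relationship between the three color components and a gray appearance can be illustrated with a minimal sketch. It assumes 8-bit components (0 to 255) ordered red, green, blue; the names `is_gray` and `gray_level`, and the `tolerance` parameter, are hypothetical and introduced only for illustration.

```python
from typing import Tuple

# Assumption: each pixel is (red, green, blue) with 8-bit values, 0-255.
Pixel = Tuple[int, int, int]


def is_gray(pixel: Pixel, tolerance: int = 0) -> bool:
    """Return True when all three components are equal (within tolerance)."""
    return max(pixel) - min(pixel) <= tolerance


def gray_level(pixel: Pixel) -> float:
    """Mean of the three components; lower values look darker."""
    return sum(pixel) / 3.0


if __name__ == "__main__":
    for name, px in [
        ("light gray", (200, 200, 200)),
        ("near black", (10, 10, 10)),
        ("pastel pink", (250, 200, 200)),
    ]:
        print(name, px, "gray" if is_gray(px) else "colored", gray_level(px))
```

A real scanner rarely returns exactly equal components for neutral ink, which is why a nonzero `tolerance` would usually be needed in practice.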
Business forms are typically printed with some background color, for example, a pastel color. One way of eliminating this background color is to use an optical filter in the electronic scanner, matched to the background color to be eliminated. The color filter prevents the scanner detector from discerning information printed in that color; therefore, the pastel background appears white to the scanner. Text printed in black or another dark color is still captured by the scanner. This system works, but it limits the dropout color to that of the filter installed on the scanner, which must match the background color on the forms. Thus, forms of different colors require different filters.
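The limitation of a fixed optical filter can be illustrated with a rough digital simulation. The sketch below assumes an idealized filter that passes exactly one primary, so the detector sees only that component of each pixel; real filters have broader passbands. The function `scan_through_filter` and the sample colors are hypothetical.

```python
from typing import List, Tuple

# Assumption: pixels are (red, green, blue), 8-bit values.
Pixel = Tuple[int, int, int]
CHANNEL_INDEX = {"red": 0, "green": 1, "blue": 2}


def scan_through_filter(image: List[List[Pixel]], filter_color: str) -> List[List[int]]:
    """Simulate an ideal filter: the detector records only one primary."""
    idx = CHANNEL_INDEX[filter_color]
    return [[pixel[idx] for pixel in row] for row in image]


PINK = (255, 200, 200)   # pastel red background
GREEN = (200, 255, 200)  # pastel green background
BLACK = (0, 0, 0)        # dark text

pink_form = [[PINK, BLACK, PINK]]
green_form = [[GREEN, BLACK, GREEN]]

# A red filter matched to the pink form: background reads as full white (255).
pink_scan = scan_through_filter(pink_form, "red")
# The same red filter on a green form: background stays visibly gray (200).
green_scan = scan_through_filter(green_form, "red")

if __name__ == "__main__":
    print(pink_scan)
    print(green_scan)
```

In this simulation the pink background drops out completely while the green background does not, which mirrors why each form color needs its own filter.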
In one color dropout system currently available, codes are stored in a lookup table for dropout of cyan, magenta, or yellow (CMY). See U.S. Pat. No. 4,727,425. Another method of determining dropout colors is disclosed in U.S. Pat. Nos. 5,014,328 and 5,014,329, wherein the dropout color is selected as the average color of a calibration zone or patch of the document to be scanned. The coefficients of a color filter are then selected to tune out the red, green, and blue (RGB) components of the dropout color. Another approach is disclosed in U.S. Pat. No. 5,664,031, wherein a blank form is scanned and all the RGB color information is stored in memory. The stored blank form is then digitally compared with the completed form for the purpose of color dropout.
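The idea of taking the dropout color from a calibration patch can be sketched as follows. This is not the method of the cited patents, which select color filter coefficients; it substitutes a simple Euclidean distance threshold in RGB space for illustration. The names `average_color` and `drop_color`, the `threshold` value, and the sample pixel data are all assumptions.

```python
from typing import List, Tuple

# Assumption: pixels are (red, green, blue), 8-bit values.
Pixel = Tuple[int, int, int]
WHITE = 255


def average_color(patch: List[Pixel]) -> Tuple[float, ...]:
    """Average each component over the pixels of a calibration patch."""
    n = len(patch)
    return tuple(sum(p[i] for p in patch) / n for i in range(3))


def drop_color(image: List[List[Pixel]], dropout: Tuple[float, ...],
               threshold: float = 40.0) -> List[List[int]]:
    """Map pixels near the dropout color to white; others to a gray level."""
    result = []
    for row in image:
        out_row = []
        for p in row:
            # Euclidean distance from the dropout color (illustrative choice).
            dist = sum((p[i] - dropout[i]) ** 2 for i in range(3)) ** 0.5
            out_row.append(WHITE if dist <= threshold else round(sum(p) / 3))
        result.append(out_row)
    return result


# Hypothetical calibration patch sampled from the form background.
patch = [(250, 200, 200), (254, 204, 196), (252, 196, 204)]
dropout = average_color(patch)

# Hypothetical 2x2 scan: background pixels and dark text pixels.
image = [[(251, 201, 199), (20, 20, 20)],
         [(30, 30, 30), (253, 198, 202)]]
cleaned = drop_color(image, dropout)

if __name__ == "__main__":
    print(dropout)
    print(cleaned)
```

The threshold trades off between leaving background residue (too small) and erasing text whose color happens to lie near the background (too large).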
A system that automatically identifies the color of the desired textual information and eliminates all other colors is desirable.