In document image processing, many applications require the extraction of textual information from an image that has color content in the background. Removing colored backgrounds in some documents is useful in specific applications, such as forms processing. In some forms, colored backgrounds are provided for different data fields to facilitate data entry. Retention of the color data obtained in scanning is unnecessary for subsequent data processing. Therefore, removing the colored backgrounds, also known as color dropout, in such documents reduces the image file size, eliminates extraneous information, and simplifies the task of extracting textual information from the image for a processing system.
One application where color dropout is important is in the field of optical character recognition (OCR). Electronic color form dropout helps eliminate the lines and colors surrounding the text of interest so the character forms may be more readily recognized by the OCR application. In the OCR process, a document is scanned electronically, which converts the data on the form to a digital image. This digital image data may then be processed to remove background information, such as boxes and instructions for completing the form. One aspect of removing the background information may include color dropout for color forms or forms that have color-encoded data fields.
A scanning system typically generates a digital image file with three color components, such as red, green and blue (“RGB”). The number of pixels in the color image depends on the resolution of the scanning components. The numerical value at each pixel of a color component represents the density of the particular color component for a corresponding pixel.
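By way of illustration only, the relationship between scan resolution, pixel count, and the three color components may be sketched as follows. The function names, the page dimensions, and the representation of an image as nested lists of (red, green, blue) tuples are assumptions made for this sketch and are not taken from any particular scanning system.

```python
def pixel_count(width_in, height_in, dpi):
    """Number of pixels produced by scanning a page of the given size
    (in inches) at the given resolution (dots per inch)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px


def split_components(image):
    """Split an image, given as rows of (r, g, b) tuples, into three
    separate component planes. Each plane holds one numerical value
    per pixel for its color component."""
    red = [[pixel[0] for pixel in row] for row in image]
    green = [[pixel[1] for pixel in row] for row in image]
    blue = [[pixel[2] for pixel in row] for row in image]
    return red, green, blue


# Hypothetical 2 x 2 image used only to exercise the functions.
sample = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (30, 30, 30)],
]
red_plane, green_plane, blue_plane = split_components(sample)
letter_page_pixels = pixel_count(8.5, 11, 300)
```

Under these assumptions, a letter-size page scanned at 300 dots per inch yields 2550 by 3300 pixels, each carrying one value in each of the three component planes.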
One way of removing pixels of a particular color is to use an optical filter during scanning. The filter effectively blocks the color that matches the optical filter so that the matching color appears white to the scanner. Any printing in black or any color other than the filter color is captured by the scanner. While this system is able to drop out a color from a document, it requires different filters for different colors, and only one color can be filtered in a scan.
Systems are also known that process the digital image data generated by a document scan to drop out certain colored pixels. One such system is the one disclosed in U.S. Pat. No. 6,035,058, which issued to Savakis, et al. on Mar. 7, 2000. This type of system compares a measured distance between a point in a color space and a non-dropout color to a minimum threshold and determines whether a pixel should be white or black. One limitation of this type of system is the inability of the system to drop out shades of colors. Consequently, artifacts of a color background may arise from the incomplete removal of color that appears in slightly different shades during a scan. The different shades may arise from the quality of the document scanned or from fringe effects or the like during scanning.
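The general idea of a distance comparison in a color space may be sketched as follows. This is a simplified illustration and not the method of the cited patent: the use of Euclidean distance in RGB space, the function name classify_pixel, the single non-dropout color, and the threshold value of 100 are all assumptions chosen for the example.

```python
import math


def classify_pixel(pixel, non_dropout_colors, threshold):
    """Return 0 (black) when the pixel lies within `threshold` of any
    non-dropout color, and 255 (white) otherwise.

    Assumption: distance is Euclidean distance between (r, g, b) points.
    """
    for reference in non_dropout_colors:
        if math.dist(pixel, reference) <= threshold:
            return 0  # close to a color to be retained: keep as black
    return 255  # far from every retained color: drop out to white


# Hypothetical values: black ink is the only non-dropout color.
ink_colors = [(0, 0, 0)]
dark_text = classify_pixel((10, 10, 10), ink_colors, 100)
light_background = classify_pixel((255, 200, 200), ink_colors, 100)
dark_shade = classify_pixel((90, 30, 30), ink_colors, 100)
```

In this simplified model, the light background pixel is dropped to white, while the darker shade of the same background hue falls within the threshold of the ink color and is retained as black, which is one way residual background artifacts of the kind described above could appear.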
Dropout color processing is also performed in monochrome printers that convert color documents before printing them in either black and white or grayscale. Such systems typically convert the color image data to a chrominance-luminance color space and then compare the converted data to preset thresholds to determine whether to print the pixel as a white, black, or grayscale pixel. The image data processors in these types of printers do not detect shades of colors well.
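A conversion to a chrominance-luminance color space followed by comparison to preset thresholds may be sketched as follows. The sketch assumes the full-range ITU-R BT.601 coefficients for the conversion; the function names and the three threshold values are hypothetical and are not taken from any particular printer.

```python
import math


def rgb_to_ycbcr(r, g, b):
    """Convert 8-bit RGB to luminance (Y) and chrominance (Cb, Cr)
    using full-range BT.601 coefficients (an assumption of this sketch)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def map_pixel(pixel, chroma_threshold=40, black_threshold=60,
              white_threshold=200):
    """Map a color pixel to a monochrome output value in 0..255.

    Strongly colored pixels are dropped to white; the remaining pixels
    are printed black, white, or gray according to their luminance.
    All threshold values are hypothetical.
    """
    y, cb, cr = rgb_to_ycbcr(*pixel)
    chroma = math.hypot(cb - 128, cr - 128)  # distance from neutral gray
    if chroma > chroma_threshold:
        return 255  # saturated color: drop out
    if y < black_threshold:
        return 0
    if y > white_threshold:
        return 255
    return round(y)  # grayscale


saturated_red = map_pixel((255, 0, 0))
black_text = map_pixel((0, 0, 0))
muted_shade = map_pixel((200, 150, 150))
```

With these assumed thresholds, a saturated red pixel is dropped to white, but a muted shade of the same hue falls below the chrominance threshold and is printed as a gray value, which illustrates how fixed thresholds of this kind can handle shades of a color poorly.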