In production scanning environments, a scanned document may have color content that is not pertinent to the data that is needed from the document. For example, pre-printed applications, tax forms, and other documents can contain form color areas, including printed instructions, lines, boxes, or symbols that guide the user of the document to fields that require human entry, where the entered information is typically in pencil or dark ink. Many types of pre-printed forms use pre-printed location markings for character entry, thus confining entered characters or other markings to specific locations and sizes. The use of such location markings then facilitates optical character recognition (OCR) scanning that automates reading of character content entered by the person who completed the form.
For the purpose of clear description, the present application uses the term “form color” to identify color content that can be ignored and “dropped” from the scanned image data for a scanned form or other document. The form color is non-neutral, so that the red (R), green (G), and blue (B) data values corresponding to a form color differ from each other. The data of interest on a scanned form or other document is dark neutral data, termed “neutral data” in this application. Neutral data represents any user-entered text markings, such as those that might have been made on a form in pen or pencil, or printed data that is entered into a form or document. In many applications, neutral data that is scanned from a form or other document is further processed using OCR or other utilities. The term “background color” has its conventional meaning as the term is used in the document scanning arts. That is, a background color is generally the color of the medium upon which text or form content is entered or printed. The background color is typically neutral, such as white or off-white, but it can also be non-neutral, such as where a document is printed on colored paper or another colored medium. In bitonal scanning, for example, the background color is preferably shifted to a white or very light grey color, to heighten the contrast between the background and text or form color content.
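The distinction between non-neutral form color and dark neutral data can be illustrated with a minimal sketch. The function names, the labels, and the threshold values `neutral_tolerance` and `dark_limit` are assumptions chosen for illustration only; the text above does not specify them.

```python
def channel_spread(pixel):
    """Return the difference between the largest and smallest of R, G, B."""
    return max(pixel) - min(pixel)


def classify_pixel(pixel, neutral_tolerance=24, dark_limit=128):
    """Label an 8-bit (R, G, B) pixel under the assumed thresholds.

    neutral_tolerance: hypothetical maximum channel spread for a pixel
        to still count as neutral (R, G, B roughly equal).
    dark_limit: hypothetical mean brightness below which a neutral
        pixel is treated as dark neutral data.
    """
    if channel_spread(pixel) > neutral_tolerance:
        # R, G, B differ from each other: non-neutral content, which is
        # either form color or a colored background medium.
        return "color"
    brightness = sum(pixel) / 3
    if brightness < dark_limit:
        return "neutral_data"   # e.g. pencil, dark ink, printed text
    return "light_neutral"      # e.g. white or off-white background
```

Under these assumed thresholds, a dark pencil pixel such as `(30, 30, 35)` would be labeled as neutral data, while a red form line such as `(220, 60, 60)` would be labeled as color.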
In order to store and process such scanned documents more efficiently, it is useful to remove unwanted form color from scanned document image data. Conventionally, this has been done in a number of ways. A number of approaches for scanning pre-printed documents such as forms use information known beforehand about the spectral content of the pre-printed documents themselves and use scanning hardware that is suitably adapted to eliminate this spectral content. For example, Reissue Patent RE29,104 (Shepard) utilizes a laser scanner unit adapted to scan a document, wherein the wavelength of the laser is matched to the color of the markings on the documents, so that the light reflected from the markings has the same intensity as the light reflected from the document background. The pre-printed character location markings are thus “blinded” and do not interfere with the reading of the characters. In other approaches, various types of optical filters have been employed, again, with foreknowledge of colors expected on the pre-printed form.
Other approaches for separating the neutral data of interest from the form color operate on the color data itself. For example, U.S. Pat. No. 5,335,292 (Lovelady et al.) describes a remapping of color data to the background, effectively “blinding” an OCR system to unwanted colors on the document, again wherein the colors are known beforehand. Training can also be used, so that a scanning system “learns” how to process a set of documents. However, training has a number of pitfalls. For example, it requires a separate training operation and utility. Training applications are restrictive as to color and, in many cases, work well only when the scanned document has a high level of content in one of the red, green, or blue color channels. Training is not only time-consuming, but also requires that a properly trained operator be on hand to review and verify results.
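The general idea of remapping a color that is known beforehand to the background can be sketched as follows. This is an illustrative sketch only, not the method of the cited patent; the function name `drop_known_color`, the per-channel `tolerance` test, and the white default background are all assumptions.

```python
def drop_known_color(pixels, form_color, background=(255, 255, 255), tolerance=40):
    """Remap pixels close to a pre-specified form color to the background.

    pixels: list of rows, each row a list of 8-bit (R, G, B) tuples.
    form_color: the (R, G, B) color expected on the form, known in advance.
    tolerance: hypothetical per-channel distance within which a pixel
        is considered to match the form color.
    """
    def matches(pixel):
        return all(abs(a - b) <= tolerance for a, b in zip(pixel, form_color))

    return [[background if matches(p) else p for p in row] for row in pixels]
```

Because the match is tied to one pre-specified color, a pixel whose color falls outside the tolerance, for instance `(150, 40, 40)` against an expected `(205, 50, 45)`, is left in place under these assumed values.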
Solutions such as those described can be used to remove form color content in situations where the unwanted form color or colors are known beforehand. However, these solutions constrain color dropout for any scanning system so that it can only be used with a specific set of documents. With hardware solutions such as color filters or use of scanning light having a certain wavelength, the scanning optics are matched to the document, so that color dropout is available only for documents having that specific color. Image processing solutions that check for certain form colors are similarly limited, although such systems can be more easily “retrained” or re-programmed to identify and remove other colors. Nevertheless, solutions looking for a specific form color or set of colors do not provide a flexible solution that can be used with a broad range of documents having color content. This can have a negative impact on workflow, for example, since it requires manual sorting of documents with different form colors so that they are directed to different scanning systems. Other, more subtle problems include differences between ink batches and print runs, causing shifts in spectral content for documents that are of the same type, but were printed at different times or locations.
In an attempt to provide a more flexible color detection and dropout scheme, U.S. Pat. No. 7,085,413 (Huang et al.) describes the use of a color histogram obtained from the scanned document, wherein a dominant color can be identified and removed if it exceeds a threshold luminance. This type of approach is at least more dynamic than approaches described earlier that required prior knowledge of the unwanted color or colors. However, the approach described in the '413 Huang et al. disclosure, like similar approaches that simply remove entire color channels in order to remove unwanted form colors, risks discarding desired information from the scanned data. Such approaches offer limited performance, particularly where differences between form colors and color content may vary widely. Such solutions may be acceptable where documents have a substantial amount of color content, at least half by area of a single color, for example, or where a document is provided on a colored paper stock. However, such an approach is not well suited for scanning documents that may have some small amount of color content or may have multiple colors.
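A histogram-based dropout of a single dominant color can be sketched in generic terms as follows. This is not the procedure of the '413 disclosure; the bin size `step`, the `neutral_tolerance` filter, the `luminance_threshold` value, the white background, and all function names are assumptions made for illustration. The luminance formula uses the standard Rec. 601 luma weights.

```python
from collections import Counter

BACKGROUND = (255, 255, 255)  # assumed background color


def quantize(pixel, step=32):
    """Map an 8-bit (R, G, B) pixel to a coarse histogram bin."""
    return tuple(c // step for c in pixel)


def luminance(pixel):
    """Approximate luminance using Rec. 601 luma weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def dominant_color_bin(pixels, step=32, neutral_tolerance=24):
    """Return the most populated bin among non-neutral pixels, or None."""
    counts = Counter(
        quantize(p, step)
        for row in pixels
        for p in row
        if max(p) - min(p) > neutral_tolerance  # skip near-neutral pixels
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def drop_dominant_color(pixels, step=32, luminance_threshold=80):
    """Remap the dominant color bin to the background if it is bright enough."""
    target = dominant_color_bin(pixels, step)
    if target is None:
        return pixels
    centre = tuple(i * step + step // 2 for i in target)  # bin midpoint
    if luminance(centre) <= luminance_threshold:
        return pixels
    return [
        [BACKGROUND if quantize(p, step) == target else p for p in row]
        for row in pixels
    ]
```

In this sketch only the single most populated color bin is removed, so a page carrying a second form color keeps that second color, which illustrates why a dominant-color approach fits poorly with documents having multiple colors.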
Ideally, a color dropout scheme preserves grayscale neutral data content in a document, such as pencil or pen marks or dark text entered by a printer, so that this content can be stored or used for further processing, such as for OCR processing or mark-sense applications. An acceptable color dropout scheme would discard unwanted form color, dropping color pixels of one or more form colors into the document background, without compromising the quality of the neutral data. Moreover, it would be highly advantageous for a scanning system to have a color dropout method that automatically adapts to paper stocks having different background colors, that identifies the form color content independently on each scanned document, and that takes the necessary steps to remove form color while preserving the desired information that is provided as neutral data.