1. Field of Invention
The present invention relates to identification of text components and non-text components in an image document, such as implemented in optical character recognition applications.
2. Description of Related Art
Optical character recognition, or OCR, is a broad term applied to the general field of using machines to recognize human-readable glyphs, such as alphanumeric text characters and Chinese written characters, or more generally, Asian written characters. For ease of explanation, both alphanumeric text characters and Chinese written characters are hereinafter referred to as “text” or “text character”. There are many approaches to optical character recognition, such as that discussed in U.S. Pat. No. 5,212,741.
An integral part of OCR, however, is a first step that classifies the pixels of an image as text pixels (if they are deemed to be part of a text character) or non-text pixels (if they are not deemed to be part of a text character). Typically, a collection of text pixels may be termed a text component, and a collection of non-text pixels may be termed a non-text component. Text pixels may then be further processed to identify specific text characters, such as Western text characters or Asian written characters.
An integral part of the pixel classification process is identifying the foreground pixels and limiting the classification process to those pixels. Typically, connected component (CC) structures of the foreground pixels are constructed, and the pixels defined by the CC structures are treated as candidate pixels that may then be classified as text pixels or non-text pixels.
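By way of illustration only, the foreground separation and CC construction described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a method taken from the cited art: the fixed threshold of 128, the 4-connectivity rule, the dark-ink-on-light-paper convention, and the function names `foreground_mask` and `connected_components` are all hypothetical choices made for the example.

```python
from collections import deque


def foreground_mask(gray, threshold=128):
    """Mark pixels darker than `threshold` as foreground.

    Assumption: dark markings on a light background; the threshold is illustrative.
    """
    return [[value < threshold for value in row] for row in gray]


def connected_components(mask):
    """Group 4-connected foreground pixels into components.

    Returns a list of components, each a list of (row, col) coordinates.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


# Toy grayscale image: a three-pixel stroke and one isolated dark speck.
gray = [
    [255, 255, 255, 255, 255, 255],
    [255,  10,  10, 255, 255, 255],
    [255,  10, 255, 255,  90, 255],
    [255, 255, 255, 255, 255, 255],
]
components = connected_components(foreground_mask(gray))
sizes = sorted(len(component) for component in components)
```

In this toy image both the stroke and the isolated speck become candidate components, so every pixel that survives foreground separation is passed on to the text/non-text classification stage.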
Various approaches to distinguishing text pixels from non-text pixels of an image have been proposed. For example, U.S. Pat. No. 6,038,527 suggests searching a document image for word-shape patterns.
The process of identifying text pixels is complicated when an image document being processed has a mixture of text and non-text representations. That is, if the image document includes photographs or line illustrations, some of these non-text regions may be erroneously identified as text regions, resulting in the misclassification of pixels. At best, this slows down the overall process, since non-text pixels are erroneously processed for text identification only to be rejected as non-text. At worst, processing of the misclassified pixels may result in their being wrongly identified as true text characters, resulting in a human-discernible error in the output.
This misclassification error is exacerbated in scanned documents. Text regions are typically restricted to foreground regions of an image, and thus an initial step in pixel classification is to separate the foreground pixels from the background pixels in a scanned document. Connected component (CC) operations are then conducted on the foreground pixels to identify candidate components (i.e., candidate pixels) for classification. Unfortunately, scanned documents typically develop artifacts throughout the scanned document, including within background areas. These artifacts appear as intentional markings within a background area and thus can be mistakenly identified as foreground pixels.
This issue is particularly acute in printed documents having colorful backgrounds and patterns, where halftone textures that are part of the printing process may show up as artifacts in their scanned representations. The artifacts cause the background to not be smooth or homogeneous, leading to the artifacts being erroneously identified as foreground pixels subject to CC operations. Thus, the artifacts tend to become candidate pixels, at best, or to be erroneously identified as text characters, at worst.
What is needed is a method of minimizing the misclassification of photo pixels, line drawing pixels, etc., as text pixels.