The proliferation of scanning technology, combined with ever-increasing computational processing power, has led to many advances in the area of document analysis and in systems for such analysis. These systems may be used to extract semantic information from a scanned document, often by means of optical character recognition (OCR) technology. This technology is used in a growing number of applications such as automated form reading. The technology can also be used to improve compression of a document by selectively using an appropriate compression method depending on the content of each part of the page (e.g., bitmap image, graphical object image, text, etc.). Improved document compression lends itself to applications such as archiving and electronic distribution.
Document analysis can typically be broken into three stages. The first of these stages involves pixel color analysis. The second stage is document layout analysis, which identifies content types such as text, backgrounds and images in different regions of the page. The final stage uses the results of the first two stages and some further analysis to create a final output. The desired final output depends on the application. Some typical applications include document compression, OCR, and vectorization.
Pixel color analysis involves segmenting the scanned images into perceptually uniform color regions. The most common format is a binary image, which can be obtained by various adaptive thresholding methods. Binary format is simple and effective for simple document images because documents generally have dark text on a light background or light text on a dark background. However, as color printing becomes more advanced and widely used, and the choice of colors in documents becomes more diverse, binary representation becomes ineffective. Some combinations of colors cannot be thresholded because they have similar luminance values. Common examples are yellow or green text on a white background. Color pixel segmentation can be used to solve this problem. There are two common methods of color pixel segmentation. The first method is grouping similarly colored pixels together by giving them the same label. The second method is color quantization. Some applications also use a combination of the two methods. FIG. 17(a) and FIG. 18(a) are two examples 1900 and 2000 of typical scanned images for document analysis. The images 1900 and 2000 each include text and a background representing text data and image data used to form a compound document. FIG. 17(b) and FIG. 18(b) are corresponding binarized images derived from FIG. 17(a) and FIG. 18(a). FIG. 17(c) and FIG. 18(c) are the typical output of color pixel segmentation from the same input.
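The luminance problem described above can be illustrated with a minimal sketch. It assumes the Rec. 601 luma weights, a fixed global threshold of 128 in place of an adaptive method, and a hand-picked three-color palette; the function names `luma`, `binarize` and `quantize` are hypothetical and are not taken from any particular system.

```python
import math

WHITE = (255, 255, 255)
YELLOW = (255, 255, 0)
BLACK = (0, 0, 0)


def luma(rgb):
    """Rec. 601 luma of an 8-bit RGB triple (assumed luminance measure)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def binarize(pixels, threshold=128):
    """Label a pixel 1 (dark) if its luma is below the threshold, else 0."""
    return [1 if luma(p) < threshold else 0 for p in pixels]


def quantize(pixels, palette):
    """Label each pixel with the index of its nearest palette color,
    using Euclidean distance in RGB space."""
    return [
        min(range(len(palette)), key=lambda i: math.dist(p, palette[i]))
        for p in pixels
    ]


# One scanline: white background, yellow text, one black pixel.
row = [WHITE, YELLOW, YELLOW, WHITE, BLACK]

# Yellow and white both have high luma, so thresholding merges them.
binary_labels = binarize(row)

# Quantizing against a color palette keeps yellow distinct from white.
color_labels = quantize(row, [WHITE, YELLOW, BLACK])
```

In this sketch the yellow pixels receive the same binary label as the white background, while the palette-based labelling keeps them separate.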
All the pixel segmentation methods mentioned above suffer from noise, because the scanned image is heavily degraded relative to the original raster image processed (RIP'ed) image by the original printing and subsequent scanning processes. Artifacts such as noise in scanned documents affect the accuracy of image type classification for document analysis applications. The types of noise include halftone, registration error, bleeding and JPEG artifacts. Halftone noise is generally the most critical type. Most document analysis applications either employ a pre-processing stage to remove halftone noise or embed halftone detection in an image type classification that is geometrically coarse. Other noise is normally removed by removing small blobs or regions of labelled pixels, or by recursively merging small blobs. The noise removal process is time-consuming and costly to implement.
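Small-blob removal of the kind mentioned above might be sketched as follows. The sketch assumes a binary label grid, 4-connectivity and a caller-chosen minimum blob size; the name `remove_small_blobs` and these choices are illustrative assumptions only.

```python
from collections import deque


def remove_small_blobs(grid, min_size):
    """Return a copy of a binary grid in which every 4-connected
    foreground blob with fewer than min_size pixels is cleared."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if grid[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search to collect one connected blob.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(blob) < min_size:
                for by, bx in blob:
                    out[by][bx] = 0
    return out


noisy = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
# The 2x2 blob survives; the two isolated pixels are treated as noise.
cleaned = remove_small_blobs(noisy, min_size=3)
```

Every pixel is visited and every blob is traversed in full before it can be kept or discarded, which gives some sense of why this step is time-consuming on full-page scans.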
In document copy applications (scan then print), the detection and removal of halftone noise are often performed at the same time by a moving window with the centre pixel being the target. The process produces a set of candidate pixels for the centre pixel using different filters in parallel. The classification state of the centre pixel decides which of the manipulated pixels should be output for the centre pixel. The size of the window is normally quite small. While this implementation is very efficient in hardware, it suffers from inaccurate detection due to the lack of context information. Better context information can be obtained if the window is bigger. However, the hardware implementation cost increases significantly as the window size increases. A software implementation of this method is not desirable because filter operations are very slow in software.
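The moving-window scheme can be sketched roughly as below for a grayscale image. The 3x3 window, the two candidate filters (identity and box mean) and the classification rule (counting neighbours that differ strongly from the centre) are all assumptions made for illustration; the names `window`, `classify_centre` and `filter_image` are hypothetical.

```python
def window(img, y, x, r=1):
    """Collect the (2r+1)x(2r+1) neighbourhood of (y, x), clamping
    coordinates at the image border."""
    h, w = len(img), len(img[0])
    return [
        img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
        for dy in range(-r, r + 1)
        for dx in range(-r, r + 1)
    ]


def classify_centre(win, centre, contrast=64, min_flips=4):
    """Hypothetical rule: call the centre 'halftone' when at least
    min_flips window pixels differ from it by contrast or more."""
    flips = sum(1 for v in win if abs(v - centre) >= contrast)
    return "halftone" if flips >= min_flips else "flat"


def filter_image(img):
    """For each pixel, compute candidate outputs, then let the
    classification of the centre pixel select one of them."""
    out = []
    for y in range(len(img)):
        row = []
        for x in range(len(img[0])):
            win = window(img, y, x)
            centre = img[y][x]
            candidates = {
                "flat": centre,                   # pass-through filter
                "halftone": sum(win) // len(win)  # box-mean smoothing
            }
            row.append(candidates[classify_centre(win, centre)])
        out.append(row)
    return out


# A 4x4 checkerboard stands in for a halftone pattern.
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(4)]
           for y in range(4)]
smoothed = filter_image(checker)

# A uniform image is left unchanged.
flat = [[200] * 4 for _ in range(4)]
unchanged = filter_image(flat)
```

With a window this small, the rule sees only nine pixels, so in this sketch a sharp text corner can trigger the same decision as a halftone dot. That is the lack of context the paragraph above describes.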
In document analysis applications such as OCR, image region segmentation and automatic form processing, the halftone detection process is often embedded into image region classification. Image region classification schemes can range from a simple classification such as halftone/non-halftone to a more complicated classification such as flat/contone/rough halftone/smooth halftone/graphics. The classification normally works on larger non-overlapping blocks. A block of pixels is analysed and a set of features is extracted. The block is then classified into a predefined type based on the extracted features. For a block classified as halftone, a blurring function is normally applied to the whole block to remove the halftone. This method is faster than the moving window style in software implementations, and its accuracy is generally also higher because the block normally provides better context information. However, this method suffers from geometrically coarse classification because each block can only have one classification type, in contrast to each pixel having a classification type in the moving window style.
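A block-based classifier of the simple halftone/non-halftone kind could be sketched as follows. The single feature used here (horizontal transitions across the block mean), the block size and the threshold are illustrative assumptions, and the function names are hypothetical.

```python
def block_features(block):
    """Extract features from one block: its mean gray level and the
    number of horizontal transitions across that mean."""
    flat = [v for row in block for v in row]
    mean = sum(flat) / len(flat)
    transitions = sum(
        1
        for row in block
        for a, b in zip(row, row[1:])
        if (a >= mean) != (b >= mean)
    )
    return {"mean": mean, "transitions": transitions}


def classify_block(features, min_transitions):
    """Assign exactly one type to the whole block."""
    if features["transitions"] >= min_transitions:
        return "halftone"
    return "non-halftone"


def classify_blocks(img, size, min_transitions):
    """Tile the image into non-overlapping size x size blocks and
    return one label per block."""
    labels = []
    for y in range(0, len(img), size):
        row_labels = []
        for x in range(0, len(img[0]), size):
            block = [r[x:x + size] for r in img[y:y + size]]
            feats = block_features(block)
            row_labels.append(classify_block(feats, min_transitions))
        labels.append(row_labels)
    return labels


# Left half: 4x4 checkerboard (halftone-like). Right half: flat gray.
image = [
    [255 if (x + y) % 2 == 0 else 0 for x in range(4)] + [200] * 4
    for y in range(4)
]
labels = classify_blocks(image, size=4, min_transitions=8)
```

The result holds one label per 4x4 block rather than one per pixel, so any text pixels falling inside a block labelled halftone would share that label and any blurring applied to it.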
Geometrically coarse halftone classification affects the visual quality of text, especially in the case of text over halftone. Without halftone removal, the background halftone noise can be incorrectly selected as foreground text and subsequently affect further text processing such as OCR. On the other hand, if halftone is removed from a text-over-halftone block by applying a blurring function to the block, the text in that block will appear blurry. Blurry text can cause broken or missing text strokes when the image is pixel segmented into perceptually uniform color regions. This directly affects the OCR results.
There is a need for a pixel color analysis technique that segments scanned images with high accuracy, thereby accommodating pixel-level detail.