The present application is directed to document image analysis, and more particularly to automated differentiation between types of markings found on documents.
An automated electronic-based system capable of such detection has uses in a number of environments. For example, in legal document discovery it is valuable for lawyers to be able to quickly narrow down, from millions of pages, those pages which have been marked on. Also, in automated data extraction, the absence of handwritten marks in a signature box can be interpreted as the absence of a signature. Further, being able to tell noise marks apart from machine printed marks can lead to better segmentation for optical character recognition (OCR). It is therefore envisioned that one area in which the present system will find use is the context of forms, where printed or handwritten text may overlap machine printed rules and lines.
Identification of granular noise (sometimes called salt and pepper noise), line graphics, and machine print text has received the most attention in the document image analysis literature. The dominant approaches have relied on certain predictable characteristics of each of these kinds of markings. For example, connected components of pixels that are smaller than a certain size are assumed to be noise; large regions of dark pixels are assumed to be shadows; and long straight runs of pixels are assumed to come from line graphics. Identification of machine print text is an even more difficult task. In commercial OCR packages, systems for the detection of machine printed regions have been heavily hand-tuned, especially for Romanic scripts, in order to work in known contexts of language, script, image resolution and text size. While these processes have had some success when used with clean images, they have not been successful when dealing with cluttered images.
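The size and shape heuristics described above can be illustrated with a minimal sketch. It is not the method of any cited work or of any commercial OCR package. The function names, the 8-connectivity choice, the bounding-box test for straight runs, and all threshold values are assumptions chosen only for illustration, and the image is assumed to be a binary grid in which 1 denotes a dark pixel.

```python
from collections import deque


def connected_components(image):
    """Return the 8-connected components of a binary image.

    `image` is a list of rows of 0/1 values; each component is returned
    as a list of (row, col) coordinates of its dark pixels.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited dark pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def classify_component(pixels, noise_max_pixels=3, line_min_length=10,
                       line_max_thickness=2, shadow_min_pixels=200):
    """Label one component using hypothetical size and shape thresholds."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    if len(pixels) <= noise_max_pixels:
        return "noise"    # very small component: assumed granular noise
    if (max(height, width) >= line_min_length
            and min(height, width) <= line_max_thickness):
        return "line"     # long, thin bounding box: assumed line graphics
    if len(pixels) >= shadow_min_pixels:
        return "shadow"   # large dark region: assumed shadow
    return "other"        # left for further analysis (e.g. text)


def label_image(image):
    """Classify every connected component, in raster-scan order."""
    return [classify_component(p) for p in connected_components(image)]
```

Because each rule depends on fixed thresholds, a sketch of this kind would need retuning for every change in resolution or text size, and a handwritten stroke that touches a printed rule would merge with it into a single component. This is consistent with the difficulty on cluttered images noted above.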
Zheng et al., “Machine Printed Text And Handwriting Identification In Noisy Document Images,” IEEE Trans. Pattern Anal. Mach. Intell., 26(3):337-353, 2004, emphasized classifying regions of pixels (roughly text words) into one of the following categories: machine print text, handwritten text, or noise. Zheng et al. employed a large number of features, selected according to their discriminative ability for classification. The results were post-processed using a Markov Random Field that enforces neighborhood relationships of text words.
Chen et al., “Image Objects And Multi-Scale Features For Annotation Detection,” in Proceedings of International Conference on Pattern Recognition, Tampa Bay, Fla., 2008, focused on selecting the right level of segmentation through a multiscale hierarchical segmentation scheme.
Koyama et al., “Local-Spectrum-Based Distinction Between Handwritten And Machine-Printed Characters,” in Proceedings of the 2008 IEEE International Conference On Image Processing, San Diego, Calif., October 2008, used local texture features to classify small regions of an image as either machine-printed or handwritten.