The present exemplary embodiments relate to systems and methods for segmenting text lines in documents, and the use of the segmented text in the determination of marking types in documents.
An automated, electronic-based system having the capability for such detection has uses in a number of environments. For example, in legal document discovery it is valuable for lawyers to be able to quickly narrow down, from millions of pages, those pages which have been marked on. Also, in automated data extraction, the absence of handwritten marks in a signature box can be translated to mean the absence of a signature. Further, being able to tell noise marks apart from machine printed marks can lead to better segmentation for optical character recognition (OCR). It is therefore envisioned that one area in which the present system will find use is the context of forms, where printed or handwritten text may overlap machine printed rules and lines.
The identification of granular noise (sometimes called salt and pepper noise), line graphics, and machine print text has received the most attention in the document image analysis literature. The dominant approaches have relied on certain predictable characteristics of each of these kinds of markings. For example, connected components of pixels that are smaller than a certain size are assumed to be noise; large regions of dark pixels are assumed to be shadows; and long straight runs of pixels are assumed to come from line graphics. Identification of machine print text is an even more difficult task. In commercial OCR packages, systems for the detection of machine printed regions have been heavily hand-tuned, especially for Romanic scripts, in order to work in known contexts of language, script, image resolution and text size. While these processes have had some success when used with clean images, they have not been successful when dealing with images having clutter.
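By way of illustration only, the size and shape heuristics described above can be sketched as follows. The function name, the category labels, and every threshold value are hypothetical and are chosen solely for this example; they are not taken from any of the cited systems.

```python
def classify_component(width, height, pixel_count,
                       noise_max_pixels=4,
                       line_min_length=50,
                       line_max_thickness=3,
                       shadow_min_pixels=2000,
                       shadow_min_fill=0.8):
    """Label one connected component using simple size and shape rules.

    width, height: bounding-box dimensions in pixels.
    pixel_count:   number of foreground pixels in the component.
    All thresholds are illustrative assumptions, not published values.
    """
    # Very small components are assumed to be salt and pepper noise.
    if pixel_count <= noise_max_pixels:
        return "noise"

    # Long, thin components are assumed to come from line graphics.
    long_side = max(width, height)
    short_side = min(width, height)
    if long_side >= line_min_length and short_side <= line_max_thickness:
        return "line graphics"

    # Large, densely filled regions are assumed to be shadows.
    fill = pixel_count / (width * height)
    if pixel_count >= shadow_min_pixels and fill >= shadow_min_fill:
        return "shadow"

    # Everything else is left for a more capable classifier.
    return "unclassified"


if __name__ == "__main__":
    print(classify_component(2, 2, 3))         # isolated speck
    print(classify_component(200, 2, 400))     # clean rule line
    print(classify_component(100, 60, 5500))   # dense dark region
    print(classify_component(200, 40, 700))    # rule line touched by handwriting
```

In this sketch the last call, a rule line merged with handwriting into a single component, matches none of the rules, which is consistent with the observation that such heuristics break down on cluttered images.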
Zheng et al., “Machine Printed Text And Handwriting Identification In Noisy Document Images,” IEEE Trans. Pattern Anal. Mach. Intell., 26(3):337-353, 2004, emphasized classifying regions of pixels (roughly text words) into one of the following categories: machine print text, handwritten text, noise. Zheng et al. employed a large number of features, selected according to discriminative ability for classification. The results are post-processed using a Markov Random Field that enforces neighborhood relationships of text words.
Chen et al., “Image Objects And Multi-Scale Features For Annotation Detection”, in Proceedings of International Conference on Pattern Recognition, Tampa Bay, Fla., 2008, focused on selecting the right level of segmentation through a multiscale hierarchical segmentation scheme.
Koyama et al., “Local-Spectrum-Based Distinction Between Handwritten And Machine-Printed Characters”, in Proceedings of the 2008 IEEE International Conference On Image Processing, San Diego, Calif., October 2008, used local texture features to classify small regions of an image into machine-printed or handwritten.
FIG. 1 shows a portion of a document 100 containing machine graphics 102, machine printed text 104, and handwriting 106. Various applications require separating and labeling these and other different kinds of markings.
A common intermediate step in the art is to form connected components. A problem arises when connected components contain mixed types of markings, especially when machine printed and handwritten text touch graphics, such as rule lines, or touch handwritten annotations that are not part of a given text line. Then, correct parsing requires breaking connected components into smaller fragments. One example is a signature that sprawls across the printed text of a form or letter. Another example is seen in FIG. 1 where the handwritten numbers 106 extend over the machine printed text 104.
FIG. 2 shows connected components (e.g., a sampling identified as 108a-108n) of FIG. 1 in terms of bounding boxes (e.g., a sampling identified as 110a-110n). Clearly many of these connected components can and do include mixtures of marking types. The problem is to break these into smaller meaningful units suitable for grouping and classifying into correct types.
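As a minimal sketch of the intermediate step discussed above, the following forms connected components from a binary image and reports their bounding boxes. The use of 8-connectivity, the breadth-first traversal, the function names, and the toy image are all assumptions made for illustration; they do not represent the method of any cited reference.

```python
from collections import deque


def connected_components(image):
    """Group foreground pixels (value 1) into 8-connected components.

    image: list of equal-length rows containing 0 or 1.
    Returns a list of components, each a list of (row, col) pixels.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search from an unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right), inclusive, for a pixel list."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return (min(ys), min(xs), max(ys), max(xs))


# Toy image: a vertical stroke touches a horizontal rule line, so both
# fall into one component; a second stroke stands alone.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
boxes = [bounding_box(p) for p in connected_components(page)]
print(boxes)
```

In the toy image the stroke and the rule line it touches share a single bounding box, which illustrates why a connected component alone may mix marking types and must be broken into smaller fragments before classification.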
One method for breaking connected components into smaller fragments, recursive splitting, is discussed in commonly assigned U.S. Patent Publication No. US-2011-0007366-A1, published Jan. 13, 2011, to Saund et al., entitled, “System And Method For Classifying Connected Groups Of Foreground Pixels In Scanned Document Images According To The Type Of Marking”.
Another approach is described by Thomas Breuel in “Segmentation Of Handprinted Letter Strings Using A Dynamic Programming Algorithm”, in Proceedings of Sixth International Conference on Document Analysis and Recognition, pages 821-6, 2001.
Still another concept for breaking connected components into smaller fragments is disclosed in U.S. Pat. No. 6,411,733, Saund, “Method and apparatus for separating document image object types.” This applies mainly to separating pictures and large objects from text and from line art. It does not focus on separating small text from small line art or graphics.