The invention is in the field of processing documents and, more specifically, processing for the purpose of identifying areas on a document as text, halftone, graphics, etc. Such processing is sometimes called page segmentation and can be a part of a larger process or system such as character recognition or data compression.
Page segmentation can be a desirable pre-processing step in document analysis systems. For example, Wahl, M., et al., BLOCK SEGMENTATION AND TEXT EXTRACTION IN MIXED TEXT/IMAGE DOCUMENTS, Computer Vision, Graphics, and Image Processing, Academic Press, 1982, pp. 375-390 discuss a constrained run length algorithm (CRLA) for partitioning documents into areas of text lines, solid black lines and rectangular boxes enclosing graphics and halftone images, and state that the proposed process labels these areas and calculates meaningful features. The paper discusses a linear adaptive classification scheme which makes use of the regular appearance of text lines as textured stripes in order to distinguish text regions from other regions. Additional material in this field is cited under the heading References at page 390. The CRLA discussed by Wahl, M., et al. carries out a bi-level digitization of scanlines into 0's and 1's and then replaces 0's with 1's if the number of adjacent 0's is less than or equal to a predetermined constraint C, such as C = 2. This one-dimensional bitstring operation is applied line-by-line as well as column-by-column to the two-dimensional bitmap of the input document. See FIGS. 1a-1c in the paper. The resulting intermediate bitmaps then are combined by a logical AND operation, to give the result illustrated in FIG. 1d in the paper. In order to remove small gaps in text lines, an additional nonlinear horizontal smoothing is carried out by means of the same CRLA but this time with a higher constraint, such as C_sm = 30, to give the result illustrated in FIG. 1e in the paper.
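By way of illustration only, the CRLA steps summarized above can be sketched as follows. This is a minimal sketch and not the code of Wahl, M., et al.: the function names crla_1d and crla_2d, the separate horizontal and vertical constraints c_h and c_v, and the treatment of runs of 0's that touch the edge of a line (filled like interior runs) are assumptions made for the example.

```python
def crla_1d(bits, c):
    """Replace each run of 0's whose length is <= c with 1's.

    Assumption: runs of 0's touching either end of the line are
    treated the same way as interior runs.
    """
    out = list(bits)
    n = len(bits)
    i = 0
    while i < n:
        if bits[i] == 0:
            j = i
            while j < n and bits[j] == 0:
                j += 1  # j ends one past the last 0 of this run
            if j - i <= c:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out


def crla_2d(bitmap, c_h, c_v, c_sm):
    """Row pass, column pass, logical AND, then horizontal smoothing."""
    by_rows = [crla_1d(row, c_h) for row in bitmap]
    # zip(*bitmap) yields the columns; transpose back after the pass.
    by_cols_t = [crla_1d(list(col), c_v) for col in zip(*bitmap)]
    by_cols = [list(row) for row in zip(*by_cols_t)]
    combined = [
        [a & b for a, b in zip(r1, r2)]
        for r1, r2 in zip(by_rows, by_cols)
    ]
    return [crla_1d(row, c_sm) for row in combined]


if __name__ == "__main__":
    # The two-pixel gap is filled with C = 2; the three-pixel gap is not.
    print(crla_1d([1, 0, 0, 1, 0, 0, 0, 1], 2))
```

Because both the row pass and the column pass assume that blocks are aligned with the scan axes, a sketch of this kind also makes visible why the approach is sensitive to skew.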
One desirable characteristic of page segmentation is robustness with respect to tilt or skew between the scanlines and the lines of text, because document scanners sometimes skew a sheet and because printed lines are not always perpendicular to the feed direction. Another desirable characteristic is a low requirement for processing power, so that the page segmentation stage of the overall process can be fast, preferably less than a second or two per page, and so that it does not require particularly expensive computing equipment.
It is believed that many of the known page segmentation processes, including that discussed in Wahl, M., et al., need to assume that the printed page is made primarily of rectangular blocks with sides parallel to the paper edges. Of course, this assumption may not be valid when the page is skewed relative to the scanline direction or when the print lines on the paper are skewed relative to the edges of the paper. There are discussions in the literature of accounting for skewing, e.g., by pre-processing to derive a skew correction and taking this correction into account in subsequent processing, or by using a Hough transformation. See Baird, H., et al., IMAGE SEGMENTATION BY SHAPE-DIRECTED COVERS, IEEE Proc. 10th ICPR, Atlantic City, N.J., pp. 820-825, June 1990 and Hinds, S. C., et al., A DOCUMENT SKEW DETECTION METHOD USING RUN LENGTH ENCODING AND THE HOUGH TRANSFORMATION, IEEE Proc. 10th ICPR, Atlantic City, N.J., pp. 464-468, June 1990. (Neither of these two papers is necessarily prior art to this invention.) However, such pre-processing can be time consuming and expensive.
Accordingly, an object of the invention is to achieve page segmentation and/or block classification which overcomes or at least reduces the limitations and disadvantages of proposals of the type referred to above, and to achieve this result through a process that is robust with respect to skew and at the same time is fast and does not require excessive computing power.
In order to achieve fast and economical page segmentation, the invention makes use of the recognition that, from a distance, text areas on a page tend to look grey, and that this general property can be used to distinguish quickly between text and blank areas on the page. The invention makes use of additional criteria for rapid and economical discrimination between text areas and other areas that also could look grey from a distance, such as some halftone and graphs.
In order to join coherent intervals or areas, known earlier proposals such as that discussed in Wahl, M., et al. have relied on the assumption of a rectangular structure, thus making the techniques sensitive to tilt or skew. The preferred embodiment of this invention uses graph connecting exemplified by the line adjacency graph (LAG) technique discussed earlier but modified in accordance with the invention to join grey intervals of scanlines into grey areas and to join grey areas that should be joined. See, e.g., the publication by the named inventor Pavlidis, T., A VECTORIZER AND FEATURE EXTRACTOR FOR DOCUMENT RECOGNITION, Computer Vision, Graphics, and Image Processing 35, 111-127 (1986). As used in a preferred embodiment of the invention, the nodes of the LAG correspond to grey intervals and the edges of the LAG join nodes in adjacent scanlines when the corresponding grey intervals of the scanlines would overlap if the two scanlines were overlaid on each other. Then, graph traversal, preferably but not necessarily breadth first graph traversal, is used to construct grey areas. See, also, the references cited at pages 126-7 of the named inventor's article.
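As a non-limiting illustration, the graph construction and breadth first traversal described above might be sketched as follows. The function name grey_areas, the representation of a grey interval as a half-open (start, end) pixel span, and the input layout (one list of spans per scanline) are assumptions made for the example and are not taken from the cited article.

```python
from collections import deque


def grey_areas(intervals_per_line):
    """Group grey intervals into grey areas with a LAG and BFS.

    intervals_per_line[y] is a list of half-open (start, end) spans
    found on scanline y (hypothetical input layout).
    """
    # One LAG node per grey interval: (scanline index, start, end).
    nodes = [
        (y, s, e)
        for y, spans in enumerate(intervals_per_line)
        for (s, e) in spans
    ]
    by_line = {}
    for idx, (y, _, _) in enumerate(nodes):
        by_line.setdefault(y, []).append(idx)

    # One LAG edge per pair of overlapping intervals on adjacent lines.
    adj = {i: [] for i in range(len(nodes))}
    for i, (y, s, e) in enumerate(nodes):
        for j in by_line.get(y + 1, []):
            _, s2, e2 = nodes[j]
            if s < e2 and s2 < e:  # spans overlap when overlaid
                adj[i].append(j)
                adj[j].append(i)

    # Breadth first traversal: each connected component is a grey area.
    seen = set()
    areas = []
    for root in range(len(nodes)):
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        area = []
        while queue:
            cur = queue.popleft()
            area.append(nodes[cur])
            for nb in adj[cur]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        areas.append(area)
    return areas


if __name__ == "__main__":
    # A staircase of intervals, as a skewed text line might produce,
    # plus two isolated intervals that overlap nothing.
    lines = [[(0, 3), (10, 12)], [(2, 5)], [(4, 6), (11, 13)]]
    for area in grey_areas(lines):
        print(area)
```

In the usage example the three staircase intervals end up in one grey area even though no rectangle aligned with the scanlines fits them, which is the property that makes this kind of joining tolerant of skew.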
The information resulting from page segmentation can be used in processes such as character recognition, e.g., as discussed in Kahan, S., et al., ON THE RECOGNITION OF PRINTED CHARACTERS OF ANY FONT AND SIZE, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, No. 2, Mar. 1987, pp. 274-288. See, also, the references cited at page 287 of the paper.
In a particular embodiment, the invention makes use of the recognition that after a document is digitized into a bitmap, using for example line scanning and bi-level digitization, the grey areas of interest tend to be characterized by closely spaced short black intervals along the respective scanlines. To discriminate grey areas that are likely to be text from those that are likely to be halftone, the invention makes use of properties such as whether correlation between scanlines varies with distance between scanlines. Such correlation tends to decrease with distance between scanlines in text but to remain relatively constant in halftone.
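One way such a test could be sketched is shown below, purely as an assumption-laden illustration. The fraction of agreeing pixels is used here as a simple stand-in for correlation, and the function names scanline_agreement and looks_like_text, the separations near and far, and the threshold drop are all hypothetical; the text above does not specify a particular correlation measure or particular thresholds.

```python
def scanline_agreement(rows, d):
    """Fraction of pixels that agree between scanlines d rows apart.

    A simple stand-in for correlation (assumption); rows are
    equal-length lists of 0/1 pixels and d must be < len(rows).
    """
    pairs = [(rows[y], rows[y + d]) for y in range(len(rows) - d)]
    total = sum(len(a) for a, _ in pairs)
    same = sum(1 for a, b in pairs for p, q in zip(a, b) if p == q)
    return same / total


def looks_like_text(rows, near=1, far=3, drop=0.1):
    """True when agreement falls off noticeably with distance.

    near, far and drop are hypothetical tuning values.
    """
    fall = scanline_agreement(rows, near) - scanline_agreement(rows, far)
    return fall > drop


if __name__ == "__main__":
    # A slanted stroke: neighbouring rows are similar, distant rows differ.
    stroke = [[1 if y <= x < y + 3 else 0 for x in range(12)]
              for y in range(6)]
    # A regular dot pattern: agreement is the same at both separations.
    dots = [[(x + y) % 2 for x in range(12)] for y in range(6)]
    print(looks_like_text(stroke), looks_like_text(dots))
```

A practical discriminator would presumably sample more than two separations, since a regular halftone screen can produce agreement that varies periodically rather than staying strictly constant.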
An exemplary and non-limiting process in accordance with the invention can be implemented by scanning a document along scanlines to detect "black" and "white" segments along the respective scanlines, where black and white can be defined with respect to a selected threshold. These black and white segments along a scanline are examined to detect "grey intervals" which can be defined as intervals that are between long white intervals or between a long white interval and an edge of the document. A long white interval can be defined as a white run length of over a certain size. Alternatively, a long white interval can be defined as a sequence of white run lengths separated from each other by very short black run lengths, where "very short" can be defined in absolute terms (e.g., a black run length of a pixel or two) or in relative terms (e.g., a black run length that is a small percentage, such as a few percent, of the preceding and/or succeeding white run length). A grey interval can be made up of closely spaced black run lengths that are short or long, or it could be only a single black run length. A scanline through a character tends to produce such a grey interval. The grey intervals of scanlines can be associated with each other in the scan direction as well as in a direction transverse to the scanline direction (cross-scan direction) to identify "grey areas" defined as areas in which grey intervals are closely spaced. If the scanlines are horizontal, a process embodying the invention can find grey intervals along the respective scanlines and then associate grey intervals into grey areas using the modified LAG followed by graph traversal as earlier discussed. The scanlines used for page segmentation in accordance with the invention need not be as close to each other as those typically used for character recognition. The process can be sped up considerably by using only every n-th, e.g., every 10th, of the scanlines used for character recognition.
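The detection of grey intervals on a single scanline, using the first definition of a long white interval given above (a white run length over a certain size), can be sketched as follows. The sketch is illustrative only: the function names grey_intervals and subsample, the parameter long_white, the encoding of black as 1 and white as 0, and the choice to trim each grey interval to its first and last black pixels are assumptions made for the example.

```python
def grey_intervals(line, long_white):
    """Return half-open (start, end) spans of grey intervals.

    line is a list of 0/1 pixels (1 = black, an assumed encoding).
    A white run longer than long_white pixels closes the current
    interval; shorter white runs stay inside it. Intervals are
    trimmed to their first and last black pixels (assumption).
    """
    intervals = []
    start = None       # first black pixel of the open interval
    last_black = None  # most recent black pixel seen
    white_run = 0      # length of the current white run
    for i, px in enumerate(line):
        if px == 1:
            if start is None:
                start = i
            last_black = i
            white_run = 0
        else:
            white_run += 1
            if start is not None and white_run > long_white:
                intervals.append((start, last_black + 1))
                start = None
    if start is not None:
        # The interval runs up to the edge of the document.
        intervals.append((start, last_black + 1))
    return intervals


def subsample(bitmap, n):
    """Keep only every n-th scanline, e.g. n = 10."""
    return bitmap[::n]


if __name__ == "__main__":
    line = [0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0]
    print(grey_intervals(line, long_white=2))
```

In the usage example the single white pixel between the first black runs is absorbed into one grey interval, while the four-pixel white run separates it from the isolated black pixel that follows.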
Grey areas that are more likely to be text than, say, halftone, can be identified in accordance with the invention by testing the relationship between correlation of scanlines and distance between scanlines. Earlier known applications of the LAG are believed to have used the "depth first traversal" technique. A preferred embodiment of the invention uses a "breadth first traversal" at this stage of implementing the overall invented process.
Page segmentation in accordance with the invention is believed to be significantly more robust with respect to skew as compared with known prior processes that assume the absence of skew, and is believed to be considerably faster than known prior processes that pre-correct for skew. It is believed that a process in accordance with the invention could typically do page segmentation within about 2 seconds per page using equipment with the computing power of a current generation SPARC workstation.