The present invention is generally directed to techniques for analyzing image data, and more particularly to the analysis of image data representing images containing text to identify running and non-running text regions therein without assuming a specific page layout. The invention employs the characteristics of running text regions to distinguish them from non-running text regions in a page image. As the input of the present invention, there is assumed to be a digital representation of a hardcopy document that contains text regions. In addition, the image will have been processed by any one of a number of well-known processes for segmenting text regions in an image and identifying the boundaries of such regions. While it is understood that such boundaries may be represented by a polynomial, it will be assumed that such boundaries are rectangular in shape for purposes of simplifying the description of the instant invention.
This invention presents a geometric, bottom-up method for partitioning a scanned page into two kinds of regions: (i) regions encompassing running text, i.e., text formatted in paragraphs and columns; and (ii) regions encompassing text formatted in other layout structures, such as headings, lists, and tables. This partitioning supports classification of the non-running text regions and selective scanning for text recognition. For example, table detection, crucial for establishing reading order, is done by testing the non-running text regions for alignment relations.
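By way of illustration only, the alignment test mentioned above might be sketched as follows. The description here does not state which alignment relations are tested, so the rule used (at least two shared left edges and at least two shared top edges among the region's rectangles), the tolerance `tol`, and the names `cluster_1d` and `looks_tabular` are all assumptions made for this sketch.

```python
from typing import List, Tuple

# Hypothetical box layout: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def cluster_1d(values: List[int], tol: int) -> List[List[int]]:
    """Group indices whose values lie within `tol` of the previous sorted value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    clusters: List[List[int]] = []
    for i in order:
        if clusters and values[i] - values[clusters[-1][-1]] <= tol:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters


def looks_tabular(boxes: List[Box], tol: int = 5) -> bool:
    """Assumed rule: a region is table-like when its boxes form at least two
    left-aligned columns and at least two top-aligned rows, each with 2+ boxes."""
    cols = [c for c in cluster_1d([b[0] for b in boxes], tol) if len(c) >= 2]
    rows = [r for r in cluster_1d([b[1] for b in boxes], tol) if len(r) >= 2]
    return len(cols) >= 2 and len(rows) >= 2


grid = [(10, 10, 60, 30), (100, 11, 150, 30), (10, 50, 60, 70), (101, 50, 150, 70)]
stacked = [(10, 10, 200, 30), (10, 40, 200, 60)]
print(looks_tabular(grid), looks_tabular(stacked))
```

Under these assumptions a two-by-two grid of cells is reported as table-like, while two stacked lines that share only a left edge are not.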
Page layout analysis processes may be divided into two broad categories: geometric layout analysis and logical structure analysis (R. Haralick. Document image understanding: geometric and logical layout. Proc. IEEE Conf. On Computer Vision and Pattern Recognition, 1994: 385-390). Geometric layout analysis, also termed bottom-up analysis, extracts whatever structure can be inferred without reference to models of particular kinds of pages - e.g., letter, memo, title page, table, etc. Intuitively, this is the structure that would be apparent even to an illiterate person. Also, it is the structure common to pages of all kinds. Logical structure analysis classifies a given page within a repertoire of known layouts, and assigns functional interpretations to components of the page based on this classification. Geometric analysis is generally preliminary to logical structure analysis.
Bottom-up analysis schemes attempt to segment a page into homogeneous regions of text, line art (graphics), and photographs (halftone images), and then stop. This is normally taken to be the highest level of structure that can be established bottom-up. However, the present invention establishes that a yet higher level of structure can be extracted using a bottom-up analysis technique, and that this can benefit subsequent logical structure analysis.
Heretofore, a number of patents and publications have disclosed methods for segmenting images and identifying structure therein, the relevant portions of which may be briefly summarized as follows:
Text block segmentation has been addressed by R. Haralick, "Document image understanding: geometric and logical layout," Proc. IEEE Conf. On Computer Vision and Pattern Recognition, 1994: 385-390. Component aggregation or clustering methods that assemble homogeneous regions from individual connected components subject to size similarity and proximity constraints were described by L. O'Gorman, "The document spectrum for bottom-up page layout analysis," Advances in structural and syntactic pattern recognition, Ed. H. Bunke, Singapore: World Scientific, 1992: 270-279. Background structure methods detect bands of white space (gutters) in the image and treat these as the boundaries of text blocks as described by H. S. Baird, "Background structure in document images," Advances in structural and syntactic pattern recognition, Ed. H. Bunke, Singapore: World Scientific, 1992: 253-269; T. Pavlidis and J. Zhou, "Page segmentation by white streams," Proc. 1st Int. Conf. On Document Recognition, Saint-Malo, 1991: 945-953; and A. Antonacopoulos and R. T. Ritchings, "Flexible page segmentation using the background," Proc. 12th Int. Conf. On Pattern Recognition, 1994: 339-344.
Top-down logical structure analysis processes do attempt to distinguish types of text, but do so using a priori layout models as taught, for example, by G. Nagy, S. Seth, and S. Stoddard, "Document analysis with an expert system," Pattern Recognition in Practice II, E. Gelsema and L. Kanal, editors, North Holland, Amsterdam, 1986: 149-159; and R. Ingold and D. Armangil, "A top-down document analysis method for logical structure recognition," Proc. 1st Int. Conf. On Document Recognition, Saint-Malo, 1991: 41-49.
J. Fisher, in "Logical structure descriptions of segmented document images," Proc. 1st Int. Conf. On Document Recognition, Saint-Malo, 1991: 302-310, describes a rule-based system that identifies geometrical and logical structures of document images. Location cues, format cues and textual cues (OCR) are employed to make identifications and transformations during the identification of text and non-text regions.
U.S. Pat. No. 5,239,596 to Mahoney, issued Aug. 24, 1993, incorporated herein by reference for its teachings, describes techniques for labeling pixels in an image based upon nearest neighbor attributes.
In accordance with the present invention, there is provided a method comprising: retrieving an input image, the image comprising an array of image signals and associated data defining a set of boundaries of a plurality of text-blocks represented therein, and storing the array of image signals in a bitmap array and the data defining the set of boundaries in a second array; partitioning the text-blocks defined by the set of boundaries stored in the second array into text groups; and classifying the text-groups to determine those text-groups which represent running text regions of the image and those which represent non-running text regions of the image.
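The partitioning and classifying steps recited above might be sketched, purely as an illustration, as follows. The grouping criterion (horizontal overlap plus a small vertical gap) and the classification rule (two or more blocks of nearly equal width) are assumptions chosen for this sketch and are not taken from the description; the names `Block`, `partition`, `is_running`, and `split_page` are likewise hypothetical. The bitmap array is omitted because the sketch operates only on the stored boundaries.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Block:
    """Rectangular text-block boundary (hypothetical field names), in pixels."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left


def partition(blocks: List[Block], max_gap: int = 20) -> List[List[Block]]:
    """Assumed criterion: join blocks that overlap horizontally and are
    separated vertically by at most `max_gap` pixels (union-find grouping)."""
    parent = list(range(len(blocks)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, a in enumerate(blocks):
        for j in range(i + 1, len(blocks)):
            b = blocks[j]
            overlaps = a.left < b.right and b.left < a.right
            v_gap = max(a.top, b.top) - min(a.bottom, b.bottom)
            if overlaps and v_gap <= max_gap:
                parent[find(i)] = find(j)

    groups: Dict[int, List[Block]] = {}
    for i, blk in enumerate(blocks):
        groups.setdefault(find(i), []).append(blk)
    return list(groups.values())


def is_running(group: List[Block], tol: float = 0.1) -> bool:
    """Assumed rule: running text is a group of 2+ blocks of near-equal width."""
    if len(group) < 2:
        return False
    widest = max(b.width for b in group)
    return all(b.width >= (1 - tol) * widest for b in group)


def split_page(blocks: List[Block]) -> Tuple[List[List[Block]], List[List[Block]]]:
    """Return (running groups, non-running groups) for one page."""
    groups = partition(blocks)
    running = [g for g in groups if is_running(g)]
    non_running = [g for g in groups if not is_running(g)]
    return running, non_running


page = [Block(10, 10, 210, 60), Block(10, 70, 210, 120),
        Block(10, 130, 205, 180), Block(300, 10, 360, 30)]
running, non_running = split_page(page)
print(len(running), len(non_running))
```

In this example the three stacked blocks of similar width form one group classified as running text, and the isolated narrow block forms a group classified as non-running text.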
In accordance with another aspect of the present invention, there is provided a method operating on a programmable computer for partitioning an image containing text into regions of running text and non-running text, the image consisting essentially of an array of image signals and associated data defining a set of boundaries of a plurality of text-blocks represented therein, said method comprising the steps of: retrieving an input image and storing image signals thereof in a first bitmap array memory location and the data defining the set of boundaries in a second memory location; partitioning the text-blocks defined by the boundaries stored in the second memory location into text groups; and classifying the text groups to determine those text groups which represent running text regions of the image and those which represent non-running text regions of the image.
In accordance with yet another aspect of the present invention, there is provided an apparatus, comprising: a first memory for storing image data; a second memory for storing data representing characteristics of an image, the bitmap data for said image being stored in said first memory array; instruction memory; a text processor, connected to said first and second memory and said instruction memory for accessing the data stored in the first and second memory in accordance with instructions stored in said instruction memory, the processor in executing the instructions: accessing the image data stored in the first memory location to produce text block boundaries representing text blocks in the image, the data defining the text block boundaries being stored in the second memory as image characteristic data; partitioning the text-blocks defined by the boundaries stored in the second memory location into text groups; and classifying the text-groups to determine those text-groups which represent running text regions of the image and those which represent non-running text regions of the image.
One aspect of the present invention deals with a basic problem in image recognition, namely that of applying structural norms in a top-down analysis. In top-down or logical structure analysis, a given page image is analyzed based upon functional interpretations of components of the page. In the more generalized geometric analysis techniques employed by the present invention, it is possible to extract a similar level of structure in a bottom-up approach. This aspect is further based on the discovery of a technique that alleviates this problem. The technique partitions a page image into two principal types of regions: running text and non-running text. The technique, using four separate analysis phases, detects text blocks, groups the text blocks, extracts text blocks that represent running text, and regroups regions of contiguous, non-running text blocks, with the output representing a list or similar representation of running and non-running regions of the page image. Such output may be further processed to identify tabular regions in the non-running text.
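The final phase, regrouping contiguous non-running text blocks, might be sketched as below. This is an illustration under stated assumptions rather than the claimed procedure: contiguity is assumed to mean that two rectangles lie within `gap` pixels of each other in both directions, and the names `near`, `union`, and `regroup` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical rectangle layout: (left, top, right, bottom) in pixels.
Rect = Tuple[int, int, int, int]


def near(a: Rect, b: Rect, gap: int) -> bool:
    """True when the rectangles touch or overlap after growing each by `gap`."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2] and
            a[1] - gap <= b[3] and b[1] - gap <= a[3])


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle enclosing both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def regroup(rects: List[Rect], gap: int = 15) -> List[Rect]:
    """Repeatedly merge any two nearby rectangles until no pair is near."""
    regions = list(rects)
    merged = True
    while merged:
        merged = False
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                if near(regions[i], regions[j], gap):
                    regions[i] = union(regions[i], regions[j])
                    del regions[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return regions


non_running_blocks = [(10, 10, 60, 30), (70, 10, 120, 30),
                      (10, 40, 60, 60), (400, 400, 450, 420)]
regions = regroup(non_running_blocks)
print(regions)
```

Restarting the scan after each merge keeps the sketch simple at the cost of efficiency; a production implementation would more likely use a spatial index or a union-find structure.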
The techniques employed in the present invention are advantageous to the analysis of a document (page image), because they provide important information about the layout structure of the page without the assumption of any specific page layout. Conventional top-down analysis techniques typically depend on such specific assumptions and, therefore, cannot be applied as generally. This invention has practical advantages in optical character recognition systems and may also be employed in document search and retrieval systems wherein a user might select a document category to focus the search. In the latter, the classification of a page image in accordance with the present invention would enable the document type to be accurately classified (e.g., memo, technical report, journal article, etc.). Accordingly, a wide variety of operations can be implemented using the output of these techniques.