The present invention is generally directed to techniques for analyzing image data, and more particularly to the analysis of image data representing images containing text to classify the types of non-running text regions therein without the need for predefining structure within the image. The invention first employs the characteristics of running text regions to distinguish them from non-running text regions in a page image. As the input of the present invention, there is assumed to be a digital representation of a hardcopy document that contains text regions. In addition, the image will have been processed by any one of a number of well-known processes for segmenting text regions in an image, and identifying the boundaries of such regions. While it is understood that such boundaries may be represented by a polynomial, it will be assumed that such boundaries are rectangular in shape for purposes of simplifying the description of the instant invention.
This invention first presents a geometric, bottom-up method for partitioning a scanned page into two kinds of regions: (i) regions encompassing running text, i.e., text formatted in paragraphs and columns; and (ii) regions encompassing text formatted in other layout structures, such as headings, lists, and tables. This partitioning further supports the present invention in the classification of the non-running text regions so as to enable selective scanning for text recognition within non-running text regions. For example, table detection, crucial for establishing reading order, is done by testing the non-running text regions for alignment and similar relationships in order to classify the image.
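By way of illustration only, the alignment test mentioned above might be sketched as follows. The text does not specify the test at this point, so the `Box` type, the pixel tolerance `tol`, and the `min_cols` and `min_rows` thresholds are all hypothetical choices made for this example, and left-edge alignment is only one of the relationships that could be examined.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Rectangular text-block boundary in pixel coordinates (assumed layout)."""
    left: int
    top: int
    right: int
    bottom: int


def left_aligned_columns(boxes: List[Box], tol: int = 3) -> List[List[Box]]:
    """Group boxes whose left edge lies within `tol` pixels of the first box in the group."""
    columns: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b.left, b.top)):
        if columns and abs(box.left - columns[-1][0].left) <= tol:
            columns[-1].append(box)
        else:
            columns.append([box])
    return columns


def looks_tabular(boxes: List[Box], tol: int = 3,
                  min_cols: int = 2, min_rows: int = 2) -> bool:
    """Assumed heuristic: at least `min_cols` aligned columns of `min_rows` or more blocks."""
    full = [c for c in left_aligned_columns(boxes, tol) if len(c) >= min_rows]
    return len(full) >= min_cols
```

Under these assumptions, a two-by-two grid of blocks with nearly equal left edges in each column would be reported as tabular, whereas a single stack of full-width lines would not.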
Page layout analysis processes may be divided into two broad categories, geometric layout analysis and logical structure analysis (R. Haralick. Document image understanding: geometric and logical layout. Proc. IEEE Conf. On Computer Vision and Pattern Recognition, 1994: 385-390). Geometric layout analysis, also termed bottom-up analysis, extracts whatever structure can be inferred without reference to models of particular kinds of pages - e.g., letter, memo, title page, table, etc. Intuitively, this is the structure that would be apparent even to an illiterate person. Also, it is the structure common to pages of all kinds. Logical structure analysis classifies a given page within a repertoire of known layouts, and assigns functional interpretations to components of the page based on this classification. Geometric analysis is generally preliminary to logical structure analysis.
Bottom-up analysis schemes attempt to segment a page into homogeneous regions of text, line art (graphics), and photographs (halftone images), and then stop. This is normally taken to be the highest level of structure that can be established bottom-up. However, the present invention establishes that a yet higher level of structure can be extracted using a bottom-up analysis technique, and that this can benefit subsequent logical structure analysis.
Heretofore, a number of patents and publications have disclosed methods for segmenting images and the identification of structure therein, the relevant portions of which may be briefly summarized as follows:
Text block segmentation has been addressed by R. Haralick, "Document image understanding: geometric and logical layout," Proc. IEEE Conf. On Computer Vision and Pattern Recognition, 1994: 385-390. Component aggregation or clustering methods that assemble homogeneous regions from individual connected components subject to size similarity and proximity constraints were described by L. O'Gorman, "The document spectrum for bottom-up page layout analysis," Advances in structural and syntactic pattern recognition, Ed. H. Bunke, Singapore: World Scientific, 1992: 270-279. Background structure methods detect bands of white space (gutters) in the image and treat these as the boundaries of text blocks as described by H. S. Baird, "Background structure in document images," Advances in structural and syntactic pattern recognition, Ed. H. Bunke, Singapore: World Scientific, 1992: 253-269; T. Pavlidis and J. Zhou, "Page segmentation by white streams," Proc. 1st Int. Conf. On Document Recognition, Saint-Malo, 1991: 945-953; and A. Antonacopoulos and R. T. Ritchings, "Flexible page segmentation using the background," Proc. 12th Int. Conf. On Pattern Recognition, 1994: 339-344.
Top-down logical structure analysis processes do attempt to distinguish types of text, but do so using a priori layout models as taught, for example, by G. Nagy, S. Seth, and S. Stoddard, "Document analysis with an expert system," Pattern Recognition in Practice II, E. Gelsema and L. Kanal, editors, North Holland, Amsterdam, 1986: 149-159; and R. Ingold and D. Armangil, "A top-down document analysis method for logical structure recognition," Proc. 1st Int. Conf. On Document recognition, Saint-Malo, 1991: 41-49.
J. Fisher, in "Logical structure descriptions of segmented document images," Proc. 1st Int. Conf. On Document recognition, Saint-Malo, 1991: 302-310, describes a rule-based system that identifies geometrical and logical structures of document images. Location cues, format cues and textual cues (OCR) are employed to make identifications and transformations during the identification of text and non-text regions.
Specific analysis of tabular formatted text is described by S. Chandran and R. Kasturi, "Structural recognition of tabulated data," Proc. 2nd Int. Conf. On Document Recognition, 1993: 516-519; H. Kojima and T. Akiyama, "Table recognition for automated document entry system," Proc. SPIE Vol. 1384 High-Speed Inspection Architectures, Barcoding, and Character Recognition, 1994: 285-292; and M. A. Rahgozar, Z. Fan, and E. V. Rainero, "Tabular document recognition," Proc. SPIE Vol. 2181 Document Recognition, 1994: 87-96.
U.S. Pat. No. 5,239,596 to Mahoney, issued Aug. 24, 1993, incorporated herein by reference for its teachings, describes techniques for labeling pixels in an image based upon nearest neighbor attributes.
In accordance with the present invention, there is provided a method comprising the steps of: retrieving an input image, the image comprising an array of image signals and associated data defining a set of boundaries of a plurality of text-blocks represented therein, and storing the array of image signals in a bitmap array and the data defining the set of boundaries in a second array; partitioning the text-blocks defined by the set of boundaries stored in the second array into text groups; classifying the text-groups to determine those text-groups which represent running text regions of the image and those which represent non-running text regions of the image; regrouping at least one non-running text region of the image based upon locations of the text blocks within the non-running text region; and further classifying a non-running text region as to the extent to which such a text region is tabularized.
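The order of the steps recited above may be illustrated by the following minimal sketch. Only the sequence of stages (partition, classify, regroup, further classify) is taken from the text; the function name `analyze_page`, the tuple form of a boundary, and the four callables are hypothetical placeholders, since the criteria applied at each stage are not specified at this point in the description.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed representation: one rectangular boundary per text block.
Box = Tuple[int, int, int, int]  # (left, top, right, bottom)
Group = List[Box]


def analyze_page(
    boxes: Sequence[Box],
    partition: Callable[[Sequence[Box]], List[Group]],
    is_running: Callable[[Group], bool],
    regroup: Callable[[Group], List[Group]],
    tabular_score: Callable[[Group], float],
) -> Dict[str, list]:
    """Run the four stages in order; each criterion is supplied by the caller."""
    # Stage 1: partition the text blocks into text groups.
    groups = partition(boxes)

    # Stage 2: classify each group as running or non-running text.
    running = [g for g in groups if is_running(g)]
    non_running = [g for g in groups if not is_running(g)]

    # Stage 3: regroup each non-running region from the locations of its blocks.
    regrouped = [sub for g in non_running for sub in regroup(g)]

    # Stage 4: record the extent to which each regrouped region is tabularized.
    scored = [(g, tabular_score(g)) for g in regrouped]

    return {"running": running, "non_running": scored}
```

Passing the criteria in as callables keeps the sketch neutral about how grouping, classification, and tabularization are actually measured; any concrete rule substituted for them would be an assumption of the example rather than a statement of the method.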
In accordance with another aspect of the present invention, there is provided a method operating on a programmable computer for partitioning an image containing text into regions of running text and non-running text, the image consisting essentially of an array of image signals and associated data defining a set of boundaries of a plurality of text-blocks represented therein, said method comprising the steps of: retrieving an input image and storing image signals thereof in a first bitmap array memory location and the data defining the set of boundaries in a second memory location; partitioning the text blocks, using data defining the boundaries stored in the second memory location, into text groups; classifying the text groups to determine those text groups which represent running text regions of the image and those which represent non-running text regions of the image; regrouping at least one non-running text region of the image based upon locations of the text blocks within said non-running text region; and further classifying at least one text group representing a non-running text region as to the extent to which the text group is tabularized.
In accordance with yet another aspect of the present invention, there is provided an apparatus, comprising a first memory for storing image data; a second memory for storing data representing characteristics of an image, the bitmap data for said image being stored in said first memory array; instruction memory; a text processor, connected to said first and second memory and said instruction memory for accessing the data stored in the first and second memory in accordance with instructions stored in said instruction memory, the processor in executing the instructions:
accessing the image data stored in the first memory location to produce text block boundaries representing text blocks in the image, the data defining the text block boundaries being stored in the second memory as image characteristic data; partitioning the text-blocks defined by the boundaries stored in the second memory location into text groups;
classifying the text groups to determine those text groups which represent running text regions of the image and those which represent non-running text regions of the image; regrouping at least one non-running text region of the image based upon locations of the text blocks within said non-running text region; and further classifying, in response to instructions stored in said instruction memory, at least one text group representing a non-running text region as to the extent to which the non-running text group is tabularized.
One aspect of the present invention deals with a basic problem in image recognition--that of applying structural norms in a top-down analysis. In top-down or logical structure analysis, a given page image is analyzed based upon functional interpretations of components of the page. In the more generalized geometric analysis techniques employed by the present invention, it is possible to extract a similar level of structure in a bottom-up approach. This aspect is further based on the discovery of a technique that alleviates this problem by partitioning a page image into two principal types of regions: running text and non-running text. Once the non-running text regions of the document are identified, they may be further analyzed in accordance with the present invention to identify tabular regions in the non-running text.
The techniques employed in practicing the present invention are advantageous because they not only avoid problems with conventional top-down or logical structure analysis methods, but they allow the further characterization of non-running text regions identified within an image. Thus, the present invention supports format analysis and selective scanning for text recognition. For example, the detection of tables in a scanned page is crucial for establishing proper reading order during optical character recognition. Moreover, such information is necessary to subsequent manipulation of the tabularized information.