The following relates generally to methods, apparatus and articles of manufacture for determining logical document structure, such as the reading or viewing order of a document.
While the use of electronically created and recorded documents is prevalent, many such electronic documents are in a form that does not permit them to be used other than for viewing or printing. Reasons for this restriction include, among others, the unavailability of the document in its native format (e.g., only a scanned original of a document or a lower-level representation exists), or the deprecation or disappearance of the document's original authoring environment (e.g., document editors that are no longer available or which are inoperable on existing software platforms).
The recovery of document content (e.g., characters, words, etc.) and logical structure (e.g., viewing and reading order) thus forms the basis for effective document reuse. However, when electronic documents are recovered by scanning a hardcopy representation or by recovering an electronic representation (e.g., PDF or PostScript representation), a loss of logical document structure usually results because the representation of the document is either at a very low level (e.g., bitmap) or an intermediate level (e.g., a document formatted in a page description language or a portable document format).
Geometric (or physical) page layout analysis can be used to recognize the different elements of a page, often in terms of text regions and image regions. Methods are known for determining a document's logical structure, or the order in which objects are laid out on a document image (i.e., layout objects). Such methods exploit the geometric or typographical features of document image objects, sometimes using the content of objects and a priori knowledge of page layout for a particular document class. One particular problem arises in the context of documents whose pages are arranged in columns. It would be desirable to identify the column structure of a page so that the textual content can be extracted in the correct order for reading.
One method for segmenting layout objects of a document image where columns may be present is known as the XY-cut method (see G. Nagy, S. Seth, and M. Viswanathan, “A prototype document image analysis system for technical journals,” Computer 25(7):10-22 (1992)). This method involves finding the widest cut or the widest empty rectangle (or valley) that crosses the entire page (or block), either vertically or horizontally. The page is then segmented into blocks, which are sized to fit their content. Other methods are described in U.S. Pat. No. 5,784,487 to Cooperman and U.S. Pat. No. 7,392,473 to Meunier (hereinafter, Meunier), incorporated herein by reference; and in the following references: Roger C. Parker, The Aldus Guide to Basic Design, Aldus Corporation (1988); H. S. Baird, “Background structure in document images,” in H. Bunke, P. Wang, and H. S. Baird, Eds., Document Image Analysis, pages 17-34, World Scientific, Singapore (1994); L. O'Gorman, “The document spectrum for page layout analysis,” IEEE Trans. on Pattern Analysis and Machine Intelligence 15(11):1162-1173 (1993); K. Kise, et al., “Segmentation of page images using the area Voronoi diagram,” Computer Vision and Image Understanding 70(3):370-382 (1998); and Faisal Shafait, et al., “Structural Mixtures for Statistical Layout Analysis,” Proc. 8th Intl. Workshop on Document Analysis Systems (2008). In general, these methods take a page as input and segment its content into homogeneous regions (text or image). Approaches are either top-down, as in the XY-cut method, or bottom-up, as in Kise, et al., and O'Gorman. Some methods, such as that of Nagy, et al., can generate hierarchical relations among the generated blocks. Meunier describes a generate-and-test approach related to the XY-cut method of Nagy, et al.
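By way of illustration only, the top-down XY-cut idea described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Nagy, et al. or of any cited reference: layout objects are assumed to be given as axis-aligned bounding boxes, and the names widest_gap, xy_cut, and min_gap are hypothetical. The sketch repeatedly finds the widest empty valley crossing the current block, vertically or horizontally, and splits the block there until no valley is at least min_gap wide.

```python
from typing import List, Optional, Tuple

# Assumed representation: (x0, y0, x1, y1), with y increasing down the page.
Box = Tuple[float, float, float, float]


def widest_gap(intervals: List[Tuple[float, float]]) -> Optional[Tuple[float, float]]:
    """Return (width, midpoint) of the widest empty gap between the
    projected intervals, or None if the projections leave no gap."""
    ordered = sorted(intervals)
    best = None
    reach = ordered[0][1]  # furthest coordinate covered so far
    for start, end in ordered[1:]:
        if start > reach:
            width = start - reach
            if best is None or width > best[0]:
                best = (width, (start + reach) / 2.0)
        reach = max(reach, end)
    return best


def xy_cut(boxes: List[Box], min_gap: float) -> List[List[Box]]:
    """Recursively split boxes at the widest empty valley (hypothetical
    helper). min_gap is an assumed threshold: narrower valleys are ignored."""
    if not boxes:
        return []
    vertical = widest_gap([(b[0], b[2]) for b in boxes])    # gap in x: vertical cut
    horizontal = widest_gap([(b[1], b[3]) for b in boxes])  # gap in y: horizontal cut
    candidates = []
    if vertical and vertical[0] >= min_gap:
        candidates.append((vertical[0], 0, vertical[1]))
    if horizontal and horizontal[0] >= min_gap:
        candidates.append((horizontal[0], 1, horizontal[1]))
    if not candidates:
        # No admissible cut: keep the block whole, ordered top-to-bottom.
        return [sorted(boxes, key=lambda b: (b[1], b[0]))]
    _, axis, pos = max(candidates)  # choose the widest valley
    first = [b for b in boxes if b[axis + 2] <= pos]   # left of / above the cut
    second = [b for b in boxes if b[axis] >= pos]      # right of / below the cut
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)


# Two columns of two text lines each; the gutter is 10 units wide.
page = [(0, 0, 40, 10), (0, 12, 40, 22), (50, 0, 90, 10), (50, 12, 90, 22)]
columns = xy_cut(page, min_gap=5)   # gutter exceeds threshold: two blocks
merged = xy_cut(page, min_gap=12)   # gutter below threshold: one block
```

In this example, a threshold of 5 separates the page into its two columns, whereas a threshold of 12 leaves all four lines in a single block in which lines from the two columns are interleaved.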
These methods, however, often fail to segment a page correctly because they rely on an automatically computed threshold to define a column gutter (the strip of white space between two columns). The threshold on the gutter width is usually based on the inter-word space. The applied threshold can prevent recognition of columns separated by narrower gutters.
There remains a need for a method for segmenting pages into columns which copes with a variety of page layouts.