The following relates generally to methods, apparatus and articles of manufacture therefor, for determining logical document structure, such as the reading or viewing order of a document.
While the use of electronically created and recorded documents is prevalent, many such electronic documents are in a form that does not permit them to be used other than for viewing or printing. Reasons for this restriction include, among others, the unavailability of the document in its native format (e.g., only a scanned original of a document or a lower-level representation exists), or the deprecation or disappearance of the document's original authoring environment (e.g., document editors that are no longer sold or that no longer operate on existing software platforms).
The recovery of document content (e.g., characters, words, etc.) and logical structure (e.g., viewing and reading order) forms the basis for effective document reuse beyond applications such as viewing and printing. However, when electronic documents are recovered by scanning a hardcopy representation or by recovering an electronic representation (e.g., PDF or PostScript representation), a loss of logical document structure usually results because the representation of the document is either at a very low level (e.g., bitmap) or an intermediate level (e.g., a document formatted in a page description language or a portable document format).
Electronic documents recorded in low-level or intermediate-level representations may lose certain high-level representations of the logical organization of their objects (e.g., those that permit editing of high-level constructs) because the documents have been optimized for a particular application, such as printing, display, or storage. For example, the order in which the objects forming a document are recorded in a print-oriented or storage-oriented file format may be optimized for printing or storage rather than follow the logical order of the objects in the document. In order to achieve certain print, storage, or display efficiencies, electronic documents recorded in optimized print, storage, or display formats may dispose of high-level constructs or group elements of a document together in an order that is out of the document's logical flow.
In contrast, hardcopy documents converted to an electronic form by scanning lose their document structure unless augmented with a high-level description (see for example U.S. Pat. No. 5,486,686, which is incorporated herein in its entirety by reference). Optical Character Recognition (OCR) may be used for recovering and recognizing objects in a document image to identify low-level representations (e.g., at the character or word level) or intermediate-level representations (e.g., formatting, paragraphs and object detection) of a document image. In addition, there exist methods for recovering certain aspects of a document's high-level representation to allow applications that rely on a document's logical structure, such as document editors and document readers, to operate on or automatically process its content.
Methods are known for determining a document's logical structure, or the order in which objects are laid out on a document image (i.e., “layout objects”). Such known methods exploit the geometric or typographical features of document image objects, with or without the use of the content of objects and a priori knowledge for a particular document class. Such known methods are described, for example, in the following publications, which are incorporated herein by reference: R. Cattoni, T. Coianiz, S. Messelodi, C. M. Modena, “Geometric Layout Analysis Techniques for Document Image Understanding: a Review”, ITC-IRST Technical Report #9703-09, 1998; Y. Ishitani, “Document Transformation System from Papers to XML Data Based on Pivot XML Document Method”, International Conference on Document Analysis and Recognition (ICDAR), 2003; G. Nagy and S. Seth, “Hierarchical representation of optically scanned documents”, Proceedings of the 7th International Conference on Pattern Recognition, pp. 347-349, 1984; Jaekyu Ha, R. M. Haralick, I. T. Phillips, “Recursive X-Y cut using bounding boxes of connected components”, International Conference on Document Analysis and Recognition (ICDAR), Vol. 2, 1995; and A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: a review”, ACM Computing Surveys, 31(3):264-323, 1999.
One such known method for segmenting layout objects of a document image is known as the XY-cut method (see Nagy and Seth cited above). Briefly, in one embodiment, the method consists of finding the widest cut, that is, the widest empty rectangle (or valley) that crosses the entire page (or block), either vertically or horizontally. The page is then segmented into blocks, which are sized to fit their content. The method is applied recursively to each block until no valleys remain. In one embodiment of the XY-cut method, bounding boxes of connected components of black pixels are relied on in place of image pixel data.
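The recursion described above may be sketched as follows. This is a minimal illustration operating on bounding boxes rather than pixels; it is not taken from the cited publications. The `Box` layout, the function names `widest_valley` and `xy_cut`, the choice to cut at the midpoint of a valley, and the tie-breaking between equally wide valleys are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Assumed box layout: (x0, y0, x1, y1), with y growing downward.
Box = Tuple[float, float, float, float]


def widest_valley(boxes: List[Box], axis: int) -> Optional[Tuple[float, float]]:
    """Return (width, midpoint) of the widest empty valley along `axis`.

    axis 0 projects onto x (a vertical cut); axis 1 projects onto y
    (a horizontal cut). Returns None when the projections leave no gap.
    """
    if len(boxes) < 2:
        return None
    spans = sorted((b[axis], b[axis + 2]) for b in boxes)
    best = None
    reach = spans[0][1]  # furthest extent covered so far
    for lo, hi in spans[1:]:
        if lo > reach:  # empty valley between `reach` and `lo`
            candidate = (lo - reach, (lo + reach) / 2)
            if best is None or candidate[0] > best[0]:
                best = candidate
        reach = max(reach, hi)
    return best


def xy_cut(boxes: List[Box]) -> List[List[Box]]:
    """Split at the widest valley, then recurse until no valleys remain."""
    candidates = []
    for axis in (0, 1):
        valley = widest_valley(boxes, axis)
        if valley is not None:
            candidates.append((valley[0], axis, valley[1]))
    if not candidates:
        return [boxes]
    _, axis, pos = max(candidates)  # widest valley; ties favor a horizontal cut
    first = [b for b in boxes if b[axis + 2] <= pos]
    second = [b for b in boxes if b[axis] >= pos]
    return xy_cut(first) + xy_cut(second)


# Hypothetical page with five layout objects.
A, B = (0, 0, 40, 30), (50, 0, 100, 30)
C, D, E = (0, 50, 30, 100), (40, 50, 100, 70), (40, 75, 100, 100)
blocks = xy_cut([A, B, C, D, E])  # -> [[A], [B], [C], [D], [E]]
```

Because each valley is computed only from the boxes inside the current block, every block is implicitly sized to fit its content, as the description above requires.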
FIG. 1 illustrates an example of page segmentation using the XY-cut method. In FIG. 1, the document image or page 104 has five layout objects (shown with cross-hatched fill). When the XY-cut method is performed, the first block or page 104, with Y-cut (or horizontal-cut) valley 106, is segmented into block 108 with X-cut (or vertical-cut) valley 110 and block 112 with X-cut (or vertical-cut) valley 114. The XY-cut method repeats until the layout objects on the page 104 are segmented into blocks 1 through 5, as shown in FIG. 1.
While the XY-cut strategy illustrated in FIG. 1, which cuts the widest empty rectangle at each recursion, works well for layout object segmentation on a page image, the strategy is less well suited to determining the reading order of layout objects on the page image. That reading order may be deduced from the cut hierarchy (e.g., for top-to-bottom and left-to-right reading order, with a vertical cut, the content on the left side of the cut comes before the content on the right side of the cut, and with a horizontal cut, the content on the top side of the cut comes before the content on the bottom side of the cut). For example, when employing the cutting strategy illustrated in FIG. 1 on a two-column document page to determine the correct reading order of layout objects on the document page, an error may occur if the page is cut horizontally before it is cut vertically along column separations.
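The two-column failure case may be illustrated with a short sketch. It derives a reading order from the cut hierarchy using the left-before-right and top-before-bottom rule stated above. The page geometry, the names `widest_valley` and `reading_order`, and the `columns_first` switch are hypothetical; the switch exists only to contrast the two cut orders and is not the method disclosed herein.

```python
from typing import Dict, List, Optional, Tuple

# Assumed box layout: (x0, y0, x1, y1), with y growing downward.
Box = Tuple[float, float, float, float]


def widest_valley(boxes: List[Box], axis: int) -> Optional[Tuple[float, float]]:
    """(width, midpoint) of the widest empty valley along `axis`, else None."""
    if len(boxes) < 2:
        return None
    spans = sorted((b[axis], b[axis + 2]) for b in boxes)
    best, reach = None, spans[0][1]
    for lo, hi in spans[1:]:
        if lo > reach and (best is None or lo - reach > best[0]):
            best = (lo - reach, (lo + reach) / 2)
        reach = max(reach, hi)
    return best


def reading_order(objs: Dict[str, Box], columns_first: bool) -> List[str]:
    """Order object names by recursively cutting and reading each side in turn.

    columns_first=False cuts the widest valley first (the FIG. 1 strategy);
    columns_first=True is an illustrative variant that always prefers a
    vertical cut when one exists.
    """
    valleys = {}
    for axis in (0, 1):  # 0: vertical cut, 1: horizontal cut
        valley = widest_valley(list(objs.values()), axis)
        if valley is not None:
            valleys[axis] = valley
    if not valleys:
        return list(objs)
    if columns_first and 0 in valleys:
        axis = 0
    else:
        axis = max(valleys, key=lambda a: valleys[a][0])
    pos = valleys[axis][1]
    before = {n: b for n, b in objs.items() if b[axis + 2] <= pos}  # left or top
    after = {n: b for n, b in objs.items() if b[axis] >= pos}  # right or bottom
    return reading_order(before, columns_first) + reading_order(after, columns_first)


# Hypothetical two-column page: the paragraph break (20 units) lines up across
# both columns and is wider than the column gutter (10 units).
page = {
    "L1": (0, 0, 45, 40), "L2": (0, 60, 45, 100),
    "R1": (55, 0, 100, 40), "R2": (55, 60, 100, 100),
}
widest_first = reading_order(page, columns_first=False)  # ['L1', 'R1', 'L2', 'R2']
column_first = reading_order(page, columns_first=True)   # ['L1', 'L2', 'R1', 'R2']
```

With the widest valley cut first, the horizontal cut wins and the order interleaves the two columns. Cutting along the column separation first yields the order in which the page would be read.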
There continues to exist, therefore, a need for an improved method for determining the logical ordering of layout objects on a document image, to properly order the content of the layout objects as it would be read by a person when the layout objects of a document image have no ordering (e.g., a scanned bitmap image) or have an incorrect ordering (e.g., are in an order optimized for printing, storing, or display). It would be advantageous if such a method were deterministic and efficient when processing a document image with numerous fine-grain layout objects that present multiple alternatives in which a document page may be cut along column or row separations.
In accordance with the disclosure herein, there is provided a method for ordering layout objects of a document to determine their logical or semantic (i.e., reading) order. The method is adapted to exploit the geometric features of a document image, thereby advantageously permitting the method to be applied to various classes of documents, such as documents expressed in various languages. The method may operate with layout objects of document images of various granularities, as the layout objects may contain one or more of letters, words, lines, or paragraphs. The layout objects may, for example, include combinations of textual content and image content.
In accordance with the various embodiments disclosed herein, there is provided a method, apparatus and article of manufacture therefor, for determining a logical order of a document, comprising: (a) assigning a page of the document to be a block having a width along a first direction (e.g., horizontal) and a length along a second direction (e.g., vertical) perpendicular to the first direction; the block having a plurality of layout objects arranged therein; (b) identifying a first set of hypothetical cuts, substantially between layout object boundaries, that span the width of the block; the first set of hypothetical cuts defining a set of sub-blocks with each sub-block having a width along the first direction and a length along the second direction; (c) identifying a second set of hypothetical cuts, substantially between layout object boundaries, that span the length of sub-blocks in the set of sub-blocks; (d) computing arrangement criteria of layout objects ordered according to the first and the second sets of hypothetical cuts; (e) modifying cuts in the first and second sets of hypothetical cuts, using the computed arrangement criteria, to merge cuts that span two or more sub-blocks along the second direction; (f) determining the logical order of the document using cuts between layout objects in the block remaining in the first and second sets of hypothetical cuts after performing (e).
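Steps (a) through (c) above may be sketched as follows, assuming the first direction is horizontal and the second is vertical. This is an illustration only: the names `all_gaps` and `hypothetical_cuts`, the `Box` layout, and the representation of a cut as an empty interval are assumptions. Steps (d) through (f) are not sketched because the arrangement criteria are not specified in this passage.

```python
from typing import List, Tuple

# Assumed box layout: (x0, y0, x1, y1), with y growing downward.
Box = Tuple[float, float, float, float]
Gap = Tuple[float, float]  # (start, end) of an empty interval between objects


def all_gaps(boxes: List[Box], axis: int) -> List[Gap]:
    """Every empty interval between the boxes' projections onto `axis`."""
    if not boxes:
        return []
    spans = sorted((b[axis], b[axis + 2]) for b in boxes)
    found, reach = [], spans[0][1]
    for lo, hi in spans[1:]:
        if lo > reach:
            found.append((reach, lo))
        reach = max(reach, hi)
    return found


def hypothetical_cuts(page: List[Box]):
    """Return (row_cuts, sub_blocks, column_cuts) for one page."""
    # (a) The whole page is taken as the initial block.
    # (b) First set: cuts spanning the block's width, i.e. gaps in the
    #     projection onto the vertical axis.
    row_cuts = all_gaps(page, axis=1)
    # The first set of cuts defines the sub-blocks: a box belongs to the
    # sub-block whose index equals the number of row cuts lying above it.
    sub_blocks: List[List[Box]] = [[] for _ in range(len(row_cuts) + 1)]
    for box in page:
        index = sum(1 for _, end in row_cuts if box[1] >= end)
        sub_blocks[index].append(box)
    # (c) Second set: cuts spanning the length of each sub-block, i.e. gaps
    #     in each sub-block's projection onto the horizontal axis.
    column_cuts = [all_gaps(sb, axis=0) for sb in sub_blocks]
    return row_cuts, sub_blocks, column_cuts


# Hypothetical two-column page with an aligned paragraph break.
L1, L2 = (0, 0, 45, 40), (0, 60, 45, 100)
R1, R2 = (55, 0, 100, 40), (55, 60, 100, 100)
row_cuts, sub_blocks, column_cuts = hypothetical_cuts([L1, L2, R1, R2])
# row_cuts    -> [(40, 60)]
# sub_blocks  -> [[L1, R1], [L2, R2]]
# column_cuts -> [[(45, 55)], [(45, 55)]]
```

In this example the same vertical gap appears in both sub-blocks. A cut of that kind, spanning two or more sub-blocks along the second direction, would seem to be the sort of candidate that step (e) considers for merging.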