The exemplary embodiments relate to document processing and find particular application in connection with a method and system for generation of document structures associated with a document page.
While the use of electronically created and recorded documents is prevalent, many such electronic documents are in a form that does not permit them to be used other than for viewing or printing. To provide greater accessibility to the content of such documents, it is desirable to understand their structure. However, when electronic documents are recovered by scanning a hardcopy representation or by recovering an electronic representation (e.g., a PDF or PostScript representation), document structure is usually lost because the representation of the document is either at a very low level (e.g., a bitmap) or at an intermediate level (e.g., a document formatted in a page description language or a portable document format).
Geometric (or physical) page layout analysis can be used to recognize the different elements of a page, often in terms of text regions and image regions. Methods are known for determining a document's logical structure, or the order in which objects (i.e., layout objects) are laid out on a document image. Such methods exploit the geometric or typographical features of document image objects, sometimes also using the content of the objects and a priori knowledge of the page layout for a particular document class. Examples of geometric page layout analysis (GPLA) algorithms include the X-Y Cut algorithm, described by Nagy, et al. (A prototype document image analysis system for technical journals. Computer, 25(7):10-22, 1992) and the Smearing algorithm, described by Wong, et al. (Document analysis system. IBM Journal of Research and Development, 26(6):647-656, 1982). These GPLA algorithms receive a page image as input and perform a segmentation based on information, such as pixel information, gathered from the page. These approaches to element recognition are either top-down or bottom-up and mainly aim to delimit boxes of text or images in a page. These methods are useful for segmenting pages one-dimensionally, into columns.
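By way of illustration only, a top-down segmentation in the spirit of a recursive X-Y cut may be sketched as follows. This is a simplified sketch and not the implementation of Nagy, et al.: the page is assumed to be a binary grid (1 = ink, 0 = background), and the names `widest_gap`, `xy_cut`, `segment` and the `min_gap` threshold are hypothetical choices made for this example.

```python
def widest_gap(profile, min_gap):
    """Return (start, end) of the widest interior run of zeros, or None.

    Only runs enclosed by ink on both sides and at least min_gap long count.
    """
    best, start = None, None
    for i, value in enumerate(profile):
        if value == 0:
            if start is None:
                start = i
        else:
            if start is not None and start > 0 and i - start >= min_gap:
                if best is None or i - start > best[1] - best[0]:
                    best = (start, i)
            start = None
    return best


def xy_cut(page, top, bottom, left, right, min_gap):
    """Recursively split a region at its widest blank band.

    Returns a list of (top, left, bottom, right) boxes, end-exclusive.
    """
    # Projection profiles: ink count per row and per column of the region.
    rows = [sum(page[r][left:right]) for r in range(top, bottom)]
    cols = [sum(page[r][c] for r in range(top, bottom))
            for c in range(left, right)]
    if not any(rows):
        return []  # empty region, nothing to report
    # Shrink the region to the bounding box of its ink.
    while rows[0] == 0:
        rows.pop(0)
        top += 1
    while rows[-1] == 0:
        rows.pop()
        bottom -= 1
    while cols[0] == 0:
        cols.pop(0)
        left += 1
    while cols[-1] == 0:
        cols.pop()
        right -= 1
    row_gap = widest_gap(rows, min_gap)
    col_gap = widest_gap(cols, min_gap)
    if row_gap is None and col_gap is None:
        return [(top, left, bottom, right)]  # cannot be split further
    # Cut along the wider of the two gaps (horizontal cut wins ties).
    if col_gap is None or (row_gap is not None and
                           row_gap[1] - row_gap[0] >= col_gap[1] - col_gap[0]):
        s, e = row_gap
        return (xy_cut(page, top, top + s, left, right, min_gap) +
                xy_cut(page, top + e, bottom, left, right, min_gap))
    s, e = col_gap
    return (xy_cut(page, top, bottom, left, left + s, min_gap) +
            xy_cut(page, top, bottom, left + e, right, min_gap))


def segment(page, min_gap=2):
    """Segment a whole binary page into blocks."""
    return xy_cut(page, 0, len(page), 0, len(page[0]), min_gap)
```

Calling `segment(page)` on a grid containing two columns separated by blank pixel columns yields one box per column, and a column containing blank rows is split further into one box per paragraph. The output is a flat list of boxes, with no logical relationship recorded between them.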
Provided here is a method that structures a sequentially-ordered set of elements, each element being characterized by a set of features. N-grams (sequences of n features) are computed from each run of n contiguous elements, and those n-grams which are repetitive (Kleene cross, i.e., occurring one or more times in succession) are selected. Elements matching the most frequent repetitive n-gram are grouped together under a new node, and a new sequence is created in which the grouped elements are replaced by that node. The method is then applied iteratively to this new sequence. The output is an ordered set of trees.
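By way of illustration only, the iterative grouping may be sketched as follows. The sketch rests on several simplifying assumptions that are not part of the description above: each element carries a single feature label rather than a set of features, an n-gram is treated as repetitive when it occurs at least twice back to back, only such back-to-back runs are grouped, and ties in frequency are broken in favor of the longer n-gram. The names `Node`, `repeated_ngrams`, `group_runs`, `structure` and the `max_n` bound are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # single feature label (assumption)
    children: list = field(default_factory=list)


def repeated_ngrams(labels, max_n):
    """Count occurrences of n-grams that repeat back to back."""
    counts = Counter()
    for n in range(1, max_n + 1):
        i = 0
        while i + 2 * n <= len(labels):
            gram = tuple(labels[i:i + n])
            reps = 1
            while tuple(labels[i + reps * n:i + (reps + 1) * n]) == gram:
                reps += 1
            if reps >= 2:
                counts[gram] += reps
                i += reps * n       # skip past the whole run
            else:
                i += 1
    return counts


def group_runs(nodes, gram):
    """Replace each run of two or more matches of gram with one new node."""
    n = len(gram)
    labels = [node.label for node in nodes]
    out, i = [], 0
    while i < len(nodes):
        reps = 0
        while tuple(labels[i + reps * n:i + (reps + 1) * n]) == gram:
            reps += 1
        if reps >= 2:
            new_label = "(" + " ".join(gram) + ")+"
            out.append(Node(new_label, nodes[i:i + reps * n]))
            i += reps * n
        else:
            out.append(nodes[i])
            i += 1
    return out


def structure(nodes, max_n=3):
    """Iterate until no repetitive n-gram remains; return ordered trees."""
    while True:
        counts = repeated_ngrams([node.label for node in nodes], max_n)
        if not counts:
            return nodes
        # Most frequent repetitive n-gram; longer n-gram wins ties.
        best = max(counts, key=lambda g: (counts[g], len(g)))
        nodes = group_runs(nodes, best)
```

For example, `structure([Node(c) for c in "ababcababc"])` first groups each run of `a b` under a node labeled `(a b)+`, giving the new sequence `(a b)+ c (a b)+ c`, and then groups that sequence under a single root labeled `((a b)+ c)+`. Each pass replaces at least one run by a single node, so the sequence strictly shortens and the iteration terminates.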