The exemplary embodiment relates to document processing and finds particular application in connection with a system and a method for extracting regular geometric structures in a document page.
While the use of electronically created and recorded documents is prevalent, many such electronic documents are in a form that does not permit them to be used other than for viewing or printing. To provide greater accessibility to the content of such documents, it is desirable to understand their logical structure. However, when electronic documents are recovered by scanning a hardcopy representation or by recovering an electronic representation (e.g., PDF or PostScript representation), a loss of logical document structure usually results because the representation of the document is either at a very low level (e.g., bitmap) or an intermediate level (e.g., a document formatted in a page description language or a portable document format).
Geometric (or physical) page layout analysis (GPLA) can be used to recognize the different elements of a page, often in terms of text regions and image regions. Methods are also known for determining a document's logical structure, or the order in which objects (i.e., layout objects) are laid out on a document image. Such methods exploit the geometric or typographical features of document image objects, sometimes using the content of the objects and a priori knowledge of page layout for a particular document class. Examples of GPLA algorithms include the X-Y Cut algorithm, described by Nagy, et al. (A prototype document image analysis system for technical journals. Computer, 25(7):10-22, 1992) and the Smearing algorithm, described by Wong, et al. (Document analysis system. IBM Journal of Research and Development, 26(6):647-656, 1982). These GPLA algorithms receive a page image as input and perform a segmentation based on information (such as pixel information) gathered from the page. These approaches to element recognition are either top-down or bottom-up and mainly aim to delimit boxes of text or images in a page.
While such methods have been useful for segmenting pages one-dimensionally, into columns, identifying geometric structures that are two-dimensional in nature, such as tables, has proved more difficult. Many approaches to this problem search for graphical lines which delimit the table content (see, for example, Zanibbi, et al., A Survey of Table Recognition: Models, Observations, Transformations, and Inferences. School of Computing, Queen's University, Kingston, Ontario, Canada, Report K7L 3N6, Oct. 24, 2002). However, many tables do not include such clear pointers to their structure.
Additionally, other two-dimensional regular geometric structures may be encountered which lack features of conventional tables that would otherwise facilitate detection.
The exemplary system and method address these problems, and others, by facilitating automated detection of regular geometric structures.