The exemplary embodiment relates to a system and method for the unsupervised generation of page templates applicable to a given document or set of documents.
A document is a collection of one or more pages, where each page can be considered to contain zero or more page elements (such as page headers, footers, images, etc.). The arrangement of page elements within a single page defines a page layout.
Geometric page layout analysis (GPLA) is often the first step in Document Analysis and Recognition. GPLA algorithms recognize different elements of a page, often in terms of text blocks and image blocks. Examples of such algorithms include the X-Y Cut algorithm, described by Nagy et al. (A prototype document image analysis system for technical journals. Computer, 25(7):10-22, 1992) and the Smearing algorithm, described by Wong et al. (Document analysis system. IBM Journal of Research and Development, 26(6):647-656, 1982). These GPLA algorithms receive a page image as input and segment it based on information (such as pixel information) gathered from the page. These approaches to element recognition are either top-down or bottom-up and mainly aim to delimit boxes of text or images in a page. Some methods, such as the X-Y Cut algorithm, can also generate hierarchical relations among the recognized blocks.
The typical output of the GPLA algorithms is a page layout that specifies the geometry of the maximal homogeneous regions contained within the page and the spatial relationships between the homogeneous regions. A region is homogeneous if its entire area is of one type, such as text, images, etc. However, the page layouts produced by the GPLA algorithms are specific to a single page in a document rather than applicable across multiple pages.
Many applications use page templates to segment or categorize a page. In these situations, the page templates are provided as prior knowledge to the application and are not inferred or generated from the document itself. The page templates may be used to categorize pages conforming to the page templates (such as business letters vs. tax forms) or to label the page elements, where the labeling is performed by matching the page elements against a known page template. However, all of these applications require a priori knowledge about the structure and composition of the page templates in order to detect and label page elements. For example, many of the applications require user-provided page templates. This requirement for a priori page template knowledge can be problematic, since it may be difficult for a user to generate the templates. Also, many of the GPLA algorithms require manual annotation of at least some of the page elements within a given document. Therefore, it would be useful to have a method that allows for the automatic generation of formal page templates from the pages of a document without a priori knowledge of any page template composition.
Additionally, it would be advantageous to have a method of describing the composition of a page template based on the geometric relationships between the labeled elements of the document pages.