A wide variety of applications may require processing of documents to perform contextual data interpretation. As will be appreciated, document processing may typically involve conversion of a paper or electronic document into electronic information (that is, data) that may be worked upon. Further, as will be appreciated, for any document processing technique, an important task may be extraction of the structure of the document. Knowledge of the logical structure of the document may help in accurate extraction of data. A logical layout structure may include the classification of the structural blocks of the document into headers, footers, title, paragraphs, section headers, footnotes, references, table of contents, and the like.
However, documents generally do not follow a standardized structure, thereby making extraction of data from the documents a challenging task. For example, portable document format (PDF) is one of the most common formats for documents today. Though the PDF format is optimized for presentation, it typically lacks structural information. Further, there may be different application programming interfaces (APIs) and encoding techniques used to build a PDF document. Thus, when a program attempts to extract structural information, there are no standard tags or properties based on which the various sections or structural blocks of the document may be identified. In any large-scale application, the document processing techniques may have to cope with a large number of variations in the layout of the documents, which may further amplify the problem.
Existing techniques for identification of structural blocks within documents are limited in their scope, utility, and application. For example, one of the existing techniques provides for structure extraction from a corpus of financial reports. The technique extracts headers in the document and then, using the extracted headers as bookmarks, extracts the narrative section under each heading. This technique may prove useful in extraction of the structure of a document pertaining to the financial domain but may not be applicable to a document pertaining to other domains. Another existing technique provides for extraction and classification of a document page layout structure by analyzing the spatial configuration of the bounding boxes of different entities on a given document image. The technique segments the document image into a list of homogeneous regions and classifies them into texts, images, tables, line-drawings, halftones, ruling lines, or noise. Though this technique is useful to segment blocks of a document as images, texts, or tables, the actual structure of the text content may not be determined. In particular, the text content may not be differentiated into paragraphs, title, footnotes, and the like. Further, this technique is limited in its application to image documents that are laid out in a structured manner, as it relies on spatial configuration. One of the other existing techniques provides for a template-based approach for extracting logical structures of a document. In particular, the technique provides for a framework for the specification of logical structures as templates and the extraction of their instances from rich text documents. However, the template-based approach may work well only when the layout (i.e., logical structure) of the document is consistent, but may fail for a set of documents with large variations. Further, a new template will have to be specified or introduced for any new document structure.
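By way of illustration only, the header-based technique described above (extracting headers and then using them as bookmarks to collect the narrative under each heading) may be sketched as follows. This is a minimal sketch under stated assumptions and not the implementation of any cited technique: the names `HEADER_RE`, `is_header`, and `split_by_headers` are hypothetical, and the header heuristic (a short line that is either numbered or entirely upper case) is assumed purely for the example.

```python
import re

# ASSUMPTION: a header is a short line that is either numbered
# ("2. Risk Factors") or written entirely in upper case ("ANNUAL REPORT").
# This heuristic is hypothetical and chosen only for illustration.
HEADER_RE = re.compile(r"^(\d+(\.\d+)*\.?\s+\S.*|[A-Z][A-Z &]+)$")


def is_header(line, max_len=60):
    """Return True if the line looks like a header under the assumed heuristic."""
    text = line.strip()
    return 0 < len(text) <= max_len and HEADER_RE.match(text) is not None


def split_by_headers(lines):
    """Use detected headers as bookmarks and collect the text under each one.

    Text that appears before the first header, and blank lines, are skipped.
    Returns a list of (header, narrative) pairs in document order.
    """
    sections = []
    current = None
    for line in lines:
        if is_header(line):
            current = (line.strip(), [])
            sections.append(current)
        elif current is not None and line.strip():
            current[1].append(line.strip())
    return [(header, " ".join(body)) for header, body in sections]


if __name__ == "__main__":
    sample = [
        "ANNUAL REPORT",
        "Revenue grew this year.",
        "Costs were stable.",
        "2. Risk Factors",
        "Currency exposure remains.",
    ]
    for header, narrative in split_by_headers(sample):
        print(header, "->", narrative)
```

The sketch depends entirely on the formatting conventions assumed for headers. A body line that happens to begin with a number, or a document whose headings use a different convention, would be segmented incorrectly, which is consistent with the observation above that such a technique may not carry over to documents from other domains.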