A wide variety of applications may require processing of documents to perform contextual data interpretation. As will be appreciated, document processing may typically involve conversion of a paper or electronic document into electronic information (that is, data) that may be worked upon. Further, as will be appreciated, for any document processing technique, an important task may be extraction of the structure of the document. Knowledge of the logical structure of the document may help in accurate extraction of data. A logical layout structure may include the classification of the structural blocks of the document into headers, footers, title, paragraphs, section headers, footnotes, references, table of contents, and the like.
However, documents generally do not follow a standardized structure, thereby making extraction of data from the documents a challenging task. For example, portable document format (PDF) is one of the most common formats for documents today. Though the PDF format is optimized for presentation, it typically lacks structural information. Further, different application programming interfaces (APIs) and encoding techniques may be used to build a PDF document. Thus, when a program attempts to extract structural information, there are no standard tags or properties based on which the various sections or structural blocks of the document may be identified. In any large-scale application, the document processing techniques may have to cope with a large number of variations in the layout of the documents, which may further amplify the problem.
Existing techniques for identification of structural blocks within documents are limited in their scope, utility, and application. For example, one existing technique provides for segmenting document images into maximal homogeneous regions and identifying them as texts, images, tables, or ruling lines. Though this technique is useful for segmenting blocks of the document into images, texts, and tables, the actual structure of the text content may not be determined. In particular, the text content may not be differentiated into paragraphs, title, footnotes, and the like. Another existing technique provides for automatic identification of paragraphs in different languages and domains. However, this technique provides for identification of only paragraphs within the document. Yet another existing technique provides for a rule-based approach for understanding the structure of the document. The technique further provides for an exchangeable rule base adaptable to several domains. However, a rule-based approach may work well only when the layout of the documents is consistent, and may fail for a set of documents with large variations. Further, for an image document (for example, a scanned document) that has been converted to text using an optical character recognition (OCR) tool, many features such as font, spacing, and the like may be lost, thereby leading to a failure of rule-based techniques.
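By way of illustration only, the limitation of a rule-based approach noted above may be sketched as follows. The sketch is not taken from any of the existing techniques discussed; the `Block` type, the `classify` function, the assumed body font size, and all thresholds are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    """A hypothetical structural block extracted from one page."""
    text: str
    font_size: Optional[float]  # None when the feature was lost, e.g. after OCR
    y_ratio: float              # vertical position: 0.0 = top, 1.0 = bottom


# Assumed body-text font size; a real rule base would have to tune this
# per layout, which is exactly why such rules may not generalize.
BODY_FONT_SIZE = 10.0


def classify(block: Block) -> str:
    """Label a block using fixed, hand-written layout rules (illustrative)."""
    if block.y_ratio < 0.05:
        return "header"
    if block.y_ratio > 0.95:
        return "footer"
    if block.font_size is None:
        # Without font information, the font-based rules below cannot fire.
        return "paragraph"
    if block.font_size >= 1.8 * BODY_FONT_SIZE:
        return "title"
    if block.font_size >= 1.2 * BODY_FONT_SIZE:
        return "section_header"
    if block.font_size < 0.9 * BODY_FONT_SIZE:
        return "footnote"
    return "paragraph"


# The same block is labeled differently once its font size is unavailable.
with_font = classify(Block("1. Introduction", 14.0, 0.30))
without_font = classify(Block("1. Introduction", None, 0.30))
print(with_font, without_font)
```

In this sketch, a section header is recognized only while its font size is available; once that feature is missing, as may happen for a scanned document converted to text, the same block may be indistinguishable from an ordinary paragraph.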