A wide variety of applications may require processing of documents to perform contextual data interpretation. As will be appreciated, document processing typically involves conversion of a paper or electronic document into electronic information (that is, data) that may be worked upon. Further, as will be appreciated, in many digital documents (for example, financial reports, product documents, scientific articles, or the like), the data may be presented in tabular structures (rows and columns) for ease of presentation and interpretation. For example, such tabular structures may allow an author of a document to present information in a structured manner so as to summarize and communicate key results. Further, such tabular structures may enable readers of the document to get a quick overview of the presented information and to compare it with other similar information in a specific context. Additionally, tabular formats are increasingly used by analysts for data mining, information retrieval, trend analysis, and other tasks. It is, therefore, necessary to detect and extract such tabular data from the document for further processing, such as contextual data interpretation.
However, such detection and extraction of tabular data from a document may be challenging due to a large variability in tabular structure layouts, tabular structure styles, and the type and format of information in the tabular structure, and further due to a lack of standard document formats. For example, while data is presented in tabular formats, the heights of rows and the widths of columns may differ, cells may have been merged (that is, each row may not have the same number of columns or each column may not have the same number of rows), the borders of the table and the lines separating the cells may differ, the cells may be distinguished by various colors and background patterns rather than by lines, the table may include nested tables with multiple table headers, tables may be arranged in a hierarchical order, and so forth.
Existing techniques provide for tabular data detection and extraction using partitioning, clustering of words inside tables, boundary detection, sets of pre-developed rules, scoring techniques, annotation, and so forth. However, existing techniques for identification and extraction of tabular data, and for other such related tasks, are limited in their effectiveness, robustness, and efficiency due to their inability to deal with the wide variation in the formats and structures of tabular data.